nanoporetech / minknow_api

Protobuf and gRPC specifications for the MinKNOW API

MinKNOW run report read length estimates are way off for longer reads #67

Open katlavigneflug opened 3 weeks ago

katlavigneflug commented 3 weeks ago

I recently did a run with the following configuration and output:

[Screenshots (2024-06-13) of the run configuration and the MinKNOW run report]

I base called and demultiplexed this data using the following script run on a Linux-based cluster:

```
module load cuda/12.2.1
../dorado-0.5.3-linux-x64/bin/dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v4.3.0 pod5s/ --modified-bases 5mCG_5hmCG --kit-name SQK-NBD114-24 > calls.bam
../dorado-0.5.3-linux-x64/bin/dorado demux --output-dir demultiplexed --no-classify calls.bam
```

However, in the output file, I noticed that the distribution of read lengths is very different from the estimate provided by the MinKNOW UI:

[Histogram comparing the dorado read length distribution with the MinKNOW estimate]

I first assumed this was a problem with the basecalling itself. I confirmed that the discrepancy was not due to a failure to basecall reads: the number of reads in the dorado output (1175) matched the number in the MinKNOW summary table (1176) minus one that was filtered (I haven't had the chance to figure out why, but I confirmed that this read was not particularly long).

I read on the dorado GitHub that other users had noticed read length discrepancies due to splitting of concatemers and/or reads containing internal adapters. This manifested as new read IDs appearing in the dorado output file that were not present in the MinKNOW pod5 file. I confirmed that this wasn't the case for me: all read IDs in the dorado output file are present in the pod5 file and vice versa, minus the single read that was filtered.
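For anyone wanting to repeat this cross-check, it boils down to a set comparison once the read IDs have been pulled out of each file. A minimal Python sketch (the IDs below are toys, and the extraction step in the comment is just one possible route):

```python
# Assumes read IDs were extracted beforehand, e.g. the BAM side via
#   samtools view calls.bam | cut -f1 | sort -u > bam_ids.txt
# and the pod5 side via the pod5 tooling of your choice.
def compare_read_ids(bam_ids, pod5_ids):
    """Return IDs unique to each source. BAM-only IDs would indicate
    read splitting (new IDs minted by dorado); pod5-only IDs indicate
    reads dropped or filtered before/after basecalling."""
    bam, pod5 = set(bam_ids), set(pod5_ids)
    return bam - pod5, pod5 - bam

# Toy data: one read filtered out, no split reads.
only_bam, only_pod5 = compare_read_ids(
    ["id-1", "id-2", "id-3"],
    ["id-1", "id-2", "id-3", "id-4"],
)
print(sorted(only_bam))   # IDs unique to the dorado output
print(sorted(only_pod5))  # IDs unique to the pod5 file
```

In my case the first set was empty and the second contained only the single filtered read.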

Many resolved issues on the dorado GitHub mentioned problems with demultiplexing and barcode and/or adapter trimming that have been addressed in the most recent release (v0.7.1). So, to cover my bases, I reran my basecalling script with this dorado version instead. There were slight differences in summary statistics, as expected, but no major difference in read length distribution that could explain the discrepancy with MinKNOW.

Finally, I attempted a read-by-read comparison of lengths from the MinKNOW UI and the dorado output. Based on my quick search, it appears that MinKNOW's method for calculating read length (correct me if I'm wrong) in the absence of real-time basecalling is to multiply read duration by the nominal translocation speed (400 bp/s in my case). I performed this calculation to get a per-read length estimate from the MinKNOW summary table and plotted it against the read length determined by dorado, grouping reads by end reason.
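Concretely, the estimate is just duration × speed. A small sketch of the calculation (the 400 b/s figure is the kit's nominal translocation speed as described above; the durations and called lengths below are invented for illustration):

```python
NOMINAL_SPEED_BPS = 400  # nominal translocation speed for this kit/chemistry

def estimated_length(duration_s: float, speed_bps: float = NOMINAL_SPEED_BPS) -> int:
    """MinKNOW-style length estimate when live basecalling is off:
    read duration in seconds multiplied by the nominal translocation speed."""
    return round(duration_s * speed_bps)

# Toy comparison against basecalled lengths (values made up):
reads = [
    # (duration_s, basecalled_length)
    (10.0, 3_950),    # short read: estimate stays close to y = x
    (250.0, 55_000),  # long read: estimate drifts far above y = x
]
for duration, called in reads:
    est = estimated_length(duration)
    print(f"estimated={est:>7}  called={called:>7}  ratio={est / called:.2f}")
```

If the effective translocation speed slows over the course of a long read, this fixed-speed estimate will systematically overestimate length, which would match the pattern in the plot below.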

[Scatter plot of MinKNOW length estimates vs dorado read lengths, grouped by end reason]

Strikingly, agreement with the line y = x degrades severely beyond roughly 14 kb. Given the estimation method this is perhaps not entirely unexpected, but I am surprised by the size of the discrepancy (8.97 Mb total estimated by MinKNOW versus only 2.74 Mb called by dorado).

I am wondering if anyone else has encountered this problem and if there are any known workarounds to get more accurate estimates on the MinKNOW UI side. We were super excited to see a read almost 1 Mb long until the analysis showed it was actually under 200 kb. Hoping not to get our hopes up again :(

MarkusRainerSchmidt commented 2 weeks ago

I think the same thing is happening to us. You saved me a lot of work figuring out where these discrepancies come from. Thanks for looking at this in so much detail!

katlavigneflug commented 2 weeks ago

Glad to hear we're not the only ones, and happy to share our insights! I was surprised that I couldn't find any mention of this elsewhere, but I'm assuming that's because most folks do use real time base calling, which then enables tracking of actual translocation speed. Hopefully this can get resolved!

0x55555555 commented 2 weeks ago

Hi @katlavigneflug ,

Yes, you're right about MinKNOW's behaviour when basecalling is not enabled. I'll raise this with our internal development teams and discuss.

Thanks,