nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

0.3.x generates reads 10% longer than 0.4.x and 0.5.x??? #702

Closed ymcki closed 2 weeks ago

ymcki commented 6 months ago

Issue Report

Please describe the issue:

I am running a benchmark to compare the performance of 0.3.x, 0.4.x, and 0.5.x with both the 4.2 and 4.3 sup models. Since 0.3.4 doesn't support the 4.3 model, I didn't run that combination. The test data I used is s3://ont-open-data/giab_2023.05/flowcells/hg002/20230429_1600_2E_PAO83395_124388f5/ — it was chosen because its throughput is quite close to what I routinely get at my workplace. The command I ran is like:

```
./dorado basecaller -r --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v2 dna_r10.4.1_e8.2_400bps_sup@v4.2.0 ./ont-open-data/giab_2023.05/hg002/20230429_1600_2E_PAO83395_124388f5/ > HG002.ubam
```

To measure base accuracy, I aligned all the reads to the HG002v1.0.1 diploid genome, which is claimed to be Q75.

Please provide a clear and concise description of the issue you are seeing and the result you expect.

Observations and questions:

  1. Surprisingly, 0.3.4 generates 10% longer reads than 0.4.3 and 0.5.3, and 3x more bases in >=50K reads. I think this is more important than higher base accuracy, especially for assembly and SV calling. Why do the reads get shorter with newer dorado? Based on my experience with ONT data, a small percentage of reads are concatenations of the two strands of the same molecule. Do 0.4.3 and 0.5.3 have a way to combine or cut these so that the reads are shorter, and the longer 0.3.4 reads are actually useless?
  2. The 4.3 model performs better than the 4.2 model in both 0.4.3 and 0.5.3. Most notably, for qs<10 reads the error rate drops from 17% to 10%. Perhaps there is room to lower the qs cutoff?
  3. Interestingly, dorado 0.4.3 performs slightly better than 0.5.3 in accuracy. What happened?
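For context on the qs cutoff discussed in point 2: a Phred quality score Q corresponds to an expected error probability of 10^(−Q/10), so the qs >= 10 cutoff marks a 10% expected per-base error rate. A minimal sketch (the helper name is mine, not from dorado):

```python
def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to its expected error probability."""
    return 10 ** (-q / 10)

# qs >= 10 corresponds to a 10% expected error rate -- roughly where the
# observed mismatch rate of the v4.3 model's qs < 10 reads now sits.
print(phred_to_error(10))  # 0.1
print(phred_to_error(20))  # 0.01
```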

Here are the numbers:

| Metric | 0.3.4 sup4.2 | 0.4.3 sup4.2 | 0.4.3 sup4.3 | 0.5.3 sup4.2 | 0.5.3 sup4.3 |
|---|---|---|---|---|---|
| Throughput (bases) | 136,171,813,898 | 136,163,677,918 | 136,911,563,222 | 136,056,361,146 | 136,829,736,921 |
| Reads | 6,098,647 | 6,838,511 | 6,756,725 | 6,797,810 | 6,792,511 |
| Avg read length | 22,328.20 | 19,911.30 | 20,263.01 | 20,014.73 | 20,144.21 |
| Longest read | 1,235,205 | 1,235,107 | 1,169,268 | 1,235,562 | 1,169,627 |
| N50 | 31,497 | 28,494 | 28,795 | 28,587 | 28,691 |
| L50 | 1,480,764 | 1,894,854 | 1,842,836 | 1,870,156 | 1,862,148 |
| >=50K bases | 31,356,856,556 | 9,152,175,262 | 12,326,674,738 | 10,350,709,902 | 11,288,671,398 |
| >=100K bases | 567,342,104 | 218,324,013 | 259,793,618 | 239,468,193 | 253,731,275 |
| >=900K reads (%) | 0.000033% | 0.000015% | 0.000030% | 0.000015% | 0.000029% |
| >=500K reads (%) | 0.000426% | 0.000351% | 0.000237% | 0.000382% | 0.000236% |
| >=400K reads (%) | 0.000771% | 0.000600% | 0.000459% | 0.000662% | 0.000456% |
| >=300K reads (%) | 0.001427% | 0.001126% | 0.000814% | 0.001206% | 0.000810% |
| >=200K reads (%) | 0.003329% | 0.002691% | 0.002398% | 0.002795% | 0.002459% |
| >=100K reads (%) | 0.073574% | 0.021715% | 0.028224% | 0.024302% | 0.027192% |
| Mean reported base quality | 35.55073319 | 35.55122382 | 35.70586218 | 35.56471732 | 35.70625785 |
| Aligned (qs>=10) | 5,652,805 | 6,378,039 | 6,136,224 | 6,338,151 | 6,170,386 |
| Unmapped (qs>=10) | 9,951 | 11,207 | 10,923 | 11,266 | 10,961 |
| Aligned (qs<10) | 374,496 | 384,302 | 544,008 | 383,422 | 543,983 |
| Unmapped (qs<10) | 61,397 | 64,963 | 65,570 | 64,971 | 67,181 |
| Bases aligned (qs>=10) | 128,077,764,974 | 128,021,541,239 | 127,811,026,955 | 127,910,628,000 | 127,739,503,107 |
| %mismatch (qs>=10) | 1.1271% | 1.2032% | 1.0342% | 1.2016% | 1.1257% |
| %mismatch (qs<10) | 17.3097% | 17.5856% | 10.3937% | 17.5576% | 10.4031% |

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

Logs

susie-ont commented 6 months ago

Hi @ymcki,

What is happening here is most likely that you are seeing the effect of read splitting, which was enabled in Dorado v0.4.0. In Dorado v0.3.4 and earlier, reads were not split - this means that occasionally two simplex reads would be called as one read (biologically, this happens because two reads follow each other very quickly through a pore).

So, v0.4.0 and above are likely giving you the correct N50, L50 etc. This is why the number of bases aligned is very similar but the L50 and N50 are different.
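To illustrate why read splitting lowers N50 and raises L50 while the total base count stays constant, here is a minimal sketch of the N50/L50 computation (my own helper, not dorado code):

```python
def n50_l50(lengths):
    """N50: the length L such that reads of length >= L contain at least
    half of all bases. L50: how many of the longest reads that takes."""
    half = sum(lengths) / 2
    acc = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        acc += length
        if acc >= half:
            return length, count

# One concatenated 40 kb read vs. the same bases split into two 20 kb reads:
print(n50_l50([40_000, 10_000]))          # (40000, 1)
print(n50_l50([20_000, 20_000, 10_000]))  # (20000, 2)
```

Splitting the concatenated read leaves the 50 kb total untouched but moves N50 from 40,000 down to 20,000, matching the pattern in the table above.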

ymcki commented 6 months ago

> Hi @ymcki,
>
> What is happening here is most likely that you are seeing the effect of read splitting, which was enabled in Dorado v0.4.0. In Dorado v0.3.4 and earlier, reads were not split - this means that occasionally two simplex reads would be called as one read (biologically, this happens because two reads follow each other very quickly through a pore).
>
> So, v0.4.0 and above are likely giving you the correct N50, L50 etc. This is why the number of bases aligned is very similar but the L50 and N50 are different.

Thanks for your detailed explanation. I think that explains most of it.

Do you know if there is special handling for reads that are concatenations of the two strands of the same molecule? I still see them in the 0.5.x data. So is this being handled but missing a small number of them, or not handled at all for now?

A reply in this thread says the 4.3 model is not optimized for 0.4.x: https://github.com/nanoporetech/dorado/issues/691. Yet I am getting slightly better numbers for 0.4.x than 0.5.x. What's going on? Is it possible to tune 0.5.x to make it perform closer to 0.4.x?

davidnewman02 commented 6 months ago

Hi @ymcki

There was one further functional change between Dorado v0.4.3 and v0.5.3, which was enabling adapter trimming (the full change list is available here: https://github.com/nanoporetech/dorado/blob/release-v0.5.3/CHANGELOG.md ).

In your results there are slightly more alignments in Dorado v0.5.3, which may be due to some additional reads crossing the alignment-score threshold now that the (unalignable) adapter sequence is excluded during alignment. The reads which sit very close to the alignment classification threshold are likely to be lower accuracy, which I believe is what is affecting your mismatch rate.

I would expect that taking results only from the reads which align with both v0.4.3 and v0.5.3 will show very similar results. Alternatively, Dorado provides a --no-trim parameter which disables this behaviour, although we do not expect this to be necessary for the majority of workflows.
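The comparison suggested above — restricting to reads that align under both versions — can be sketched as follows, assuming the read names from each BAM have first been dumped to plain text files (e.g. with `samtools view ... | cut -f1`); the helper name is mine:

```python
def common_read_names(path_a: str, path_b: str) -> set:
    """Return the read names present in both files (one name per line),
    i.e. the reads that aligned under both basecaller versions."""
    with open(path_a) as fa:
        names_a = {line.strip() for line in fa if line.strip()}
    with open(path_b) as fb:
        names_b = {line.strip() for line in fb if line.strip()}
    return names_a & names_b
```

Mismatch rates computed over this intersection should factor out the reads that only cross the alignment threshold after adapter trimming.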

NB: although the version names for the models (v4.3.0) and Dorado versions (v0.4.3/v0.5.3) are similar, this is by coincidence and there are no specific optimisations to match similarly named model/ basecaller versions.

ymcki commented 6 months ago

As a continuation of my benchmarking work, I looked into the alignments of the longest reads. Typically, I found that the so-called "longest" reads (i.e. >200Kbp) are either a concatenation of many reads (they cannot be structural variants, since I am aligning HG002 reads to its own T2T genome) or a concatenation of the forward and reverse strands of the same read. To measure the "true" read lengths, I only consider primary and supplementary alignments with flags 0, 16, 2048, or 2064, and use pysam's CIGAR parser to calculate the "true" read length as the sum of the lengths of the 'M' and 'I' operations.
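The length calculation described above can be sketched without pysam by parsing the CIGAR string directly (this standalone helper is mine; the poster used pysam's parsed CIGAR tuples instead):

```python
import re

# SAM flags kept: 0/16 = primary (fwd/rev), 2048/2064 = supplementary (fwd/rev)
KEEP_FLAGS = {0, 16, 2048, 2064}

def true_read_length(cigar: str) -> int:
    """Sum the lengths of M (alignment match) and I (insertion) operations,
    so clipped bases (S/H) and deletions (D) do not count toward length."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in ("M", "I"))

print(true_read_length("10S100M5I20D30M4H"))  # 135 (100 + 5 + 30)
```

Because soft/hard clips are excluded, the concatenated portion of a chimeric read contributes only through its own supplementary alignment record, which is why each supplementary alignment is counted as a separate read below.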

Then I get this table for the "true" read lengths and their NA50.

| Metric | 0.3.4 sup4.2 | 0.4.3 sup4.2 | 0.4.3 sup4.3 | 0.5.3 sup4.2 | 0.5.3 sup4.3 |
|---|---|---|---|---|---|
| Aligned (qs>=10) | 5,652,805 | 6,378,039 | 6,136,224 | 6,338,151 | 6,170,386 |
| Unmapped (qs>=10) | 9,951 | 11,207 | 10,923 | 11,266 | 10,961 |
| Pri/supp aligned bases (qs>=10) | 127,634,045,864 | 127,537,722,176 | 127,366,784,651 | 127,544,077,062 | 127,405,621,809 |
| Pri/supp reads (qs>=10) | 6,668,152 | 6,654,021 | 6,505,974 | 6,654,067 | 6,505,957 |
| Avg length (qs>=10) | 19,140.84 | 19,167.02 | 19,576.90 | 19,167.84 | 19,582.92 |
| Longest aligned read (qs>=10) | 122,807 | 112,117 | 122,886 | 112,110 | 112,153 |
| NA50 (qs>=10) | 27,839 | 27,838 | 27,906 | 27,840 | 27,914 |
| LA50 (qs>=10) | 1,927,543 | 1,926,461 | 1,920,442 | 1,926,442 | 1,920,626 |
| Aligned (qs<10) | 374,496 | 384,302 | 544,008 | 383,422 | 543,983 |
| Unmapped (qs<10) | 61,397 | 64,963 | 65,570 | 64,971 | 67,181 |
| Pri/supp aligned bases (qs<10) | 7,437,748,683 | 7,498,638,603 | 7,958,129,983 | 7,503,655,339 | 7,961,015,204 |
| Pri/supp reads (qs<10) | 460,428 | 470,660 | 632,768 | 470,556 | 633,305 |
| Avg length (qs<10) | 16,153.99 | 15,932.18 | 12,576.69 | 15,946.36 | 12,570.59 |
| Longest aligned read (qs<10) | 93,575 | 93,523 | 94,235 | 93,587 | 94,171 |
| NA50 (qs<10) | 26,202 | 26,169 | 25,019 | 26,176 | 25,020 |
| LA50 (qs<10) | 118,454 | 119,521 | 130,134 | 119,587 | 130,162 |

Since I consider each supplementary alignment as a separate read, the total number of reads is larger than before. We can see that the number of reads jumped significantly for 0.3.4, most likely due to the lack of read splitting. On the other hand, the 4.3.0 model shows a bigger jump than the 4.2.0 model, so perhaps read splitting for the 4.3.0 model is not as well optimized as for the 4.2.0 model?

Interestingly, the average "true" read length is ~2.2% longer for the 4.3.0 model than the 4.2.0 model for reads with qs>=10. However, NA50 is about the same for all five cases. On the other hand, for qs<10, the 4.3.0 models are shorter than the 4.2.0 models in both average "true" read length and NA50.

So the conclusion of all these exercises is that the 4.3.0 model is better than the 4.2.0 model. However, dorado 0.4.3 seems to have a slight edge over 0.5.3.