nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

0.3.x generates reads 10% longer than 0.4.x and 0.5.x??? #702

Closed ymcki closed 2 weeks ago

ymcki commented 6 months ago

Issue Report

Please describe the issue:

I am running a benchmark to compare the performance of 0.3.x, 0.4.x, and 0.5.x with both the 4.2 and 4.3 sup models. Since 0.3.4 doesn't support the 4.3 model, I didn't run that combination. The test data I used is s3://ont-open-data/giab_2023.05/flowcells/hg002/20230429_1600_2E_PAO83395_124388f5/ — it was chosen because its throughput is quite close to what I routinely get at my workplace. The command I ran is like:

```
./dorado basecaller -r --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v2 dna_r10.4.1_e8.2_400bps_sup@v4.2.0 ./ont-open-data/giab_2023.05/hg002/20230429_1600_2E_PAO83395_124388f5/ > HG002.ubam
```

To measure base accuracy, I aligned all the reads to the HG002v1.0.1 diploid genome, which is claimed to be Q75.

Please provide a clear and concise description of the issue you are seeing and the result you expect.

Observations and questions:

  1. Surprisingly, 0.3.4 generates 10% longer reads than 0.4.3 and 0.5.3, and 3x more bases in >=50K reads. I think this is more important than higher base accuracy, especially for assembly and SV calling. Why do the reads get shorter with newer dorado? Based on my experience with ONT data, a small percentage of reads are concatenations of the two strands of the same molecule. Do 0.4.3 and 0.5.3 have a way to combine or cut these so that the reads are shorter, and the longer 0.3.4 reads are actually useless?
  2. The 4.3 model performs better than the 4.2 model in both 0.4.3 and 0.5.3. Most notably, for qs<10 reads the error rate drops from 17% to 10%. Perhaps there is room to lower the qs cutoff?
  3. Interestingly, dorado 0.4.3 performs slightly better than 0.5.3 in accuracy. What happened?
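For context on the qs cutoff discussed in point 2: a Phred quality score Q corresponds to an expected error probability of 10^(−Q/10), so the qs >= 10 cutoff marks a 10% expected per-base error rate. A minimal sketch (the helper name is mine, not from dorado):

```python
def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to its expected error probability."""
    return 10 ** (-q / 10)

# qs >= 10 corresponds to a 10% expected error rate -- roughly where the
# observed mismatch rate of the v4.3 model's qs < 10 reads now sits.
print(phred_to_error(10))  # 0.1
print(phred_to_error(20))  # 0.01
```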

Here are the numbers:

| Metric | 0.3.4 sup4.2 | 0.4.3 sup4.2 | 0.4.3 sup4.3 | 0.5.3 sup4.2 | 0.5.3 sup4.3 |
|---|---|---|---|---|---|
| Throughput (bases) | 136,171,813,898 | 136,163,677,918 | 136,911,563,222 | 136,056,361,146 | 136,829,736,921 |
| Reads | 6,098,647 | 6,838,511 | 6,756,725 | 6,797,810 | 6,792,511 |
| Avg read length | 22,328.20 | 19,911.30 | 20,263.01 | 20,014.73 | 20,144.21 |
| Longest read | 1,235,205 | 1,235,107 | 1,169,268 | 1,235,562 | 1,169,627 |
| N50 | 31,497 | 28,494 | 28,795 | 28,587 | 28,691 |
| L50 | 1,480,764 | 1,894,854 | 1,842,836 | 1,870,156 | 1,862,148 |
| >=50K bases | 31,356,856,556 | 9,152,175,262 | 12,326,674,738 | 10,350,709,902 | 11,288,671,398 |
| >=100K bases | 567,342,104 | 218,324,013 | 259,793,618 | 239,468,193 | 253,731,275 |
| >=900K reads (%) | 0.000033% | 0.000015% | 0.000030% | 0.000015% | 0.000029% |
| >=500K reads (%) | 0.000426% | 0.000351% | 0.000237% | 0.000382% | 0.000236% |
| >=400K reads (%) | 0.000771% | 0.000600% | 0.000459% | 0.000662% | 0.000456% |
| >=300K reads (%) | 0.001427% | 0.001126% | 0.000814% | 0.001206% | 0.000810% |
| >=200K reads (%) | 0.003329% | 0.002691% | 0.002398% | 0.002795% | 0.002459% |
| >=100K reads (%) | 0.073574% | 0.021715% | 0.028224% | 0.024302% | 0.027192% |
| Mean reported base quality | 35.55073319 | 35.55122382 | 35.70586218 | 35.56471732 | 35.70625785 |
| Aligned (qs>=10) | 5,652,805 | 6,378,039 | 6,136,224 | 6,338,151 | 6,170,386 |
| Unmapped (qs>=10) | 9,951 | 11,207 | 10,923 | 11,266 | 10,961 |
| Aligned (qs<10) | 374,496 | 384,302 | 544,008 | 383,422 | 543,983 |
| Unmapped (qs<10) | 61,397 | 64,963 | 65,570 | 64,971 | 67,181 |
| Bases aligned (qs>=10) | 128,077,764,974 | 128,021,541,239 | 127,811,026,955 | 127,910,628,000 | 127,739,503,107 |
| %mismatch (qs>=10) | 1.1271% | 1.2032% | 1.0342% | 1.2016% | 1.1257% |
| %mismatch (qs<10) | 17.3097% | 17.5856% | 10.3937% | 17.5576% | 10.4031% |

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

Logs

susie-ont commented 6 months ago

Hi @ymcki,

What is happening here is most likely that you are seeing the effect of read splitting, which was enabled in Dorado v0.4.0. In Dorado v0.3.4 and earlier, reads were not split - this means that occasionally two simplex reads would be called as one read (biologically, this happens because two reads follow each other very quickly through a pore).

So, v0.4.0 and above are likely giving you the correct N50, L50 etc. This is why the number of bases aligned is very similar but the L50 and N50 are different.
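To illustrate why read splitting lowers N50 and raises L50 while the total base count stays constant, here is a minimal sketch of the N50/L50 computation (my own helper, not dorado code):

```python
def n50_l50(lengths):
    """N50: the length L such that reads of length >= L contain at least
    half of all bases. L50: how many of the longest reads that takes."""
    half = sum(lengths) / 2
    acc = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        acc += length
        if acc >= half:
            return length, count

# One concatenated 40 kb read vs. the same bases split into two 20 kb reads:
print(n50_l50([40_000, 10_000]))          # (40000, 1)
print(n50_l50([20_000, 20_000, 10_000]))  # (20000, 2)
```

Splitting the concatenated read leaves the 50 kb total untouched but moves N50 from 40,000 down to 20,000, matching the pattern in the table above.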

ymcki commented 6 months ago

> Hi @ymcki,
>
> What is happening here is most likely that you are seeing the effect of read splitting, which was enabled in Dorado v0.4.0. In Dorado v0.3.4 and earlier, reads were not split - this means that occasionally two simplex reads would be called as one read (biologically, this happens because two reads follow each other very quickly through a pore).
>
> So, v0.4.0 and above are likely giving you the correct N50, L50 etc. This is why the number of bases aligned is very similar but the L50 and N50 are different.

Thanks for your detailed explanation. I think that explains most of it.

Do you know if there is special handling for reads that are concatenations of the two strands of the same molecule? I still see them in the 0.5.x data. So is this being handled but missing a small number of them, or not handled at all for now?

A reply in this thread says the 4.3 model is not optimized for 0.4.x: https://github.com/nanoporetech/dorado/issues/691. Yet I am getting slightly better numbers for 0.4.x than 0.5.x. What's going on? Is it possible to tune 0.5.x to make it perform closer to 0.4.x?

davidnewman02 commented 6 months ago

Hi @ymcki

There was one further functional change between Dorado v0.4.3 and v0.5.3, which was enabling adapter trimming (the full change list is available here: https://github.com/nanoporetech/dorado/blob/release-v0.5.3/CHANGELOG.md ).

In your results there are slightly more alignments in Dorado v0.5.3, which may be due to some additional reads crossing the alignment-score threshold now that the (unalignable) adapter sequence is excluded during alignment. The reads which sit very close to the alignment classification threshold are likely to be lower accuracy, which I believe is what is affecting your mismatch rate.

I would expect that taking results only from the reads which align with both v0.4.3 and v0.5.3 will show very similar results. Alternatively, Dorado provides a --no-trim parameter which disables this behaviour, although we do not expect this to be necessary for the majority of workflows.
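The comparison suggested above — restricting to reads that align under both versions — can be sketched as follows, assuming the read names from each BAM have first been dumped to plain text files (e.g. with `samtools view ... | cut -f1`); the helper name is mine:

```python
def common_read_names(path_a: str, path_b: str) -> set:
    """Return the read names present in both files (one name per line),
    i.e. the reads that aligned under both basecaller versions."""
    with open(path_a) as fa:
        names_a = {line.strip() for line in fa if line.strip()}
    with open(path_b) as fb:
        names_b = {line.strip() for line in fb if line.strip()}
    return names_a & names_b
```

Mismatch rates computed over this intersection should factor out the reads that only cross the alignment threshold after adapter trimming.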

NB: although the version names for the models (v4.3.0) and Dorado versions (v0.4.3/v0.5.3) are similar, this is by coincidence and there are no specific optimisations to match similarly named model/ basecaller versions.

ymcki commented 6 months ago

As a continuation of my benchmarking work, I looked into the alignments of the longest reads. Typically, I found that the so-called "longest" reads (i.e. >200Kbp) are either a concatenation of many reads (they cannot be structural variants, since I am aligning HG002 reads to its own T2T genome) or a concatenation of the forward and reverse strands of the same read. To measure the "true" read lengths, I only consider primary and supplementary alignments with flags 0, 16, 2048, or 2064, and use pysam's CIGAR parser to calculate the "true" read length as the sum of the lengths of the 'M' and 'I' operations.
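The length calculation described above can be sketched without pysam by parsing the CIGAR string directly (this standalone helper is mine; the poster used pysam's parsed CIGAR tuples instead):

```python
import re

# SAM flags kept: 0/16 = primary (fwd/rev), 2048/2064 = supplementary (fwd/rev)
KEEP_FLAGS = {0, 16, 2048, 2064}

def true_read_length(cigar: str) -> int:
    """Sum the lengths of M (alignment match) and I (insertion) operations,
    so clipped bases (S/H) and deletions (D) do not count toward length."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in ("M", "I"))

print(true_read_length("10S100M5I20D30M4H"))  # 135 (100 + 5 + 30)
```

Because soft/hard clips are excluded, the concatenated portion of a chimeric read contributes only through its own supplementary alignment record, which is why each supplementary alignment is counted as a separate read below.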

Then I get this table for the "true" read lengths and their NA50.

| Metric | 0.3.4 sup4.2 | 0.4.3 sup4.2 | 0.4.3 sup4.3 | 0.5.3 sup4.2 | 0.5.3 sup4.3 |
|---|---|---|---|---|---|
| Aligned (qs>=10) | 5,652,805 | 6,378,039 | 6,136,224 | 6,338,151 | 6,170,386 |
| Unmapped (qs>=10) | 9,951 | 11,207 | 10,923 | 11,266 | 10,961 |
| Pri/supp aligned bases (qs>=10) | 127,634,045,864 | 127,537,722,176 | 127,366,784,651 | 127,544,077,062 | 127,405,621,809 |
| Pri/supp reads (qs>=10) | 6,668,152 | 6,654,021 | 6,505,974 | 6,654,067 | 6,505,957 |
| Avg length (qs>=10) | 19,140.84 | 19,167.02 | 19,576.90 | 19,167.84 | 19,582.92 |
| Longest aligned read (qs>=10) | 122,807 | 112,117 | 122,886 | 112,110 | 112,153 |
| NA50 (qs>=10) | 27,839 | 27,838 | 27,906 | 27,840 | 27,914 |
| LA50 (qs>=10) | 1,927,543 | 1,926,461 | 1,920,442 | 1,926,442 | 1,920,626 |
| Aligned (qs<10) | 374,496 | 384,302 | 544,008 | 383,422 | 543,983 |
| Unmapped (qs<10) | 61,397 | 64,963 | 65,570 | 64,971 | 67,181 |
| Pri/supp aligned bases (qs<10) | 7,437,748,683 | 7,498,638,603 | 7,958,129,983 | 7,503,655,339 | 7,961,015,204 |
| Pri/supp reads (qs<10) | 460,428 | 470,660 | 632,768 | 470,556 | 633,305 |
| Avg length (qs<10) | 16,153.99 | 15,932.18 | 12,576.69 | 15,946.36 | 12,570.59 |
| Longest aligned read (qs<10) | 93,575 | 93,523 | 94,235 | 93,587 | 94,171 |
| NA50 (qs<10) | 26,202 | 26,169 | 25,019 | 26,176 | 25,020 |
| LA50 (qs<10) | 118,454 | 119,521 | 130,134 | 119,587 | 130,162 |

Since I consider each supplementary alignment as a separate read, the total number of reads is larger than before. We can see that the number of reads jumped significantly for 0.3.4, most likely due to the lack of read splitting. On the other hand, the 4.3.0 model shows a bigger jump than the 4.2.0 model, so perhaps read splitting for the 4.3.0 model is not as well optimized as for the 4.2.0 model?

Interestingly, the average "true" read length is ~2.2% longer for the 4.3.0 model than the 4.2.0 model for reads with qs>=10. However, NA50 is about the same for all five cases. On the other hand, for qs<10, the 4.3.0 models are shorter than the 4.2.0 models in both average "true" read length and NA50.

So the conclusion of all these exercises is that the 4.3.0 model is better than the 4.2.0 model. However, dorado 0.4.3 seems to have a slight edge over 0.5.3.