nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Megabase reads in v5 models #905

Open diego-rt opened 1 week ago

diego-rt commented 1 week ago

Hi again,

Since upgrading to the v5 models, I have been getting far more megabase-size reads in ULK114 data than with the v4 models, although most of them have low quality (Q5-Q10). To put numbers on it: with v4.3 I was getting 199 reads above 1 Mbp, while with v5 I have 2433 reads above that length across several flowcells. That said, only 18 of them have Q values above 10.
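For anyone who wants to reproduce this kind of tally, here is a minimal sketch that counts reads above a length threshold and, among those, the reads above a mean-Q threshold. It assumes a tab-separated per-read summary table; the column names `sequence_length_template` and `mean_qscore_template` are an assumption about that table and can be overridden, and the function name `count_long_reads` is made up for illustration.

```python
import csv
import io


def count_long_reads(tsv_text, min_length=1_000_000, min_q=10.0,
                     length_col="sequence_length_template",  # assumed column name
                     q_col="mean_qscore_template"):          # assumed column name
    """Return (reads longer than min_length, of those the reads with mean Q above min_q)."""
    long_reads = 0
    long_high_q = 0
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if int(row[length_col]) > min_length:
            long_reads += 1
            if float(row[q_col]) > min_q:
                long_high_q += 1
    return long_reads, long_high_q


# Tiny made-up table, only to show the input shape.
example = (
    "read_id\tsequence_length_template\tmean_qscore_template\n"
    "r1\t1500000\t7.2\n"
    "r2\t1200000\t12.5\n"
    "r3\t40000\t18.0\n"
)
print(count_long_reads(example))
```

Running the same function on the summary tables from both basecalling runs gives the two pairs of counts to compare.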

Interestingly, samples using the new motor protein also seem to have considerably more megabase-plus reads.

I was wondering whether you guys have looked into this and have any explanation?

I always assumed that these low-quality megareads were molecules that had gotten stuck in the pore and produced artifactual signal for a very long time. Could it be instead that they are "normal" reads that the basecaller had so far not been able to decode? Perhaps the signal is somewhat different from, or slower than, that of shorter reads because of the added bulk?

mprous1 commented 4 days ago

I have also noticed that with the v5 basecalling model my datasets consistently get a longer N50 than with previous models. It does not seem to be specific to very long reads; N50 has usually been one or a few kb for me. I would actually have expected a smaller N50 with a more accurate model, if it recognizes chimeric reads better. So I do wonder whether this increased N50 is an improvement or not.

susie-ont commented 3 days ago

Hi @diego-rt & @mprous1 - thanks for raising this. Which version(s) of Dorado were you using? If you were basecalling with different versions of Dorado as well as different model versions, the difference may be caused by changes in trimming or read splitting between Dorado versions.

diego-rt commented 3 days ago

Hi, this was v5@sup with dorado 0.7.1 and v4.3@sup with dorado 0.6.0

mprous1 commented 1 day ago

The trimming and splitting options should be exactly the same. Comparing dorado 0.5 (I don't remember exactly whether it was 0.5.1 or 0.5.2...) and 0.7.0 with the same commands (trimming and read splitting enabled by default), 0.7 produces a longer N50 for both simplex-only and duplex-only reads. I had a closer look at one sample. The difference is small, e.g. 5913 kb vs 5974 for simplex-only reads. There are fewer simplex-only reads but more duplex reads with dorado 0.7, which makes sense. Despite the lower number of simplex-only reads, the total number of bp in these reads is higher (about 0.5%) with 0.7. Based on this, it seems to me that dorado 0.7 is perhaps estimating repetitive regions better, rather than producing more chimeric reads. So the slight increase in N50 may be a good sign after all.
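The comparison above rests on three numbers per run: read count, total bases, and N50. As a minimal sketch (not tied to any Dorado output format; the function names `n50` and `summarize` and the toy read lengths are made up for illustration), they can be computed from a plain list of read lengths like this:

```python
def n50(lengths):
    """Largest length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Read count, total bases, and N50 for one set of read lengths."""
    return {"reads": len(lengths), "total_bases": sum(lengths), "n50": n50(lengths)}


# Toy, made-up lengths: the second set has fewer reads but more bases and a higher N50.
run_a = [10, 10, 10, 10]
run_b = [10, 10, 21]
print(summarize(run_a))
print(summarize(run_b))
```

The toy input shows the same pattern as described: fewer reads with a slightly higher total and a higher N50 is what you would see if some reads simply got longer, for example through less splitting or trimming. It does not by itself say whether the longer reads are correct or chimeric.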