nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

dorado correct loses/discards mitochondrial reads (sometimes) #996

Open JWDebler opened 2 months ago

JWDebler commented 2 months ago

Hi,

we recently sequenced a batch of 19 fungal isolates. I tried to assemble the genomes a few different ways (simplex + duplex or simplex corrected only) in order to figure out if there is still a need to do the longer duplex calling pipeline.

Corrected simplex reads turned out to give good assemblies for most samples. However, 7 of the 19 ended up with no mitochondrial reads after correction with dorado.

Mapping the uncorrected simplex reads onto the assembly gives a crazy coverage of over 2000x for the mitogenome, but 0x for the corrected set generated from those same raw reads. Does the correction algorithm discard coverage anomalies like this? All 19 assemblies have mitogenome reads in the duplex set and in the uncorrected simplex reads, but for 7 of them the correction step throws all of them out.

Cheers.
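For anyone wanting to reproduce the coverage comparison described above, here is a minimal sketch that estimates mean depth per contig from a PAF file. It assumes the reads were mapped to the assembly with something like `minimap2 -x map-ont assembly.fasta reads.fastq > reads.paf` (the file names are placeholders). The function name `mean_coverage_from_paf` is made up for this example, and the estimate is rough because it counts every alignment line, including secondary ones.

```python
from collections import defaultdict


def mean_coverage_from_paf(paf_lines):
    """Return {target_name: approximate mean depth} from PAF alignment lines.

    Depth is approximated as (sum of aligned target bases) / (target length).
    """
    aligned = defaultdict(int)  # target name -> total aligned bases on target
    lengths = {}                # target name -> target sequence length
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip empty or malformed lines
        tname = fields[5]
        tlen, tstart, tend = int(fields[6]), int(fields[7]), int(fields[8])
        lengths[tname] = tlen
        aligned[tname] += tend - tstart
    return {name: aligned[name] / lengths[name] for name in lengths}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python paf_cov.py reads.paf
    with open(sys.argv[1]) as handle:
        for name, depth in sorted(mean_coverage_from_paf(handle).items()):
            print(f"{name}\t{depth:.1f}x")
```

Running this once on the uncorrected reads and once on the corrected reads should make a drop from ~2000x to 0x on the mitochondrial contig easy to spot.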

HalfPhoton commented 2 months ago

Please see these similar issues currently under discussion: https://github.com/nanoporetech/dorado/issues/968 and https://github.com/nanoporetech/dorado/issues/962

svc-jstone commented 2 months ago

Does the correction algorithm discard coverage anomalies like this?

Dorado Correct does not explicitly discard high-coverage reads, but keep in mind that Minimap2 (used under the hood for overlapping) does have a frequency filter for k-mers. Is your entire dataset at 2000x coverage, or just the mitochondria?
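To illustrate how a k-mer frequency filter can affect very high-copy sequence, here is a toy sketch. It is not Minimap2's implementation: Minimap2 works on minimizers and by default ignores only a small top fraction of the most frequent ones (its `-f` option), whereas this example uses all k-mers and an exaggerated fraction so the effect shows up on tiny input. The function names are invented for the example, and whether this mechanism explains the lost reads in this issue is not established here.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def filter_frequent_kmers(reads, k, top_fraction):
    """Split distinct k-mers into (kept, dropped).

    The `top_fraction` most frequent distinct k-mers are dropped,
    mimicking the idea of a seed frequency filter.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    ranked = [km for km, _ in counts.most_common()]  # most frequent first
    n_drop = int(len(ranked) * top_fraction)
    dropped = set(ranked[:n_drop])
    return set(ranked) - dropped, dropped


def seeds_remaining(read, k, kept):
    """Number of k-mer positions in a read that survive the filter."""
    return sum(1 for km in kmers(read, k) if km in kept)


if __name__ == "__main__":
    # 50 copies of a "high-copy" sequence plus one "low-copy" read (toy data).
    reads = ["ACGTACGT"] * 50 + ["TTGACCAGGA"]
    kept, dropped = filter_frequent_kmers(reads, k=4, top_fraction=0.4)
    print("high-copy read seeds:", seeds_remaining("ACGTACGT", 4, kept))
    print("low-copy read seeds:", seeds_remaining("TTGACCAGGA", 4, kept))
```

In this toy data every k-mer of the high-copy read lands in the dropped set, so that read has no seeds left to anchor an overlap, while the low-copy read keeps all of its seeds.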

JWDebler commented 2 months ago

Only the mitochondria. I aim for about 50x coverage of the genome, but since there are many copies of the mitogenome per cell I often get crazy high coverage for them.

svc-jstone commented 2 months ago

What's the expected length of your mitogenome, and what does your read length distribution look like?

JWDebler commented 2 months ago

The mitogenome is 55 kb. Here are the nanoplot read distributions for the 'raw' simplex reads before correction for all 19 isolates. The ones on the left contained mitogenome reads after correction, the ones on the right lost them all.

The only thing I can see is that the losers do have a very high 'short' read peak, but that is also present in some of the 'maintainers'.

[Image: nanoplot read length distributions for the 19 isolates]
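To put a number on the 'short' read peak rather than judging it by eye from the plots, a small sketch like the following could be run per isolate. The 1000 bp cutoff is an arbitrary assumption for illustration, and `read_length_summary` is a hypothetical helper, not part of nanoplot or dorado.

```python
def read_length_summary(lengths, short_cutoff=1000):
    """Summarise a list of read lengths.

    Returns the read count, the N50, and the fraction of reads shorter
    than `short_cutoff` (an assumed threshold for the 'short' peak).
    """
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # reached half of all sequenced bases
            n50 = length
            break
    n_short = sum(1 for length in ordered if length < short_cutoff)
    return {
        "n_reads": len(ordered),
        "n50": n50,
        "short_fraction": n_short / len(ordered),
    }


if __name__ == "__main__":
    # Toy lengths only; real values would come from the FASTQ of each isolate.
    print(read_length_summary([500, 500, 800, 5000, 20000]))
```

Comparing `short_fraction` between the isolates that kept their mitogenome reads and those that lost them would show whether the short-read peak actually separates the two groups.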