nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Duplex and simplex read numbers do not correspond #877

Closed stas-malavin closed 3 months ago

stas-malavin commented 3 months ago

Hi, for my duplex-called reads, I get:

$ A=$(samtools view all.bam | grep -c 'dx:i:0')
$ echo $A
622019
$ B=$(samtools view all.bam | grep -c 'dx:i:1')
$ echo $B
304276
$ C=$(samtools view all.bam | grep -c 'dx:i:-1')
$ echo $C
525843

From my understanding of the Dorado documentation here on GitHub, C should equal 2B, since each duplex read (dx:i:1) has two simplex parents (dx:i:-1). However, this is not the case: 2B = 608552, but C = 525843. Is my understanding incorrect? Sorry if I've missed something; I didn't find the exact explanation in Issues. Thanks
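The three counts above can also be gathered in one pass over the file. The following is only a minimal sketch: `count_dx` is a hypothetical helper name, and it assumes SAM text on stdin (for example from `samtools view all.bam`) with the `dx` tag among the optional fields. It matches the whole tag field rather than a substring, so a pattern cannot accidentally match elsewhere on the line.

```shell
# count_dx: hypothetical helper, not part of dorado or samtools.
# Reads SAM records on stdin and tallies the dx:i tag values.
count_dx() {
    awk -F'\t' '
        !/^@/ {
            # Optional SAM fields start at column 12.
            for (i = 12; i <= NF; i++)
                if ($i ~ /^dx:i:-?[0-9]+$/) {
                    split($i, t, ":")
                    n[t[3]]++
                }
        }
        END {
            printf "simplex_no_duplex\t%d\n", n["0"]
            printf "duplex\t%d\n", n["1"]
            printf "simplex_with_duplex\t%d\n", n["-1"]
            # Two simplex parents are expected per duplex read.
            printf "expected_parents\t%d\n", 2 * n["1"]
        }'
}

# Usage (assumes samtools is installed):
#   samtools view all.bam | count_dx
```

Comparing the `simplex_with_duplex` and `expected_parents` rows shows directly whether C matches 2B.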

I'm using:

dorado:   0.7.1+80da5f5+cu11080
libtorch: 2.0.0-ont
minimap2: 2.27-r1193

malton-ont commented 3 months ago

Hi @stas-malavin,

When you run the duplex basecall, do you see a line in the output that looks like:

> Simplex reads filtered: 22000

Dorado automatically filters out reads that contain fewer than 5 bases. If you have lots of short reads, it's possible that some duplex reads were built from simplex pairs where one of the parent reads was filtered out. For example, if a 4-base simplex read paired with a 5-base simplex read to create a 5-base duplex read, you'd drop the 4-base read but still see the duplex read. That would skew your ratio.

This would also work the other way around - if the overlap of the simplex parents creates a duplex read of only a few bases, we'd filter that out as well.
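To put a number on the skew, the shortfall in simplex parents can be computed from the counts in the original post. This is only a rough sketch: `parent_gap` is a hypothetical helper, and because the two filtering effects pull in opposite directions, the result is a net figure rather than an exact count of filtered parents.

```shell
# parent_gap: hypothetical helper for illustration only.
# $1 = number of duplex reads (dx:i:1)
# $2 = number of simplex reads with duplex offspring (dx:i:-1)
# Prints how many simplex parents are missing relative to 2 per duplex read.
parent_gap() {
    echo $(( 2 * $1 - $2 ))
}

# With the counts from the original post:
parent_gap 304276 525843   # prints 82709

# If the basecalling log was saved (assumed here to be a file named
# dorado.log; the name is hypothetical), the filter line can be looked up:
#   grep 'Simplex reads filtered' dorado.log
```

A positive gap fits the first case (filtered parents), while a negative one would point to filtered duplex reads.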

stas-malavin commented 3 months ago

Hi @malton-ont, thanks a lot for the explanation. I think this is indeed the case here.