Differences in number of reads when using duplex and simplex basecalling

LeLeiu commented 1 month ago

Hello

I am currently trying to compare simplex base calling with duplex base calling using Dorado. I am using raw binary outputs from the Nanopore sequencer (.pod5 files) with R10.4 flow cell.

For each of those files, I do simplex base calling using the last hac model, and duplex base calling with the latest sup model. However, I end up with more reads after duplex base calling than simplex base calling for the same pod5 file used in input.

Is this supposed to happen ? And if so, why ? This may be because of the different model used (hac compared to sup), but it is just a guess. I would've thought that the "duplex reads" that you get with duplex base calling would be included in the ones in simplex base calling but higher quality since the Nanopore reads both strand of the DNA to produce an output.

Further more I do not understand what the different tags for reads in the bam file avec base calling on the Github page:

The dx tag in the BAM record for each read can be used to distinguish between simplex and duplex reads: -dx:i:1 for duplex reads -dx:i:0 for simplex reads which don't have duplex offsprings. -dx:i:-1 for simplex reads which have duplex offsprings.

For me, dx:i:1 are the reads base called using the sequence of both strands of the DNA. dx:i:0 are the reads base called only using one strand of the DNA as the complementary strand one did not go through the Nanopore.

It is the dx:i:-1 tag that I don't understand, are those reads the two strands of DNA sequenced to produce a duplex read ? Should I keep all those reads to continue my analysis of Nanopore data or should I delete some as there may be some reads that are counted multiple times ? I have compared the amount of the different tags in my bam file to see if these numbers line up with the different amounts of reads with simplex/duplex outputs but they don't.

I am just having trouble understanding this part of the documentation and why I have more reads when using duplex instead of simplex. If anyone could enlighten me I would greatly appreciate it.

Thank you!

HalfPhoton commented 1 month ago

Hi @LeLeiu, Can you be more specific with how many more reads you're seeing in duplex? Also, how are you counting the number of reads in the output bam?

LeLeiu commented 1 month ago

Hi @HalfPhoton, For example for one sample (so the same pod5 file) I get 648 955 reads in duplex and 583 578 in simplex. I tested both fastq and bam outputs.

Duplex basecalling: For fastq I use Falco to do a quality control and get basic statistics on my data. For bam outputs I first translate them into sam files with sam tools and then count the number by doing a ctrl +F of the three different tags (dx:i:0, dx:i:1 and dx:i:-1) and see if it matches with the number of total reads in Falco.

For simplex reads: Since there are no tags in the bam outputs with simplex basecalling I just rely on the Falco output with fastq format.

In average I would say I have 10-15% more reads in duplex compared to simplex.

tijyojwad commented 1 month ago

Hi @LeLeiu - this is expected because the output of duplex basecalling will include both simplex and duplex reads (as described by the tags you highlighted in the original post).

Duplex reads are generated from a pair of simplex reads that satisfy the duplex pairing requirements. So a read that's dx:1 is such a duplex read.

dx:0 are simplex reads which didn't participate in any duplex read generation because, like you said, the complement pair was not found.

dx:-1 are the simplex reads which did participate in a duplex read generation. We separate out this tag so that users who want to remove all simplex reads from which a duplex was generated can do so easily.

LeLeiu commented 1 month ago

Hi @tijyojwad Thank you for your reply.

I am working with gut microbiota data from different people. If were to compare both simplex and duplex basecalling on the same pod5 file (so gut microbiota from the same person), I would need to exclude reads with the dx:-1 tag as they would contain the same information as the dx:0 reads ? Meaning I would have some information that is doubled and could bias the biological interpretations ?

tijyojwad commented 1 month ago

Hi @LeLeiu - the dx:-1 reads are the ones used to generate the duplex reads. So they would contain overlapping information with the dx:1 reads. The dx:0 reads are the ones that didn't participate in duplex generation, so that data is not redundant.

LeLeiu commented 1 month ago

Hi @tijyojwad Yes sorry I meant to say dx:1 reads. Thank you for your help!

nanoporetech / dorado

Differences in number of reads when using duplex and simplex basecalling #815