nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

A few duplex questions #777

Closed godotgildor closed 5 months ago

godotgildor commented 5 months ago

I am interested in duplex calling using Dorado but have a few questions that I can't find the answers to.

The Dorado README describes a process for splitting pod5 files to enable distributed duplex calling.

  1. From previous duplex descriptions, my understanding was that paired duplex reads would be from the same pore and would immediately follow one another temporally. Would they not then be expected to show up in the same pod5 file?
  2. It's not clear to me how reads are paired by the software. Since the pod5 split procedure splits on well, it does seem that pairing will only pair reads from the same well (which matches my previous understanding). But is the temporal aspect considered? I wonder because I am sequencing an amplicon library of DNA where individual members of the library are closely related - same length and very similar sequence. Without some strict pairing rules, like a pair must transit the same pore within X time and be the opposite orientation, then I would be concerned that I might pair different members of my library, mistakenly thinking that they are the same DNA since they are the same size and very similar sequence, but in fact they are different library members. The resulting duplex sequence would be essentially an average of two similar, but not identical DNA strands?
  3. Given my use case of sequencing an amplicon of closely related sequences, would trying to leverage duplex calling be problematic? Any processes I should perform special for this use case?

vellamike commented 5 months ago

From previous duplex descriptions, my understanding was that paired duplex reads would be from the same pore and would immediately follow one another temporally. Would they not then be expected to show up in the same pod5 file?

Yes, they would generally be expected to show up in the same POD5 file. The reason to split the pod5 files as described is a performance optimisation to improve disk access patterns by having pairs more closely located on disk.
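
To illustrate the idea of keeping potential pairs close together, here is a minimal Python sketch that groups read IDs by channel, so that reads from the same pore end up in the same bucket. It is only an illustration of the grouping concept, not Dorado's or the pod5 tooling's actual implementation; the function name `group_read_ids_by_channel` and the sample read IDs and channel numbers are made up for the example.

```python
from collections import defaultdict


def group_read_ids_by_channel(reads):
    """Group read IDs by the channel they were sequenced on.

    `reads` is an iterable of (read_id, channel) tuples. This is a
    hypothetical in-memory stand-in for a read summary table.
    """
    groups = defaultdict(list)
    for read_id, channel in reads:
        groups[channel].append(read_id)
    return dict(groups)


# Made-up example data: five reads spread over three channels.
reads = [("r1", 12), ("r2", 40), ("r3", 12), ("r4", 40), ("r5", 7)]
by_channel = group_read_ids_by_channel(reads)

for channel, read_ids in sorted(by_channel.items()):
    print(channel, read_ids)
```

Each bucket could then be written to its own file, so that a job processing one file sees every read that might pair with another read in that file.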

But is the temporal aspect considered? I wonder because I am sequencing an amplicon library of DNA where individual members of the library are closely related - same length and very similar sequence. Without some strict pairing rules, like a pair must transit the same pore within X time and be the opposite orientation, then I would be concerned that I might pair different members of my library, mistakenly thinking that they are the same DNA since they are the same size and very similar sequence, but in fact they are different library members. The resulting duplex sequence would be essentially an average of two similar, but not identical DNA strands?

Yes, the time delta and order of pairs are considered.
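
As a rough illustration of the kind of rule discussed in the question (same pore, short time gap, second read follows the first, opposite orientation), here is a minimal Python sketch. It is not Dorado's pairing algorithm: the `Read` fields, the `strand` label, and the `MAX_GAP_S` threshold are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    read_id: str
    channel: int    # pore/channel the read came from
    start_s: float  # start time in seconds (hypothetical field)
    end_s: float    # end time in seconds (hypothetical field)
    strand: str     # "+" or "-" (hypothetical orientation label)


MAX_GAP_S = 5.0  # hypothetical threshold, not a Dorado default


def is_candidate_pair(first, second, max_gap_s=MAX_GAP_S):
    """True if `second` follows `first` on the same channel within
    `max_gap_s` seconds and has the opposite orientation."""
    gap = second.start_s - first.end_s
    return (
        first.channel == second.channel
        and 0.0 <= gap <= max_gap_s
        and first.strand != second.strand
    )


def candidate_pairs(reads, max_gap_s=MAX_GAP_S):
    """Check only consecutive reads within each channel."""
    by_channel = {}
    for read in sorted(reads, key=lambda r: (r.channel, r.start_s)):
        by_channel.setdefault(read.channel, []).append(read)
    pairs = []
    for channel_reads in by_channel.values():
        for first, second in zip(channel_reads, channel_reads[1:]):
            if is_candidate_pair(first, second, max_gap_s):
                pairs.append((first.read_id, second.read_id))
    return pairs


# Made-up example data.
reads = [
    Read("a", 1, 0.0, 10.0, "+"),
    Read("b", 1, 10.5, 20.0, "-"),  # follows "a" closely, opposite strand
    Read("c", 1, 60.0, 70.0, "+"),  # same channel, but gap is too long
    Read("d", 2, 10.2, 19.0, "-"),  # different channel
]
pairs = candidate_pairs(reads)
print(pairs)
```

Note that a filter like this cannot tell two different library members apart if they happen to pass through the same pore back to back in opposite orientations, which is the mis-pairing concern raised in the question.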

Given my use case of sequencing an amplicon of closely related sequences, would trying to leverage duplex calling be problematic? Any processes I should perform special for this use case?

Given your use case (amplicons of closely related sequence) I advise that you do not use duplex. We are working on updates which will improve duplex calling in amplicons, but at the present time there is a risk of mis-pairing.