Alignment of duplex reads is incorrect or truncated in some cases, giving truncated or poor quality reads?

Delayed-Gitification commented 1 year ago

Hi,

I was hoping someone could give some insight into this problem I am having.

For my amplicon duplex reads (length ~1.2 kb), some basecall excellently in duplex with the latest Dorado version, with a median Q score above 30.

However, for a large majority of them, I find that the duplex read is truncated: only part of the region covered by both reads is basecalled in duplex.

Furthermore, in some cases (not particularly rare, >10% of my duplex reads) the alignment is completely wrong, resulting in a duplex read that is misaligned and entirely incorrect (median Q score < 10).

I attach a screenshot of an extreme example (though these are not at all uncommon in my dataset!). Here, the top read is the forward read (median Q score of above 20). The bottom read is the reverse read (less good, but still ok, median Q score of around 16). The read in the middle is the duplex read that is derived from these.

You can see that the forward and reverse reads cover the same region and have very similar sequences (as expected!). However, the duplex read is misaligned and its sequence is entirely wrong. Unsurprisingly, its Q scores are very low (median < 8).
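
For anyone wanting to check median Q scores like the ones quoted here, a minimal Python sketch follows. It assumes the per-base qualities are available as a Phred+33 ASCII string (as in a FASTQ record); `median_q` is an illustrative name, not part of Dorado.

```python
from statistics import median


def median_q(quality: str, offset: int = 33) -> float:
    """Return the median per-base Phred score of a quality string.

    Assumes Phred+33 encoding, i.e. score = ord(character) - 33.
    """
    if not quality:
        raise ValueError("empty quality string")
    return median(ord(ch) - offset for ch in quality)


if __name__ == "__main__":
    # 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
    print(median_q("IIII5555"))
```

Note that this is the median of the per-base scores, which can differ noticeably from a mean Q score computed from error probabilities.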

So two questions:

1. Is it expected that the duplex read is often truncated compared to the simplex reads that give rise to it?
2. Any idea why, in some cases, the duplex alignment and basecalling seem to fail so dramatically?

Thanks again to the developers for all their work on this

(note - I've used the forward read as the reference here and aligned the duplex read and the reverse read to it)

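[Screenshot, 2023-10-27: forward read (top), duplex read (middle), and reverse read (bottom), with the duplex and reverse reads aligned to the forward read]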

Delayed-Gitification commented 1 year ago

Attached (example.zip) are the pod5 and BAM files for the example above, basecalled with default parameters and the super-accuracy 4 kHz model dna_r10.4.1_e8.2_400bps_sup@v4.1.0.

tijyojwad commented 1 year ago

Hi @Delayed-Gitification - thanks for sharing the data!

My hunch is that our heuristics are picking up false positive pairs because of the amplicons. In general, duplex doesn't work super well with amplicons yet. Our next release (expected in a day or two) will include some updates that should help with this. I'll ping this thread once it is out, and it would be great to get your feedback.

Delayed-Gitification commented 1 year ago

Oh that's great news, looking forward to it.

In this case they are definitely true positives (the experiment was designed to ensure this), so hopefully the updates you are releasing fix this!

tijyojwad commented 1 year ago

Hi @Delayed-Gitification - we just released the updated version of dorado (v0.4.2); see https://github.com/nanoporetech/dorado#installation for installation. Please let me know if you see any improvement. The main change is to restrict pairing to reads that are adjacent when ordered by sequencing time.
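
To illustrate what "adjacent when ordered by sequencing time" can mean, here is a rough Python sketch. It is not Dorado's implementation: the `Read` fields, the `adjacent_pairs` name, and the grouping by channel are all assumptions made for the example, and the real heuristics apply further checks that are not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    read_id: str
    channel: int        # assumption: candidates must come from the same channel
    start_time: float   # seconds since the start of the run


def adjacent_pairs(reads):
    """Return (earlier_id, later_id) for reads adjacent in time on a channel."""
    by_channel = defaultdict(list)
    for read in reads:
        by_channel[read.channel].append(read)

    pairs = []
    for channel in sorted(by_channel):
        ordered = sorted(by_channel[channel], key=lambda r: r.start_time)
        # Only immediate neighbours in time are considered as candidates.
        for first, second in zip(ordered, ordered[1:]):
            pairs.append((first.read_id, second.read_id))
    return pairs
```

Under this rule, a read can only be paired with the read sequenced immediately before or after it, which rules out pairing two similar amplicon reads that were sequenced far apart.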

Delayed-Gitification commented 1 year ago

Unfortunately, I don't see an improvement here.

tijyojwad commented 1 year ago

Got it, thank you for testing! We'll have a look at your sample dataset.

Delayed-Gitification commented 1 year ago

Thanks! Just to note that the two reads in the .pod5 file were split by some custom software, so some of the metadata values have been imputed. The actual signal is unchanged, though, apart from being split into two, and simplex basecalling works very well for both pod5 entries.

(The read is derived from an amplicon with a hairpin adapter at one end, meaning both strands are read in a single read. My code then splits this into two reads (upstream and downstream of the hairpin adapter) in a new pod5 file.)
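
A minimal Python sketch of this kind of split, for illustration only (it is not the custom software itself). It assumes the raw signal is a sequence of samples and that the hairpin adapter's sample range is already known; `split_at_hairpin`, `hairpin_start`, and `hairpin_end` are hypothetical names, and locating the adapter and writing the new pod5 records are not covered.

```python
def split_at_hairpin(signal, hairpin_start, hairpin_end):
    """Split a raw signal into the parts before and after the hairpin adapter.

    hairpin_start and hairpin_end are sample indices (end exclusive) that
    delimit the adapter; the adapter samples themselves are dropped.
    """
    if not 0 < hairpin_start <= hairpin_end < len(signal):
        raise ValueError("hairpin bounds must lie strictly inside the signal")
    upstream = signal[:hairpin_start]
    downstream = signal[hairpin_end:]
    return upstream, downstream
```

Because the two halves come from one original read, per-read metadata such as start times has to be filled in for each new entry, which is where the imputed values mentioned above come in.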

nanoporetech / dorado, issue #441