nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
482 stars 59 forks source link

Very few duplex reads with short amplicons #268

Open HenkvdMeulen opened 1 year ago

HenkvdMeulen commented 1 year ago

Somehow I'm ending up with a very low number of duplex reads after using Dorado duplex basecalling with the following command (used the ligation sequencing kit V14 SQK-LSK114):

dorado duplex dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5/directory/ > duplex.bam

I extracted the duplex reads from this file using samtools view duplexBarcoded/barcode01.bam | grep "dx:i:1"| samtools view -O BAM -o duplexOnly.bam

However, only 615 out of nearly 2M reads turn out to be duplex reads. What could be the issue here? Does anyone have suggestions as to what might be going wrong and how I could solve it? Thanks in advance!

vellamike commented 1 year ago

Hi @HenkvdMeulen

A couple of questions:

  1. What flowcells are you using? Are these high duplex flowcells?
  2. How long are your reads? What is their N50?
HenkvdMeulen commented 1 year ago

Hi @vellamike

  1. I'm using the R10.4.1 flow cell. is there any difference among those, I would assume not?
  2. The amplicon length is expected to be around 450 bp with primers, tails and barcodes. For some reason the MinION did not produce a run report that I can check now, However, there was a very tight peak in the read length around the expected amplicon size. hope that is sufficiently helpful, sorry! I tried to retrieve the report from MinKnow but it doesn't show the experiment anymore, even though I am logged in and changed the number of past days to the right amount.
vellamike commented 1 year ago

Hi @HenkvdMeulen,

Dorado has a known issue whereby duplex basecalling of short reads (<1kb) results in low yields. This is an result of the duplex algorithm which has been optimised for longer reads. We are working on a fix and will update this issue when we release it, but at present what you are trying to do will unfortunately not work very well.

Mike

HenkvdMeulen commented 1 year ago

Hi @vellamike, That's good to know, thanks! I will patiently wait for the update to arrive. Do you have any estimation on the time until we could expect this fix?

Best regards, Henk

HenkvdMeulen commented 1 year ago

Hi @vellamike, is there any progress yet in regard to this issue? Do you know when this fix will be available in an update? Thanks!

Best regards, Henk

quintonwessells commented 1 year ago

Hi @vellamike

I'm also trying to perform duplex basecalling on reads < 1kb. It is tagging most reads with dx:i:0 (simplex without duplex offpring) and some reads with dx:i:-1 (simplex reads which have duplex offsprings), but no reads actually get tagged with dx:i:1 (duplex). Shouldn't it produce some duplex reads tagged with dx:i:1 if there are dx:i:-1 simplex reads present?

Thanks

tijyojwad commented 1 year ago

Hi @quintonwessells which version of dorado are you using? your issue sounds like a bug we fixed in 0.3.4

we haven't made any explicit improvements for short read duplex yet. we made some updates to long read duplex yield in the most recent release, and we are in the process of discussing short read related changes.

quintonwessells commented 1 year ago

Hi @tijyojwad, I was using 0.3.3. Just tried it with 0.3.4 and now I'm not getting any dx:i:-1 or dx:i:1 reads, so seems that is sort of fixed even though I can tell some of the reads marked as simplex are actually duplex. Thanks for keeping us posted on status of short read related changes!

tijyojwad commented 1 year ago

even though I can tell some of the reads marked as simplex are actually duplex

what do you mean by this?

alexanderwg-ornl commented 1 year ago

I think I might be running into the same length limitation. I was playing around with forcing dorado to duplex call reads I knew were complementary to each other by passing the --pairs flag, but it appears that the reads are filtered. Below is a partial output of my experimental run. Is dorado filtering by length or is there another criteria I'm missing that it's using to reject this pair?

[2023-09-07 11:26:51.844] [info] > Starting Stereo Duplex pipeline [2023-09-07 11:26:51.853] [info] > Reading read channel info [2023-09-07 11:26:51.854] [info] > Processed read channel info [2023-09-07 11:26:54.204] [debug] hits 1, mapq 60, overlap length 316, overlap frac 0.8885542, delta -1103, read 1 376, read 2 332, strand -, pass true, accepted true, temp start 27 temp end 343, comp start 18 comp end 313, 7239544e-c176-4d8b-93d4-8cc9da1bac8f and f2f76b5d-0c16-453b-9720-2efd1ca01273 [2023-09-07 11:26:54.964] [info] > Duplex reads basecalled: 0 [2023-09-07 11:26:54.964] [info] > Duplex reads filtered: 1 [2023-09-07 11:26:54.964] [info] > Duplex rate: 0.56497175% [2023-09-07 11:26:54.964] [info] > Basecalled @ Bases/s: 3.552131e+01

tijyojwad commented 1 year ago

Unless the min-qscore option was specified in the cmdline, the only other filter enabled by default is the min sequence length, which is 5. so even though the pairing is accepted and a duplex read is called, the length could be super small. probably because the model hasn't really seen data of that size so it doesn't know what to do with it

alexanderwg-ornl commented 1 year ago

Unless the min-qscore option was specified in the cmdline, the only other filter enabled by default is the min sequence length, which is 5. so even though the pairing is accepted and a duplex read is called, the length could be super small. probably because the model hasn't really seen data of that size so it doesn't know what to do with it

Thanks for the reply! According to dorado's help file for duplex, the min-qscore default is 0 so I never altered it; does that mean the filter is off or is it behaving in some other way? I also don't see a minimum sequence length flag in the help documentation, is this an actual option I can set?

tijyojwad commented 1 year ago

does that mean the filter is off or is it behaving in some other way?

correct, min qscore 0 means no filtering is happened based on the read's mean qscore

I also don't see a minimum sequence length flag in the help documentation, is this an actual option I can set?

unfortunately not. but that minimum length is 5, which we thought was a good threshold and anything smaller than that is most likely garbage anyway. from your log I'd expect the duplex read to be ~200 bp or longer...

alexanderwg-ornl commented 1 year ago

I see, so it sounds like it is the same problem as the OP: the duplex basecaller is rejecting these shorter alignments. Thanks for your insight!

Delayed-Gitification commented 10 months ago

@vellamike Do you have any info on when duplex pairing for short amplicons might be ready? Thanks!

vellamike commented 9 months ago

Apologies for the delay in replying. Duplex pairing for short amplicons is not something which we will deliver soon. It is currently considered a research problem.