nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado demultiplexing question #809

Closed ljwharbers closed 5 months ago

ljwharbers commented 6 months ago

Hi,

I was slightly confused about how the demultiplexing with custom barcoding works. I have a read set up as follows:

5' Read Primers -- Adaptor 1 -- Barcode (32nt) -- Spacer [known sequence] -- PolyA -- cDNA -- UMI (8nt) -- Adaptor 2 -- Read Primers 3'

Furthermore, I don't have just 96 possible barcodes; I have a very large list of possibilities (hundreds of thousands to a few million). Do you know if it's possible to demultiplex this with your current implementation and, if so, could you help me get started with the arrangement options as specified here?

Many thanks in advance, Luuk

tijyojwad commented 6 months ago

Hi @ljwharbers - from your description it looks like you might be attempting a UMI clustering type approach? I believe there are workflows out there that do this - maybe something like https://github.com/epi2me-labs/wf-single-cell?tab=readme-ov-file#pipeline-overview ?

In general that isn't supported well by the current design, which is intended primarily for barcodes. The barcode setup can be jury-rigged into doing it: your custom sequences FASTA would contain thousands or millions of sequences, and the arrangement would contain only a rear flank sequence (the spacer). But this will be excruciatingly slow and will likely yield pretty poor results anyway, so I wouldn't recommend trying it.

ljwharbers commented 6 months ago

Hi @tijyojwad,

Thanks for the quick response! I have indeed looked into and used the wf-single-cell pipeline. However, when it comes to allowing for 1-2 mismatches in the barcode section, it's simply too slow (it would practically never finish). I was hoping that perhaps the implementation here would work better. Ignoring the UMI bit (which I can deal with another way), isn't it correct that you actually have both the front flank sequence (adaptor) and the rear flank sequence (spacer), which shouldn't make it that slow?

Thanks again for the quick response. If you think it is still not advisable I will find a different solution and you're welcome to close the issue.

tijyojwad commented 5 months ago

Indeed it should work, but for each read dorado would basically cycle through a million alignments, which would be very slow. I suspect there's a smarter approach for finding such matches. Right now we don't have plans to support something like this in dorado, but please let us know if you do try using dorado and how it turns out :).
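
For anyone landing here with the same problem, one possible "smarter approach" is sketched below. It is not something dorado or wf-single-cell is confirmed to do, only an illustration under stated assumptions: the 32 nt barcode has already been cut out of the read (for example by locating the adaptor and spacer), and only substitutions are tolerated, not insertions or deletions. The idea is a pigeonhole seed index: if a barcode is split into 3 chunks and at most 2 mismatches are allowed, at least one chunk must match exactly, so only whitelist entries sharing a chunk with the query need to be compared. The names `BarcodeIndex`, `lookup`, `split_points` and `hamming` are hypothetical.

```python
from collections import defaultdict


def split_points(length, parts):
    """Return (start, end) pairs cutting `length` into `parts` near-equal chunks."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class BarcodeIndex:
    """Hypothetical whitelist index tolerating up to `max_mismatches` substitutions."""

    def __init__(self, barcodes, max_mismatches=2):
        barcodes = list(barcodes)
        self.length = len(barcodes[0])
        if any(len(bc) != self.length for bc in barcodes):
            raise ValueError("all barcodes must have the same length")
        self.max_mismatches = max_mismatches
        # k mismatches cannot touch all k + 1 chunks, so one chunk stays exact.
        self.chunks = split_points(self.length, max_mismatches + 1)
        self.index = defaultdict(list)
        for bc in barcodes:
            for i, (s, e) in enumerate(self.chunks):
                self.index[(i, bc[s:e])].append(bc)

    def lookup(self, query):
        """Return the unique closest barcode, or None if absent or ambiguous."""
        if len(query) != self.length:
            return None  # indels are out of scope for this sketch
        seen, hits = set(), {}
        for i, (s, e) in enumerate(self.chunks):
            for bc in self.index.get((i, query[s:e]), ()):
                if bc in seen:
                    continue
                seen.add(bc)
                dist = hamming(bc, query)
                if dist <= self.max_mismatches:
                    hits[bc] = dist
        if not hits:
            return None
        best = min(hits.values())
        winners = [bc for bc, dist in hits.items() if dist == best]
        return winners[0] if len(winners) == 1 else None
```

Each query then costs a few dictionary lookups plus a Hamming check per candidate, rather than one alignment per whitelist entry. Because nanopore errors include indels, a real pipeline would likely need an edit-distance check on the candidates instead of plain Hamming distance.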