nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

demultiplexing barcodes flanked by UMIs #591

Closed RainerWaldmann closed 6 months ago

RainerWaldmann commented 8 months ago

I need to demultiplex barcodes that are flanked by UMIs (NEB Unique Dual Index UMI Adaptors). The layout is: left flanking sequence - BARCODE - UMI (N8) - right flanking sequence. Before we generate the libraries with UMIs, it would be great to know in advance whether we can demultiplex them with Dorado. For double indexing with Illumina 10 nt barcodes we currently use the following masks

IllRW_1st AACAAGCAGAAGACGGCATACGAGATNNNNNNNNNNGTCTCGTGGGCTCGGAGATGTG
IllRW_2nd CGACCACCGAGATCTACACNNNNNNNNNNTCGTCGGCAGCGT

and a list of barcodes. This works for barcodes without UMIs. With the UMI adapters, each barcode is directly flanked by an 8 nt UMI, a random sequence. Will adding 8 additional Ns to the mask and adding 8 Ns to each barcode sequence work? If not, are there any other options to demultiplex barcodes flanked by Ns with Dorado?

Thanks, Rainer

tijyojwad commented 8 months ago

Hi @RainerWaldmann

Will adding 8 additional Ns to the mask and adding 8 Ns to each barcode sequence work?

This would be my suggestion too. Unfortunately, we don't support barcode matching with Ns in dorado right now. However, that support can be added, and it would be a valuable feature for UMI-based workloads. If I provide you with a build, would you be able to test it out?

You may also have to adjust some score thresholds, since the barcode score is calculated as 1.0 - (edit_dist / barcode_length). With Ns in the barcode, the barcode length is increased but the edit distance may still be low, so the scores will appear inflated.
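To make the inflation concrete, here is a minimal sketch of the score formula quoted above. The function name `barcode_score` is hypothetical and this is not dorado's code; it also assumes that the padded Ns match any base at no cost and that all edits fall within the real barcode bases.

```python
def barcode_score(edit_dist: int, barcode_length: int) -> float:
    """Score as described in the thread: 1.0 - (edit_dist / barcode_length)."""
    if barcode_length <= 0:
        raise ValueError("barcode_length must be positive")
    return 1.0 - (edit_dist / barcode_length)


# Assumption: 2 edits in the 10 real barcode bases; the 8 padded Ns add
# length but contribute no edits.
plain = barcode_score(2, 10)        # barcode only
padded = barcode_score(2, 10 + 8)   # barcode padded with 8 Ns
print(f"plain={plain:.3f} padded={padded:.3f}")
```

With the same two errors, the score rises from 0.80 to roughly 0.89 purely because the denominator grew, so a threshold tuned for unpadded barcodes would become more permissive.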

RainerWaldmann commented 8 months ago

Hi @tijyojwad

Adding Ns to the barcode would be more of a workaround, with the issues you mentioned. Another option that might be better, if the current code supports it, is to provide a mask with just one flanking sequence and the barcodes as usual. For example, the mask for the barcode that is flanked by the UMI would be

IllRW_1st AACAAGCAGAAGACGGCATACGAGATNNNNNNNNNN

This would avoid the scoring issues, provided the current software supports it. Identification of the start position of the barcode should be precise enough with current Nanopore sequencing accuracy.

A more ideal solution, if you plan to have a more universal and extensible option, would be a mask where Ns define the barcode sequence and another character, e.g. Z, defines the UMI. For example:

IllRW_1st AACAAGCAGAAGACGGCATACGAGATNNNNNNNNNNZZZZZZZGTCTCGTGGGCTCGGAGATGTG

I guess UMIs will be increasingly used in Nanopore sequencing, and such a mask would also allow extraction of the UMI sequence in the long run. It could even handle the first part of the single-cell workflow (cell barcode and UMI extraction).
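As a rough illustration of the proposed mask syntax (N for barcode positions, Z for UMI positions), the sketch below parses such a mask and pulls both fields out of a read. This syntax is only the suggestion made in this thread, not something dorado supports; the names `MaskLayout`, `parse_mask` and `extract` are hypothetical, and the flanks are matched exactly for brevity, whereas real reads would need error-tolerant alignment.

```python
import re
from typing import NamedTuple, Optional, Tuple


class MaskLayout(NamedTuple):
    front: str        # left flanking sequence
    barcode_len: int  # number of N positions
    umi_len: int      # number of Z positions
    rear: str         # right flanking sequence


def parse_mask(mask: str) -> MaskLayout:
    """Split a mask of the form <front><N...><Z...><rear> into its parts."""
    m = re.fullmatch(r"([ACGT]*)(N+)(Z*)([ACGT]*)", mask)
    if m is None:
        raise ValueError(f"unsupported mask: {mask!r}")
    front, n_run, z_run, rear = m.groups()
    return MaskLayout(front, len(n_run), len(z_run), rear)


def extract(read: str, layout: MaskLayout) -> Optional[Tuple[str, str]]:
    """Return (barcode, umi) if both flanks match exactly, else None."""
    pos = read.find(layout.front)
    if pos < 0:
        return None
    bc_start = pos + len(layout.front)
    umi_start = bc_start + layout.barcode_len
    rear_start = umi_start + layout.umi_len
    if len(read) < rear_start:
        return None
    if read[rear_start:rear_start + len(layout.rear)] != layout.rear:
        return None
    return read[bc_start:umi_start], read[umi_start:rear_start]


# Short made-up mask and read, for illustration only.
layout = parse_mask("AACCNNNNZZZGGTT")
print(extract("TTTAACCACGTTGAGGTTCC", layout))
```

Keeping the barcode and UMI as separate runs in the mask means the barcode length used for scoring is unaffected by the UMI, which is the main advantage over padding each barcode with Ns.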

Initially I was thinking about adapting the Java code I wrote for single cell barcode and UMI extraction (ucagenomix/sicelore-2.1). But I guess this should be possible with Dorado and I would be interested to give it a try.

tijyojwad commented 8 months ago

Hi @RainerWaldmann -

Another option that might be better, if the current code supports it, is to provide a mask with just one flanking sequence and the barcodes as usual.

Yes, this is already supported. In the custom barcode arrangement you can just leave mask1_rear empty, and dorado will only use the sequence from the front flank.

We'll take the additional mask for UMIs into consideration! It's a good suggestion.
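Conceptually, matching with only the front flank means the barcode window is anchored at the end of that flank and whatever follows (here the UMI) is ignored. The sketch below illustrates the idea under simplifying assumptions: exact flank matching and Hamming distance on the barcode window. It is not dorado's algorithm, and `classify_front_flank_only` and its parameters are hypothetical names.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify_front_flank_only(
    read: str,
    front_flank: str,
    barcodes: Dict[str, str],
    max_mismatches: int = 1,
) -> Optional[str]:
    """Pick the barcode that best matches the bases right after the front flank."""
    if not barcodes:
        return None
    pos = read.find(front_flank)
    if pos < 0:
        return None
    start = pos + len(front_flank)
    length = len(next(iter(barcodes.values())))  # assumes equal-length barcodes
    window = read[start:start + length]
    if len(window) < length:
        return None
    name, dist = min(
        ((n, hamming(window, seq)) for n, seq in barcodes.items()),
        key=lambda item: item[1],
    )
    return name if dist <= max_mismatches else None


# Made-up flank, barcodes and read; the bases after the barcode stand in for
# the UMI and the rear flank, neither of which is inspected.
barcodes = {"bc01": "AAAACCCC", "bc02": "GGGGTTTT"}
read = "TT" + "GAGAT" + "AAAACCCG" + "ACGTACGT" + "GTCT"
print(classify_front_flank_only(read, "GAGAT", barcodes))
```

Because the random UMI bases never enter the comparison, the barcode length used for scoring stays at its true value.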

tijyojwad commented 6 months ago

Hi @RainerWaldmann - I'm closing this ticket since I haven't heard back from you on using the mask1_rear option for your custom setup. I'd be curious to know whether it worked out, if you're willing to share results!

RainerWaldmann commented 6 months ago

Hi Joyjit

I'll try it within the next two weeks. We had some delays generating the libraries.

RainerWaldmann commented 5 months ago

Hi @tijyojwad

I tried with mask1_rear empty (the side where the UMI is located). It works better than I expected, despite the very short (8 nt) NEB barcodes (96-plex double indexing). The wrong barcode with the most reads gets 10,000x fewer reads than the correct barcode.

tijyojwad commented 5 months ago

Awesome, glad to hear it!