nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
458 stars 56 forks source link

[question] duplex and barcode demultiplexing? #522

Open JWDebler opened 7 months ago

JWDebler commented 7 months ago

Hi, I am wondering if there is currenly a way to do barcode demultiplexing and duplex calling using only dorado, and if so, what would be the best way to do it?

vellamike commented 7 months ago

Hi @JWDebler , duplex barcode demux is currently not implemented in Dorado. The best way to currently do it would be to demux the simplex reads and then use the duplex read IDs (which are a concatenation of simplex read IDs) to demux.

tijyojwad commented 7 months ago

we have this on our roadmap and we plan to implement this for a future release

vdruelle commented 3 months ago

Are there updates on the timeline to release barcode demultiplexing for duplex ? It seems to be a heavily requested feature for a while now... Since the duplex reads are already in the data it's a shame to not be able to use them :/

JWDebler commented 3 months ago

You can extract them, it's just a bit annoying as it basically means basecalling twice. At least that's what we are currently doing. Basecall in simplex mode, classify into barcodes, extract barcode reads into their own pod5 files and then duplex call those individually:

location=/path/to/raw/data/containing/folder

dorado basecaller sup -r $location --kit-name SQK-NBD114-24 > all.bam && \
dorado demux --output-dir simplex --no-classify all.bam && \
rm simplex/*unclassified* && \
for file in simplex/*.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view -h $file | cut -f 1 > simplex/$id.readids.txt; done && \
for file in simplex/*readids.txt; do id=$(echo "$file" | grep -oP 'barcode\d+'); grep -v "@" $file > simplex/$id.clean.txt; done && \
for file in simplex/*clean.txt; do id=$(echo "$file" | grep -oP 'barcode\d+'); pod5 filter -r $location -i $file -t 10 --missing-ok --duplicate-ok --output $id.pod5; done && \
for file in *.pod5; do id=$(echo "$file" | grep -oP 'barcode\d+'); dorado duplex sup $file > $id.duplex.untrimmed.bam; done && \
for file in *untrimmed.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); dorado trim $file > $id.duplex.bam && \
rm *untrimmed.bam && \
for file in *duplex.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view  -@ 8 -O fastq -d dx:1 $file | pigz -9 > $id.duplex.fastq.gz; done && \
for file in *duplex.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view  -@ 8 -O fastq -d dx:0 $file | pigz -9 > $id.simplex.fastq.gz; done 
vdruelle commented 3 months ago

I see, thanks a lot for the details of your implementation !

mbhall88 commented 3 months ago

I created this script to demux a duplex basecalled BAM. Takes a summary file after demultiplexing simplex reads and your duplex called BAM. There’s some options for dealing with situations where the two reads that make a duplex read don’t have the same barcode in simplex mode or if one is unclassified. Also supports fastq output

vdruelle commented 3 months ago

Thanks for your answer and the link to your code ! Great work on the manuscript as well, it's super nice to see the variant calling metrics there (also shows that duplex is maybe not essential ^^).

JWDebler commented 3 months ago

interesting @mbhall88. I'm a bit confused. How did you get barcodes in the duplex bam file? I thought dorado couldn't do duplex calling and barcode assignment at the same time yet, which is why we're doing it the long way around.

mbhall88 commented 3 months ago

There’s no barcodes in the duplex bam. But that script uses the assignment from the sequencing summary file from the simplex basecalling to assign barcodes for the duplex reads. I also found cases when testing that there were sometime simplex reads that are part of a duplex read that have different barcode assignments (normally it was just that one was unclassified and the other was classified) so I added options for how you want to deal with this

JWDebler commented 3 months ago

Ahhh, ok, that makes sense. Do you do simplex live calling during the run then? We always use post run basecalling, and I don't think that produces the summary file.

mbhall88 commented 3 months ago

No I generally basecall after the fact also. But you can create a summary file with https://github.com/nanoporetech/dorado?tab=readme-ov-file#sequencing-summary

the demux command also has an —emit-summary option which will place a summary file in the output directory (see https://github.com/nanoporetech/dorado?tab=readme-ov-file#classifying-existing-datasets)

vdruelle commented 3 months ago

Thanks for the details, I'll give it a try ! Looking at your script it does not seem super complicated to demultiplex duplex reads. I'm a bit puzzled why there is no native way of doing it with Dorado...

hagen-wende commented 2 months ago

This is my current implementation, maybe useful for somebody ... Assuming you have seqkit and samtools installed in the respective conda environments you can run this bash script after simplex basecalling from the nanopore output directory (the one that contains the fastq_pass and pod5 folder)

dorado_duplex_bact.sh

This uses the bacterial research model for the basecalling, just change this to "sup" or "hac" in line 53 if you want to use the standard models

luckybillion commented 4 days ago

You can extract them, it's just a bit annoying as it basically means basecalling twice. At least that's what we are currently doing. Basecall in simplex mode, classify into barcodes, extract barcode reads into their own pod5 files and then duplex call those individually:


location=/path/to/raw/data/containing/folder

dorado basecaller sup -r $location --kit-name SQK-NBD114-24 > all.bam && \

Just to clarify, when running doradobase caller would --no-trim need to be added for the subsequent demultiplex step to work? my understanding is that otherwise the barcodes are removed and dorado demux won't work?