Open vzg100 opened 9 months ago
Hello @vzg100,
I was able to replicate the problem. It appears that when you run dorado aligner
with fastq input it does not keep the MM/ML/MN tags. Instead of converting to fastq (conversion script), try using the output from dorado directly. You can even skip the alignment step:
dorado basecaller \
dna_r10.4.1_e8.2_400bps_sup@v4.3.0 sample.pod5 \
# Use a valid kit name, of course.
--kit-name kit \
# pass a reference to basecaller and align at the same time
--ref assembly.fasta \
--modified-bases "5mCG_5hmCG" \
> sample_mapped.bam
# ... samtools sort, index, etc.
modkit ..
Hi @ArtRand,
Thank you for the quick and thoughtful reply.
My bad if I failed to RTFM, but how do you separate reads by barcode using this approach?
Hello @vzg100,
Dorado will handle that as well (barcode docs). Make sure to use the latest version >=0.5.2, before that there was a bug with trimming the modified base tags.
Hi @ArtRand,
Again, thank you for the fast response! Based on my understanding of the dorado docs I don't think that workflow would work particularly well for bacterial assemblies. For each barcode we have a different assembly. So we would need to be able to filter reads by barcode before mapping.
Unless there is a way to filter by barcode in the basecaller?
Hello @vzg100,
Maybe I'm missing something obvious, but could you partition your reads based on the BC
tag?
$ samtools view -bh --tag BC:<barcode> ${bam} -o ${barcode_bam}
All workflows are different - so I understand if you already have something that's working and don't want to change. You could use modkit repair
(docs) to replace the MM/ML/MN tags in your alignments.
Hi @ArtRand,
Thank you for bringing up the BC tag for samtools - I wasn't aware of that!
I'll go ahead and try that out
@vzg100 any luck?
Hi @ArtRand,
Thank you for your patience and suggestions - rebasecalling took minute
Unfortunately running the code blow didn't work:
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v4.3.0 pod5_dir \
--kit-name ${kit} \
--reference ${assembly} \
--modified-bases "5mCG_5hmCG" > sample.bam
samtools view -bh --tag BC:01 sample.bam -o sample_filtered.bam
samtools sort sample_filtered.bam -o sample_sorted_filtered.bam
samtools index sample_sorted_filtered.bam -o sample_sorted_filtered.bai
What ended up working was demultiplexing with dorado demux and then running the (sorted and indexed) demuxed bam file through modkit. Though this approach involves basecalling twice - first to make the assembly and then second to align the modified bases.
Hello @vzg100,
Could you elaborate on what exactly didn't work? Where are you loosing the base modification calls? You should definitely not need to basecall your data twice. At the very least if there is a step where you loose the MM/ML/MN tags you can use modkit repair
to project the base modifications back onto the reads.
I'm trying to extract methylation data from bacterial genomes using Dorado 0.5.0 and modkit 2.4 (using the bioconda hosted package).
I've aligned my reads using Dorado align and sorted/indexed my BAM file (shown below). However, when I run modkit I encounter the following error for every read I map: `Skipped: AUX data not found
I'm not really sure how to debug this error - is it just a parameter I'm missing or is there something else?
Alignment script:
Basecalling Script:
It is worth noting the basecalling script basecalls each pod5 seperately and then merges and demultiplexes them. I'm on a cluster so it's easy to run in parallel across multiple GPUs.
Conversion Script: