Skipped: AUX data not found for methylation

vzg100 commented 9 months ago

I'm trying to extract methylation data from bacterial genomes using Dorado 0.5.0 and modkit 2.4 (using the bioconda hosted package).

I've aligned my reads using Dorado align and sorted/indexed my BAM file (shown below). However, when I run modkit I encounter the following error for every read I map: `Skipped: AUX data not found

failed to get modbase info for record b21ac088-3198-4294-a334-8a76070f5c97`.

I'm not really sure how to debug this error - is it just a parameter I'm missing or is there something else?

Alignment script:

dorado aligner assembly.fasta reads.fastq.gz  > sample.bam
samtools sort sample.bam -o sample_sorted.bam
samtools index sample_sorted.bam -o  sample_sorted.bai

Basecalling Script:

It is worth noting the basecalling script basecalls each pod5 seperately and then merges and demultiplexes them. I'm on a cluster so it's easy to run in parallel across multiple GPUs.

dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v4.3.0 sample.pod5 --kit-name kit --modified-bases "5mCG_5hmCG" > sample.bam

Conversion Script:

samtools fastq -T "*" sample.bam > sample.fastq

ArtRand commented 9 months ago

Hello @vzg100,

I was able to replicate the problem. It appears that when you run dorado aligner with fastq input it does not keep the MM/ML/MN tags. Instead of converting to fastq (conversion script), try using the output from dorado directly. You can even skip the alignment step:

dorado basecaller \
  dna_r10.4.1_e8.2_400bps_sup@v4.3.0 sample.pod5 \
  # Use a valid kit name, of course.
  --kit-name kit \
  # pass a reference to basecaller and align at the same time
  --ref assembly.fasta \
  --modified-bases "5mCG_5hmCG" \
  > sample_mapped.bam

# ... samtools sort, index, etc.
modkit ..

vzg100 commented 9 months ago

Hi @ArtRand,

Thank you for the quick and thoughtful reply.

My bad if I failed to RTFM, but how do you separate reads by barcode using this approach?

ArtRand commented 9 months ago

Hello @vzg100,

Dorado will handle that as well (barcode docs). Make sure to use the latest version >=0.5.2, before that there was a bug with trimming the modified base tags.

vzg100 commented 9 months ago

Hi @ArtRand,

Again, thank you for the fast response! Based on my understanding of the dorado docs I don't think that workflow would work particularly well for bacterial assemblies. For each barcode we have a different assembly. So we would need to be able to filter reads by barcode before mapping.

Unless there is a way to filter by barcode in the basecaller?

ArtRand commented 9 months ago

Hello @vzg100,

Maybe I'm missing something obvious, but could you partition your reads based on the BC tag?

$ samtools view -bh --tag BC:<barcode> ${bam} -o ${barcode_bam}

All workflows are different - so I understand if you already have something that's working and don't want to change. You could use modkit repair (docs) to replace the MM/ML/MN tags in your alignments.

vzg100 commented 8 months ago

Hi @ArtRand,

Thank you for bringing up the BC tag for samtools - I wasn't aware of that!

I'll go ahead and try that out

ArtRand commented 8 months ago

@vzg100 any luck?

vzg100 commented 8 months ago

Hi @ArtRand,

Thank you for your patience and suggestions - rebasecalling took minute

Unfortunately running the code blow didn't work:

dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v4.3.0 pod5_dir \
    --kit-name ${kit} \
    --reference ${assembly} \
    --modified-bases "5mCG_5hmCG" > sample.bam
samtools view -bh --tag BC:01 sample.bam -o sample_filtered.bam
samtools sort sample_filtered.bam -o sample_sorted_filtered.bam
samtools index sample_sorted_filtered.bam -o  sample_sorted_filtered.bai

What ended up working was demultiplexing with dorado demux and then running the (sorted and indexed) demuxed bam file through modkit. Though this approach involves basecalling twice - first to make the assembly and then second to align the modified bases.

ArtRand commented 8 months ago

Hello @vzg100,

Could you elaborate on what exactly didn't work? Where are you loosing the base modification calls? You should definitely not need to basecall your data twice. At the very least if there is a step where you loose the MM/ML/MN tags you can use modkit repair to project the base modifications back onto the reads.

nanoporetech / modkit