nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado Aligner Producing Unexpected Behavior #902

Closed VBHerrenC closed 3 months ago

VBHerrenC commented 3 months ago

Issue Report

Please describe the issue:

The read length distribution changes dramatically after running dorado aligner: a major new peak appears around 2,000 nt that is absent before alignment.

Steps to reproduce the issue:

We ran each dorado command listed under "Run environment" below. To check the output of the second basecaller command, we converted the BAM to FASTQ with samtools bam2fq dorado_fast_qFilterLow_noTrim.bam > dorado_fast_qFilterLow_noTrim_convert.fastq. Running the converted FASTQ through the same read length analysis script produced exactly the same graph as the output of the first command. We then aligned the BAM with dorado aligner and ran the aligned output through a similar read length analysis script, which produced the second graph with the additional peak.

Run environment:

dorado basecaller ~/packages/dorado-0.7.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_fast@v5.0.0 \
    /home/20240617_1301_MC-115154_FAZ22947_a19f43b9/pod5 \
    --min-qscore 10 \
    --no-trim \
    --emit-fastq \
    --verbose > dorado_fast_qFilterLow_noTrim.fastq
dorado basecaller ~/packages/dorado-0.7.2-linux-x64/models/dna_r10.4.1_e8.2_400bps_fast@v5.0.0 \
    /home/20240617_1301_MC-115154_FAZ22947_a19f43b9/pod5 \
    --min-qscore 10 \
    --no-trim \
    --verbose > dorado_fast_qFilterLow_noTrim.bam

dorado aligner -o dorado_aligner_testing refFasta.fasta dorado_fast_qFilterLow_noTrim.bam

Logs

dorado aligner -o dorado_aligner_testing refFasta.fasta dorado_fast_qFilterLow_noTrim.bam
[2024-06-20 10:40:10.296] [info] Running: "aligner" "-o" "dorado_aligner_testing" "refFasta.fasta" "dorado_fast_qFilterLow_noTrim.bam"
[2024-06-20 10:40:10.296] [info] num input files: 1
[2024-06-20 10:40:10.296] [info] > loading index refFasta.fasta
[2024-06-20 10:40:10.303] [info] processing dorado_fast_qFilterLow_noTrim.bam -> dorado_aligner_testing/dorado_fast_qFilterLow_noTrim.bam
[2024-06-20 10:40:10.976] [info] > starting alignment
[2024-06-20 10:40:43.611] [info] > finished alignment
[2024-06-20 10:40:43.611] [info] > merging temporary BAM files
[2024-06-20 10:41:05.271] [info] > Simplex reads basecalled: 324220
[2024-06-20 10:41:05.271] [info] > total/primary/unmapped 511732/324513/1269

[image: read length distribution graph]

[image: read length distribution graph]

malton-ont commented 3 months ago

Hi @VBHerrenC,

My first guess would be that the aligner is generating a lot of secondary alignments. You could try filtering these out to see whether the read length distribution is closer to what you expect. When you convert to FASTQ, add the filter flag:

samtools bam2fq -F 0x900 <file>.bam > <file>.fastq
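
For reference, 0x900 is the bitwise OR of two FLAG bits defined in the SAM specification: 0x100 (secondary alignment) and 0x800 (supplementary alignment). The -F option drops any record that has at least one of the masked bits set. A minimal Python sketch of that test follows; the helper name keep_record is hypothetical and not part of samtools or dorado.

```python
# FLAG bits as defined in the SAM specification.
SECONDARY = 0x100
SUPPLEMENTARY = 0x800
EXCLUDE_MASK = SECONDARY | SUPPLEMENTARY  # 0x900


def keep_record(flag: int, exclude_mask: int = EXCLUDE_MASK) -> bool:
    """Mimic an `-F <mask>` filter: keep a record only if no masked bit is set.

    `keep_record` is an illustrative helper, not a samtools function.
    """
    return (flag & exclude_mask) == 0


if __name__ == "__main__":
    # 0 = primary forward, 16 = primary reverse, 4 = unmapped,
    # 256 = secondary, 2048 = supplementary, 2064 = supplementary reverse.
    for flag in (0, 16, 4, 256, 2048, 2064):
        print(f"{flag:>5} {hex(flag):>6} {'keep' if keep_record(flag) else 'drop'}")
```

Primary and unmapped records pass the filter, so each read is represented at most once in the converted FASTQ.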

VBHerrenC commented 3 months ago

Hi @malton-ont,

Thanks for the response! This did clear up the issue. It looks like the reference we were using was indexed to where we would expect the middle of the reads to be and this caused the issue. Just for future knowledge, does dorado aligner "create" new reads when they are secondary alignments? I always thought it just added a flag to existing reads but I must be wrong. Thanks for the help!

malton-ont commented 3 months ago

Hi @VBHerrenC,

The output from dorado aligner is consistent with the output from minimap2: secondary and supplementary alignments are stored as entirely separate records in the BAM file, each with its own alignment information and with the FLAG value set to indicate the type of alignment.
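
To make the point above concrete, here is a small Python sketch that tallies SAM records by alignment type and compares the record count with the number of distinct read names. The function names (classify, count_records) and the example records are made up for illustration; they are not dorado output and not the analysis script used in this issue.

```python
from collections import Counter


def classify(flag: int) -> str:
    """Label a SAM record by its FLAG bits (bit values from the SAM spec)."""
    if flag & 0x4:
        return "unmapped"
    if flag & 0x100:
        return "secondary"
    if flag & 0x800:
        return "supplementary"
    return "primary"


def count_records(sam_lines):
    """Return (counts per alignment type, set of distinct read names)."""
    counts = Counter()
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        names.add(fields[0])                    # QNAME
        counts[classify(int(fields[1]))] += 1   # FLAG
    return counts, names


# Hypothetical records: read1 has a primary, a supplementary and a
# secondary record; read2 is unmapped.
EXAMPLE_SAM = [
    "@HD\tVN:1.6",
    "read1\t0\tref\t1\t60\t8M\t*\t0\t0\tACGTACGT\t*",
    "read1\t2048\tref\t100\t60\t4H4M\t*\t0\t0\tACGT\t*",
    "read1\t256\tref\t200\t0\t8M\t*\t0\t0\t*\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\t*",
]

if __name__ == "__main__":
    counts, names = count_records(EXAMPLE_SAM)
    print(dict(counts))
    print(f"{sum(counts.values())} records from {len(names)} reads")
```

On a real file, the same function could be fed the text output of samtools view -h. A length histogram built from every record rather than from one record per read will count some reads more than once.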

VBHerrenC commented 3 months ago

Hi @malton-ont,

Understood, thank you. I'll mark this as closed. Appreciate the clarification!