nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Duplicated alignments with Dorado aligner #587

Closed jamez-eh closed 9 months ago

jamez-eh commented 10 months ago

Hello,

I am testing dorado aligner to carry methylation tags across between BAMs when realigning mod BAMs after basecalling. I am using Dorado 0.3.4:

/gsc/software/linux-x86_64-ubuntu20/dorado-0.3.4/bin/dorado aligner \
    --max-reads 10000000 \
    /projects/alignment_references/9606/hg19a/genome/minimap2-2.15-map-ont/hg19a_map-ont.mmi \
    $ORIGINAL_BAM > ${SCRATCH_SPACE}/new.bam

/gsc/software/linux-x86_64-centos7/samtools-1.19/bin/samtools sort -n ${ORIGINAL_BAM} > ${SCRATCH_SPACE}/original_sorted.bam
/gsc/software/linux-x86_64-centos7/samtools-1.19/bin/samtools sort -n ${SCRATCH_SPACE}/new.bam > ${SCRATCH_SPACE}/new_sorted.bam

/gsc/software/linux-x86_64-centos7/samtools-1.19/bin/samtools view -F 2304 ${SCRATCH_SPACE}/new_sorted.bam | head -n 20 > new_head.txt
/gsc/software/linux-x86_64-centos7/samtools-1.19/bin/samtools view -F 2304 ${SCRATCH_SPACE}/original_sorted.bam | head -n 20 > original_head.txt

The original bam was aligned with /projects/alignment_references/9606/hg19a/genome/minimap2-2.15-map-ont/hg19a_map-ont.mmi and I am attempting to reproduce it. It successfully transfers over the modification tags.

In the heads of these BAMs, filtered for primary alignments, I see the same alignments, but also duplicates and reorientations:

Original:

0000c0b2-7b1b-4107-abd5-2e3e666c840f    16    2    175086954    60
00b0a605-63b8-4c16-8b31-46b737d3e6d3    0    15    53816457    60
00b88654-2714-4829-b3ba-1793536058cf    0    7    66988816    60
00ba03a6-c885-4b81-a1f0-2725500bcfbd    0    4    26711082    60
00bc722a-1d39-4126-91ee-1968ffb92e09    16    20    60667376    48
00bdfb28-7a0c-4205-b6fb-7771403126fc    16    8    43545860    60
00bf33f3-137b-4304-90ed-92e7da9c25ed    16    17    8746959    60
00c08cb5-838d-43c7-beab-d8e6f60a465f    16    7    78130691    60
00c3212c-50a6-4d00-935d-a78e3421cc3d    16    15    101425749    60
00c6106f-2aa3-46af-ac8e-f75267b2fc32    16    X    137070941    60 

Realigned:

0000c0b2-7b1b-4107-abd5-2e3e666c840f    0    2    175086954    60
00b0a605-63b8-4c16-8b31-46b737d3e6d3    0    15    53816457    60
00b88654-2714-4829-b3ba-1793536058cf    0    7    66988816    60
00ba03a6-c885-4b81-a1f0-2725500bcfbd    0    4    26711082    60
00bc722a-1d39-4126-91ee-1968ffb92e09    0    20    60667376    48
00bdfb28-7a0c-4205-b6fb-7771403126fc    0    8    43545860    60
00bf33f3-137b-4304-90ed-92e7da9c25ed    0    17    8746959    60   
00bf33f3-137b-4304-90ed-92e7da9c25ed    16    17    8746959    60    
00bf33f3-137b-4304-90ed-92e7da9c25ed    0    17    8746959    60   
00bf33f3-137b-4304-90ed-92e7da9c25ed    16    17    8746959    60  
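To put a number on the duplication rather than reading it off the head of the file, one option is to count read names that occur more than once among the primary records. The sketch below is only an illustration: `dup_primaries` is a hypothetical helper name, and it assumes tab-separated SAM records without header lines on stdin (as produced by `samtools view`).

```shell
# Hypothetical helper: reads SAM records (no header) on stdin and prints
# "<count> <read_name>" for every read name with more than one primary record.
dup_primaries() {
    awk -F'\t' '
        # Field 2 is the SAM FLAG. Count the record only if neither the
        # secondary bit (256) nor the supplementary bit (2048) is set.
        int($2 / 256) % 2 == 0 && int($2 / 2048) % 2 == 0 { n[$1]++ }
        END { for (r in n) if (n[r] > 1) print n[r], r }
    ' | sort -k2,2
}

# Example usage (assumes samtools is on PATH):
#   samtools view "${SCRATCH_SPACE}/new_sorted.bam" | dup_primaries
```

Running the same check on the original BAM would show whether the duplicate primaries were already present before realignment.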

Do you have any idea what is happening here, and any advice on how to fix this?

Thank you,

James

tijyojwad commented 10 months ago

Was the original alignment done with minimap2 or with dorado aligner as well? If it was minimap2 what options were used for that run?

My initial guess is that if the original BAM had some secondary/supplementary alignments, dorado aligner is treating each record as a new read and re-aligning all of those thereby creating duplicate primary/secondary/supplementary alignments for each of them. Have you tried filtering only the primary alignments from the original BAM and aligning only those?
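For reference, the `-F 2304` filter used earlier in the thread excludes records with either the secondary bit (256) or the supplementary bit (2048) set, since 2304 = 256 + 2048. A minimal sketch of that test, and of how the same filter could be applied before realigning, follows. `is_primary`, `REFERENCE_MMI`, and `primary_only.bam` are hypothetical names, and the commented commands assume `samtools` and `dorado` are on PATH.

```shell
# Hypothetical helper: succeeds if a SAM FLAG value describes a primary record,
# i.e. neither the secondary (256) nor the supplementary (2048) bit is set.
is_primary() {
    [ $(( $1 & 2304 )) -eq 0 ]
}

is_primary 16   && echo "16: primary (reverse strand)"
is_primary 2064 || echo "2064: supplementary, would be dropped"

# Sketch of filtering before realignment (not run here; paths are placeholders):
#   samtools view -b -F 2304 "$ORIGINAL_BAM" > primary_only.bam
#   dorado aligner "$REFERENCE_MMI" primary_only.bam > new.bam
```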

jamez-eh commented 9 months ago

The original was created by dorado when basecalling. I am running with all default settings, which are the same as when basecalling. I have not tried filtering the BAM for primary alignments, but I also suspect that might be what is occurring here.

tijyojwad commented 9 months ago

Gotcha - unfortunately dorado isn't designed to filter out duplicate records if the input data has duplicates. My suggestion would be to remove duplicate entries before re-aligning the BAM files.
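One possible way to remove the duplicate entries is sketched below. It is not an official dorado workflow. It assumes that keeping the first primary record per read name is acceptable, meaning that any repeated records for a read carry the same sequence and modification tags. `dedup_by_name` and `deduplicated.bam` are hypothetical names.

```shell
# Hypothetical helper: reads SAM text (header included) on stdin, drops
# secondary/supplementary records, and keeps one record per read name.
dedup_by_name() {
    awk -F'\t' '
        /^@/ { print; next }                               # header lines pass through
        int($2 / 256) % 2 || int($2 / 2048) % 2 { next }   # skip secondary/supplementary
        !seen[$1]++                                        # first primary record per name
    '
}

# Example usage (assumes samtools is on PATH; output name is a placeholder):
#   samtools view -h "$ORIGINAL_BAM" | dedup_by_name \
#       | samtools view -b -o deduplicated.bam -
```

The resulting BAM can then be passed to `dorado aligner` in place of the original input.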