lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.54k stars 557 forks source link

Bug: bwa aln misreports alignment contigs #356

Open mattsoup opened 2 years ago

mattsoup commented 2 years ago

Given the following reference:

>seq1
CTTTCCAATGCAAAGCG
>seq2
CGTAACGTGCTCCGAAC
>seq3
GTGGAAACCCAAGCTAG

and the following unaligned bam file:

@HD VN:1.6  SO:queryname
AAANYG5M5:1:1101:10013:31328    4   *   0   0   *   *   0   0   TGTGGAAACCCAAGCTAG  CCCCCC;CCCCCCCCCCC

running bwa aln and samse:

bwa aln -b -R 1 small_ref.fa small_unaligned.bam > test.sai
bwa samse small_ref.fa test.sai small_unaligned.bam > test.sam

produces the following alignment:

AAANYG5M5:1:1101:10013:31328    20  seq2    17  37  18M *   0   0   TGTGGAAACCCAAGCTAG  CCCCCCCCCCC;CCCCCC  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:0C17

So it reports the alignment to seq2, even though the sequence is identical to seq3 except for the extra 'T' at the beginning. Changing the reference to this order:

>seq2
CGTAACGTGCTCCGAAC
>seq1
CTTTCCAATGCAAAGCG
>seq3
GTGGAAACCCAAGCTAG

results in reporting the alignment to seq1 instead. Removing the preceding 'T' from the beginning of the query sequence results in the expected alignment to seq3.

It seems the extra base at the beginning of the query somehow results in reporting the alignment as whichever sequence is before the correct one in the reference fasta.

mattsoup commented 2 years ago

Seems to be related to #222.