lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.54k stars 556 forks source link

bwa aln returning less alignments for higher number of mismatches #358

Open Jfortin1 opened 2 years ago

Jfortin1 commented 2 years ago

While aligning a 19nt-long sequence to the human reference genome (hg38) using bwa aln, I get the unexpected results to get less reported alignments when using 6 mismatches instead of 5.

My toy example fastq file (toyExample.fastq):

@AGCATGGGGAGCTCCCGGG
AGCATGGGGAGCTCCCGGG
+AGCATGGGGAGCTCCCGGG
~~~~~~~~~~~~~~~~~~~

My code for 6 mismatches:

bwa aln -n 6 -k 0 -l 100 -o 0 -N myBWAIndex toyExample.fastq > out_n6.sai

I set the seed length to a large number to ignore seed-specific constraints.

Version: 0.7.17-r1198-dirty

mikiger commented 2 years ago

I also just run into that issue. Happens to me with many different short sequences of lengths 25-30 against hg19 reference, where I get less hits when running with 5 mismatches than with 4 mismatches (with less mismatches it behaves as expected).

Here is an example: @read1 GTGGGCGGGCAGGTGCAGGTGGGT + ########################

With 4 mismatches I get 501 hits in X1: (bwa aln -N -n4 hg19.fa read1.fastq): 19:7591367-7591594/16,24/+ 0 19 7591571 0 24M * 0 0 GTGGGCGGGCAGGTGCAGGTGGGT ######################## XT:A:U NM:i:0 X0:i:1 X1:i:501

With 5 mismatches I get only 39 hits in X1: (bwa aln -N -n4 hg19.fa read1.fastq): 19:7591367-7591594/16,24/+ 0 19 7591571 7 24M * 0 0 GTGGGCGGGCAGGTGCAGGTGGGT ######################## XT:A:U NM:i:0 X0:i:1 X1:i:39

bwa version: 0.7.17-r1188

npsonis commented 1 year ago

-n 5 means that you will be able to map 95% of your reads successfully, assuming a 2% error rate. -n 6 means that you will be able to map 94% of your reads successfully, assuming a 2% error rate. That's why you get less hits. It is not # of mismatches