lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.54k stars 557 forks source link

Different alignment numbers to same reference sequence #360

Open stewalsh04 opened 2 years ago

stewalsh04 commented 2 years ago

I am using BWA-mem in quite an unusual situation, I am aligning small sequences of around 150bp to small reference sequences of variable length. When I give a reference file with multiple similar reference sequences my overall alignment rates increase considerably.

ref1.fasta:

seq1 CAGGCTCTGCTCTTCATAATCATACCTTTGTGACTCAGGATGCTGT

seq2 CAGGCTCTGCTCTTAATATCTGGCCGTCGTATTCCACCTCTGCGACTCATGATGCTGT (100,000 aligned)

seq3 CAGGCTCTGCTCTTCATAATTTCTATCTTGCCCACCCTACTCGACACAGAGCAAAAATCCAACACTCCCAATATTGCCGTGGCTTCGACCTCTTGCTCAGATTTTCTTGTTACCTTTGTGACTCAGGATGCTGT

seq4 CAGGCTCTGCTCTTCATAACCCTCCCTGCGAGTCCTTAAGTCTGACTCGGATCCTTAAACAACCTTTTCTTACCTTTGTGACTCAGGATGCTGT

ref2.fasta:

seq2 CAGGCTCTGCTCTTAATATCTGGCCGTCGTATTCCACCTCTGCGACTCATGATGCTGT (25,000 aligned)

My fastq files align at higher numbers to ref1.fasta than ref2.fasta, but allow a far greater number of deletions and mis-matches with ref1.fasta.

I realize this is not what BWA-mem was really designed to do, so I am not so much as looking for a bug to be fixed, but would be really grateful if you could help explain this activity, could it be something to do with the initial seeding of the alignment?

Many thanks, Steve W.