lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.54k stars 556 forks source link

Multimapping and reference sequence junctions #379

Open s-a-nersisyan opened 2 years ago

s-a-nersisyan commented 2 years ago

Hello,

I've experienced an issue which could be IMHO treated as a bug. As documented, bwa concatenates all sequences from the reference fasta file, and some reads could map to the junction of two reference sequences – in this case the read is marked as unmapped (SAM flag = 4), though it still reports the position.

The problem arises with multiple mappers. As documented, bwa randomly selects one of the multiple alignments as the “primary” one, leaving the others in XA tag. The trick is that sometimes the “primary” random chosen position is exactly the undesired junction, and at the same time some good alignments are marked as secondary and stored in XA. In this case we think that the read is unmapped (flag = 4), though it actually could be mapped (sometimes even without errors). Moreover, if we have the same molecule present multiple times in fastq file, some of the reads could be mapped, while other (exactly same) reads will be unmapped – this is a random process. The only solution I invented to tackle this problem is to add poly-N sequences to the ends of each sequence in fasta file.

I understand that this bug is very-very unlikely in the most of real-world setups. However, I massively experienced this while trying to align short reads to Y RNA sequences: there are lot of repeats, and a lot of reads mapped to the Y RNA molecule junctions.