ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Duplicate alignments lead to spurious multimappers #334

Open marcelm opened 10 months ago

marcelm commented 10 months ago

This happens when mapping single-end reads (I have not tested paired ends).

Sometimes, the list of NAMs contains different NAMs that lead to the same alignment. This is visible when running strobealign with -N (to output secondary alignments):

$ build/strobealign --eqx -N 4 -t 8 drosophila/ref.fasta drosophila/reads.1.fastq.gz | grep 'SRR6055476.10\b'
[...]
SRR6055476.10   16      NT_033779.5     14293349        0       42=2X31=1X72=3S *       0       0       TTTTAGCTGCTCGTAAACCGAAATCTCCCAAGGAGATGCAAACATTCTGCCAGATGATGGAAAGATTGGGGGAGATGTGAATTGAGTGTCATTGACAACAGAGTGCTTCATTTGATGGCTCCTGGGGCAAACTTCCCATGGCAAATGTTTN JFFFJJJJJJJJF<F<7FA<7<AFAJA<F<JJJFFAFFAFF<FJJJJJJFAAJFFJAJA7-7-FFJJFFA<-77--JJ<<-FJJJJJFAJJFJJA-JJJJJJJFJFFF-JJJJJFAJJJJJJJJJJJJJJAJJJJJFJJJJJJJJJFF<<#     NM:i:3  AS:i:276
SRR6055476.10   272     NT_033779.5     14293349        255     42=2X31=1X72=3S *       0       0       *       *  NM:i:3   AS:i:276
SRR6055476.10   272     NT_033779.5     14293349        255     42=2X31=1X72=3S *       0       0       *       *  NM:i:3   AS:i:276
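One way to spot such cases across a whole SAM file is to group records by read name, strand, reference, position and CIGAR string and to report every group that occurs more than once. The following is a minimal sketch and not part of strobealign: the function names are made up for illustration, and the example records are the ones above with SEQ and QUAL replaced by `*` for brevity.

```python
from collections import Counter


def alignment_key(sam_line):
    """Return (qname, is_reverse, rname, pos, cigar) for a mapped SAM record.

    Returns None for header lines, empty lines and unmapped records.
    """
    line = sam_line.rstrip("\n")
    if not line or line.startswith("@"):
        return None
    fields = line.split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # 0x4: read is unmapped
        return None
    is_reverse = bool(flag & 0x10)  # 0x10: read is reverse-complemented
    return (fields[0], is_reverse, fields[2], int(fields[3]), fields[5])


def find_duplicate_alignments(sam_lines):
    """Map each alignment key that occurs more than once to its count."""
    counts = Counter()
    for line in sam_lines:
        key = alignment_key(line)
        if key is not None:
            counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Records from this issue, tab-separated, SEQ/QUAL shortened to "*".
records = [
    "SRR6055476.10\t16\tNT_033779.5\t14293349\t0\t42=2X31=1X72=3S\t*\t0\t0\t*\t*\tNM:i:3\tAS:i:276",
    "SRR6055476.10\t272\tNT_033779.5\t14293349\t255\t42=2X31=1X72=3S\t*\t0\t0\t*\t*\tNM:i:3\tAS:i:276",
    "SRR6055476.10\t272\tNT_033779.5\t14293349\t255\t42=2X31=1X72=3S\t*\t0\t0\t*\t*\tNM:i:3\tAS:i:276",
]
duplicates = find_duplicate_alignments(records)
```

Here the primary record (flag 16) and the two secondary records (flag 272 = 256 + 16) all collapse to the same key, so the read is reported with a count of 3.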

Reporting the same alignment more than once is a problem in itself. In addition, because there appears to be more than one equally good alignment, the primary alignment gets a mapping quality of zero, as if the read were a multimapper, although it isn't.

Also, multiple (gapped) alignments are computed although one would be enough.
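A possible direction, sketched here in Python rather than in strobealign's C++: collapse alignments that are identical in reference, start position, strand and CIGAR before deciding whether the read is a multimapper. The `Alignment` class, its fields and both functions are hypothetical and only illustrate the idea; they do not mirror strobealign's actual data structures or its MAPQ computation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical stand-in for one extended (gapped) alignment."""
    ref_id: int
    ref_start: int
    is_revcomp: bool
    cigar: str
    score: int


def deduplicate(alignments):
    """Keep the first of each group of identical alignments, preserving order."""
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln.ref_id, aln.ref_start, aln.is_revcomp, aln.cigar)
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


def is_multimapper(alignments):
    """True only if at least two distinct alignments share the best score."""
    unique = deduplicate(alignments)
    if not unique:
        return False
    best = max(aln.score for aln in unique)
    return sum(1 for aln in unique if aln.score == best) > 1


# Three NAMs that all lead to the same alignment, as in this issue.
same = Alignment(ref_id=0, ref_start=14293349, is_revcomp=True,
                 cigar="42=2X31=1X72=3S", score=276)
from_nams = [same, same, same]
```

With this check, `is_multimapper(from_nams)` is false, so the primary alignment would no longer be treated as ambiguous. Note that this only fixes the reporting side: the redundant gapped alignments have already been computed by the time they are compared, so avoiding that extra work would need a check before extension.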

ksahlin commented 10 months ago

Nice catch! Do you think this happens because FW and RC matches get collapsed into different NAMs (collisions due to canonically represented seeds)? I guess this would lead to overlapping NAMs.

marcelm commented 7 months ago

There is some extra code in the paired-end alignment function that takes care of duplicate alignments, so this issue is only relevant for single-end reads.