ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Reads get MAPQ=60 although they are multimappers #293

Closed marcelm closed 10 months ago

marcelm commented 1 year ago

Apparently some (possibly many) reads get MAPQ 60 although they actually are multimappers that should get MAPQ 0.

I noticed this when running the samdiff.py script to compare two strobealign commits. In one commit, I had changed the randstrobe hash to an asymmetric one in order to avoid getting the spurious hits on the reverse complement. The exact change doesn’t matter so much because I now realize I have seen the following earlier when tweaking the mapping parameters or the algorithm.

The samdiff script often outputs changes that look like this:

simulated.61
           MAPQ: 60
          score: 120
             NM: 0
          CIGAR: 50M
            ref: 2L -> 2R
            pos: 1680535 -> 19717413

So in one version of the program, the read simulated.61 is mapped to 2L:1680535 and in the other, it is mapped to 2R:19717413. In both cases, however, the alignment score is 120, which means that the loci are actually equally good (and in this case, they are even identical). That is, since the read maps equally well to both locations, MAPQ should be 0.

One explanation could be that the program simply doesn’t find the respective other locus, but at least for the above read, both versions find both loci at the same score. The set of produced NAMs is essentially the same if one looks only at the highest-scoring ones, they just are found in a different order.

Possibly this is a regression introduced by one of the changes I made; I will need to check this.

See also #25

ksahlin commented 1 year ago

Great catch, these are the kinds of examples we want to understand how we could improve MAPQ values.

marcelm commented 12 months ago

Possibly this is a regression introduced by one of the changes I made; I will need to check this.

Just a note: Version 0.7.1 has the same behavior, so this is not a regression.

ksahlin commented 12 months ago

One (the only?) explanation that some reads get 60 instead of 0 MAPQ is that strobealign it exits early if it finds a perfect alignment without any edits and reports the alignment (with MAPQ 60). To report 0 we would need to look at at least one more alignment site, just like BWA-MEM does.

This should not result in a big additional computational cost as we do hamming distance for these alignments (although it will increases SSW calls a bit for the reads where the second site contains indels)

ksahlin commented 12 months ago

As discussed offline, this may be a single-end alignment issue only.

marcelm commented 11 months ago

I have done some experiments and can confirm that this is not an issue for paired-end data.