Closed marcelm closed 10 months ago
Great catch, these are the kinds of examples we want to understand how we could improve MAPQ values.
Possibly this is a regression introduced by one of the changes I made; I will need to check this.
Just a note: Version 0.7.1 has the same behavior, so this is not a regression.
One (the only?) explanation that some reads get 60 instead of 0 MAPQ is that strobealign it exits early if it finds a perfect alignment without any edits and reports the alignment (with MAPQ 60). To report 0 we would need to look at at least one more alignment site, just like BWA-MEM does.
This should not result in a big additional computational cost as we do hamming distance for these alignments (although it will increases SSW calls a bit for the reads where the second site contains indels)
As discussed offline, this may be a single-end alignment issue only.
I have done some experiments and can confirm that this is not an issue for paired-end data.
Apparently some (possibly many) reads get MAPQ 60 although they actually are multimappers that should get MAPQ 0.
I noticed this when running the
samdiff.py
script to compare two strobealign commits. In one commit, I had changed the randstrobe hash to an asymmetric one in order to avoid getting the spurious hits on the reverse complement. The exact change doesn’t matter so much because I now realize I have seen the following earlier when tweaking the mapping parameters or the algorithm.The samdiff script often outputs changes that look like this:
So in one version of the program, the read
simulated.61
is mapped to 2L:1680535 and in the other, it is mapped to2R:19717413
. In both cases, however, the alignment score is 120, which means that the loci are actually equally good (and in this case, they are even identical). That is, since the read maps equally well to both locations, MAPQ should be 0.One explanation could be that the program simply doesn’t find the respective other locus, but at least for the above read, both versions find both loci at the same score. The set of produced NAMs is essentially the same if one looks only at the highest-scoring ones, they just are found in a different order.
Possibly this is a regression introduced by one of the changes I made; I will need to check this.
See also #25