Closed marcelm closed 10 months ago
Welcome back from vacation :) Great that you found a bug and I like most of your solution! I think I like the linear projection too.
For example, if a 150nt read has a perfect match to one place and one SNP to another, then there is a still preference reflected in the MAPQ score to the perfect location 60 * (300-290)/300 = 2
if I have understood correctly. But what happens to reads with a perfect mapping as primary but with a secondary containing a softclipp?
For example, if a 150nt read has a perfect match to one place and one SNP to another, then there is a still preference reflected in the MAPQ score to the perfect location
60 * (300-290)/300 = 2
if I have understood correctly.
Yes. For longer reads, though, the situation would become as described in my example (with the score 595 vs 600), so you would get MAPQ zero again.
Hm, maybe dividing by the best score isn’t such a good idea after all? Consider the same situation for a 1000 nt read: If it has one location where it maps perfectly and one where it maps with one mismatch, then I would be quite confident that the location without the mismatch is the correct one. And my (intuitive) level of confidence would be quite similar even when the read is significantly shorter.
But also, this changes when there are more mismatches: If one location gives 9 and another 10 mismatches, I’d say the difference doesn’t matter that much and both are almost equally good. (Unless perhaps if the 9 mismatches were shared between the two, but taking that into account goes too far here.)
But what happens to reads with a perfect mapping as primary but with a secondary containing a softclipp?
The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?
Hm, maybe dividing by the best score isn’t such a good idea after all?
I am not sure what would be a good MAPQ score calculation. I just found this biostars post but haven't studied it closely. Looks a bit like magic and claims to come from this paper.
The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?
Yes.
Getting back to this as I’m working on #359, where I need to touch the same code: Can this PR be merged? My feeling is that the PR is maybe not the best thinkable solution, but an improvement over the status quo.
Can this PR be merged?
Yes. However, as discussed over email, it is probably good to add the case: if mapq == 0 and best_score > second_best_score then MAPQ=1
to avoid the linear interpolation assigning MAPQ 0 when there is an actual score difference.
Is it ok If I use the following alternative (pushed in commit acac70beb70cabf9bb2b3474270bd3bceeebd38f)? The idea is to round up instead of down. It has the same effect, that is, we get MAPQ=0 only if second_best == best:
>>> def f(best, second_best):
... return (60 * (best - second_best) + best - 1) // best
...
>>> f(100, 0)
60
>>> f(100, 50)
30
>>> f(100, 100)
0
>>> f(10000, 9999)
1
I checked the distribution of MAPQ values on the drosophila test dataset (the one on which the CI baseline comparison script runs).
Great, I agree on your formula which is better than what I suggested. About the plot: seems like a sound distribution assuming there are repeats with 1,2,3,... mutations (and indels) to the best location.
approved to merge.
This contains two changes: A bugfix and a change to the way MAPQ values are computed for single-end mapping.
First, the fix: There was a bug in MAPQ computation due to initializing
best_score
to -1000. In the first iteration,min_mapq_diff
is updated in this way:With
best_score=-1000
, this is justalignment.score + 1000
, so most of the time,min_mapq_diff
ends up being something like 1600 or so for typical read lengths. This is then clamped to 60 in the output, which is probably the reason why we see MAPQ=60 most of the time.Even with the fix, I didn’t understand what the code was doing. I think the intention is to use the second best score and relate it to the best score to compute the mapping quality, but it appears this isn’t implemented that way. So the second change is to compute MAPQ values differently:
best_score
andsecond_best_score
.second_best_score=0
andsecond_best_score=best_score
, where anything in between is interpolated:To me, the two extreme cases (MAPQ=0 and MAPQ=60) make sense.
I’m not so sure about the linear interpolation, but it also seems to give the results one would want in some cases. For example, if
best_score
is 600 andsecond_best_score
is 595, the formula gives60 * (600-595)/600 = 0.5
, which is rounded down and gives mapq=0, so if the scores are very close, the read is still considered a multimapper.This is intended to address the single-end part of #25.