ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
142 stars 17 forks source link

Rework single-end MAPQ computation #328

Closed marcelm closed 10 months ago

marcelm commented 1 year ago

This contains two changes: A bugfix and a change to the way MAPQ values are computed for single-end mapping.

First, the fix: There was a bug in MAPQ computation due to initializing best_score to -1000. In the first iteration, min_mapq_diff is updated in this way:

min_mapq_diff = std::max(0, alignment.score - best_score);

With best_score=-1000, this is just alignment.score + 1000, so most of the time, min_mapq_diff ends up being something like 1600 or so for typical read lengths. This is then clamped to 60 in the output, which is probably the reason why we see MAPQ=60 most of the time.

Even with the fix, I didn’t understand what the code was doing. I think the intention is to use the second best score and relate it to the best score to compute the mapping quality, but it appears this isn’t implemented that way. So the second change is to compute MAPQ values differently:

To me, the two extreme cases (MAPQ=0 and MAPQ=60) make sense.

I’m not so sure about the linear interpolation, but it also seems to give the results one would want in some cases. For example, if best_score is 600 and second_best_score is 595, the formula gives 60 * (600-595)/600 = 0.5, which is rounded down and gives mapq=0, so if the scores are very close, the read is still considered a multimapper.

This is intended to address the single-end part of #25.

ksahlin commented 1 year ago

Welcome back from vacation :) Great that you found a bug and I like most of your solution! I think I like the linear projection too.

For example, if a 150nt read has a perfect match to one place and one SNP to another, then there is a still preference reflected in the MAPQ score to the perfect location 60 * (300-290)/300 = 2 if I have understood correctly. But what happens to reads with a perfect mapping as primary but with a secondary containing a softclipp?

marcelm commented 1 year ago

For example, if a 150nt read has a perfect match to one place and one SNP to another, then there is a still preference reflected in the MAPQ score to the perfect location 60 * (300-290)/300 = 2 if I have understood correctly.

Yes. For longer reads, though, the situation would become as described in my example (with the score 595 vs 600), so you would get MAPQ zero again.

Hm, maybe dividing by the best score isn’t such a good idea after all? Consider the same situation for a 1000 nt read: If it has one location where it maps perfectly and one where it maps with one mismatch, then I would be quite confident that the location without the mismatch is the correct one. And my (intuitive) level of confidence would be quite similar even when the read is significantly shorter.

But also, this changes when there are more mismatches: If one location gives 9 and another 10 mismatches, I’d say the difference doesn’t matter that much and both are almost equally good. (Unless perhaps if the 9 mismatches were shared between the two, but taking that into account goes too far here.)

But what happens to reads with a perfect mapping as primary but with a secondary containing a softclipp?

The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?

ksahlin commented 1 year ago

Hm, maybe dividing by the best score isn’t such a good idea after all?

I am not sure what would be a good MAPQ score calculation. I just found this biostars post but haven't studied it closely. Looks a bit like magic and claims to come from this paper.

The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?

Yes.

marcelm commented 10 months ago

Getting back to this as I’m working on #359, where I need to touch the same code: Can this PR be merged? My feeling is that the PR is maybe not the best thinkable solution, but an improvement over the status quo.

ksahlin commented 10 months ago

Can this PR be merged?

Yes. However, as discussed over email, it is probably good to add the case: if mapq == 0 and best_score > second_best_score then MAPQ=1 to avoid the linear interpolation assigning MAPQ 0 when there is an actual score difference.

marcelm commented 10 months ago

Is it ok If I use the following alternative (pushed in commit acac70beb70cabf9bb2b3474270bd3bceeebd38f)? The idea is to round up instead of down. It has the same effect, that is, we get MAPQ=0 only if second_best == best:

>>> def f(best, second_best):
...  return (60 * (best - second_best) + best - 1) // best
... 
>>> f(100, 0)
60
>>> f(100, 50)
30
>>> f(100, 100)
0
>>> f(10000, 9999)
1

I checked the distribution of MAPQ values on the drosophila test dataset (the one on which the CI baseline comparison script runs).

ksahlin commented 10 months ago

Great, I agree on your formula which is better than what I suggested. About the plot: seems like a sound distribution assuming there are repeats with 1,2,3,... mutations (and indels) to the best location.

approved to merge.