ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Avoid MAPQ=60 for single-end multimappers #327

Closed marcelm closed 10 months ago

marcelm commented 10 months ago

by ensuring that we compute at least two alignments even if the first that is found has edit distance 0.

Closes #293

Because I still need to test whether this actually fixes #293, I’m marking this as draft.

ksahlin commented 10 months ago

approved.

marcelm commented 10 months ago

(Note that I had a typo/logic error in the PR and pushed a fix.)

I tried to do some measurements.

It’s a bit hard to inspect multimappers only, so I looked at incorrectly mapped reads instead, which should all be multimappers: Assuming a couple of things about how the read mapper works, these should have at least two roughly equivalent mapping locations (from the read mapper’s point of view): The correct one and the one the read mapper actually outputs. So they should get MAPQ=0.

So issue #293 is essentially that we get many incorrectly mapped reads that are assigned MAPQ>0, which should actually have been assigned MAPQ=0.

I generated a test dataset with know mapping locations by picking every tenth 100-mer from the Drosophila reference (about 14.4 million reads). Then I could count incorrectly mapped reads that have MAPQ>0 and MAPQ=0, respectively. For good measure, I also counted the correctly mapped reads with MAPQ>0 and MAPQ=0. The results are as follows:

Read category main (40748fe) this PR (097807b5cf9c4f8fb00c2e1d20f29ac8be776da4) BWA-MEM
Incorrect and MAPQ>0 (bad) 9.87% 0.54% 0%
Incorrect and MAPQ=0 (good) 0.03% 9.36% 9.94%
Correct and MAPQ>0 90.08% 87.20% 86.63%
Correct and MAPQ=0 0% 2.88% 3.41%

I think the relevant numbers are the two bold ones: The "bad" alignments are much closer to zero and the number of "good" ones has gone up. Also, all four percentages are now closer to BWA-MEM.

Note that the percentage of "correct and MAPQ>0" goes down (and "correct and MAPQ=0" up accordingly). I assume these are mostly multimappers that strobealign by chance placed into the "correct" position (the one from which the k-mer was sampled). They previously got MAPQ=60, but since they are multimappers, they should get MAPQ=0, which they now do.

ksahlin commented 10 months ago

great improvement, approved!

marcelm commented 10 months ago

I forgot to add a changelog entry in this PR and took the liberty of pushing that directly to main without a PR.

marcelm commented 10 months ago

I realized I hadn’t check the performance impact of this change, so here are some measurements on 100 bp reads.

These measurements were quite noisy, especially for rye.

ksahlin commented 10 months ago

Ok, worth it.

marcelm commented 10 months ago

Agreed!