Closed marcelm closed 7 months ago
The CI "compare" job fails due to #361. That runs on paired-end reads only anyway, so here are the results for running tests/compare-baseline.sh -s
locally (-s
switches it to single-end mode):
Before/after comparisons
9643 reads were unmapped before and after
0 reads became mapped
0 reads became unmapped
80139 reads were mapped to same locus before and after
* 10218 reads were multimapper before and after, same alignment score (AS)
0 reads were multimapper before and after, better alignment score (AS)
0 reads were multimapper before and after, worse alignment score (AS)
0 reads changed in another way
100000 total reads
That is exactly as expected: Many reads that are multimappers now get assigned to a different location, but all of them still have the same alignment score as before.
As discussed over email:
I would prefer to solve this at the NAM stage. The goal is alignment to many similar genomes (which btw we should at a standard test instance with alignment to 10 or 50 E. coli genomes), and this would cause a drastic slowdown in such cases.
One way to address the issue is to shuffle all top scoring NAMs, but keep all other functionality as is. I predict this would solve the large majority of cases without offering a noticeable slowdown. The cases it wouldn't work would be: we get three NAMs that have (NAM-) scores 100, 100, 90, but the alignments all have the same score. Then just shuffling the top-scoring NAMs gives us the first or second alignment, but not the third. I am hoping such cases are rare.
Superseded by #364.
As before, we iterate over NAMs, compute alignments and keep track of the currently best one.
This PR also handles alignments that have the same score as the currently best one. When found, we use reservoir sampling to uniformly pick one of them.
This does not use the NAM score, but the alignment scores, which means we need to compute the alignments. We compute the alignments anyway, but it may slow things down in a certain case: I had to restrict the optimization that allowed us to stop computing alignments when we have found two exact matches (i.e. edit distance zero). Now we need to compute all exact matches until we find a nonexact one because we need to pick among the exact ones.
I will need to measure how much this slows things down.
See #359