Assign single-end multimappers randomly to one of the candidate mapping locations

ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers

MIT License

128 stars 16 forks source link

Assign single-end multimappers randomly to one of the candidate mapping locations #360

Closed marcelm closed 7 months ago

marcelm commented 7 months ago

As before, we iterate over NAMs, compute alignments and keep track of the currently best one.

This PR also handles alignments that have the same score as the currently best one. When found, we use reservoir sampling to uniformly pick one of them.

This does not use the NAM score, but the alignment scores, which means we need to compute the alignments. We compute the alignments anyway, but it may slow things down in a certain case: I had to restrict the optimization that allowed us to stop computing alignments when we have found two exact matches (i.e. edit distance zero). Now we need to compute all exact matches until we find a nonexact one because we need to pick among the exact ones.

I will need to measure how much this slows things down.

See #359

marcelm commented 7 months ago

The CI "compare" job fails due to #361. That runs on paired-end reads only anyway, so here are the results for running tests/compare-baseline.sh -s locally (-s switches it to single-end mode):

Before/after comparisons

       9643 reads were unmapped before and after
          0 reads became mapped
          0 reads became unmapped
      80139 reads were mapped to same locus before and after
*     10218 reads were multimapper before and after, same alignment score (AS)
          0 reads were multimapper before and after, better alignment score (AS)
          0 reads were multimapper before and after, worse alignment score (AS)
          0 reads changed in another way
     100000 total reads

That is exactly as expected: Many reads that are multimappers now get assigned to a different location, but all of them still have the same alignment score as before.

ksahlin commented 7 months ago

As discussed over email:

I would prefer to solve this at the NAM stage. The goal is alignment to many similar genomes (which btw we should at a standard test instance with alignment to 10 or 50 E. coli genomes), and this would cause a drastic slowdown in such cases.
One way to address the issue is to shuffle all top scoring NAMs, but keep all other functionality as is. I predict this would solve the large majority of cases without offering a noticeable slowdown. The cases it wouldn't work would be: we get three NAMs that have (NAM-) scores 100, 100, 90, but the alignments all have the same score. Then just shuffling the top-scoring NAMs gives us the first or second alignment, but not the third. I am hoping such cases are rare.

marcelm commented 7 months ago

Superseded by #364.