ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Assign single-end multimappers randomly to one of the candidate mapping locations #360

Closed: marcelm closed 7 months ago

marcelm commented 7 months ago

As before, we iterate over NAMs, compute alignments and keep track of the currently best one.

This PR also handles alignments that have the same score as the currently best one. When such a tie is found, we use reservoir sampling to pick one of the tied alignments uniformly at random.
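To illustrate the idea, here is a minimal sketch of reservoir sampling with a reservoir of size one over tied best scores. It is not the code from this PR: the `Alignment` struct, its fields, and the function name `pick_best_uniformly` are hypothetical stand-ins chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for an alignment record (not strobealign's actual type).
struct Alignment {
    int ref_start;  // illustrative mapping position
    int score;      // alignment score (AS)
};

// Scan the candidates once and keep the best-scoring alignment.
// Among alignments tied for the best score, each one ends up being
// returned with equal probability.
template <typename Rng>
Alignment pick_best_uniformly(const std::vector<Alignment>& candidates, Rng& rng) {
    assert(!candidates.empty());
    Alignment best = candidates.front();
    std::size_t n_ties = 1;  // alignments seen so far that have best.score
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const Alignment& a = candidates[i];
        if (a.score > best.score) {
            // Strictly better: start a new group of ties.
            best = a;
            n_ties = 1;
        } else if (a.score == best.score) {
            // Tie: replace the current pick with probability 1/n_ties.
            ++n_ties;
            std::uniform_int_distribution<std::size_t> dist(0, n_ties - 1);
            if (dist(rng) == 0) {
                best = a;
            }
        }
    }
    return best;
}
```

The point of this scheme is that the tied alignments never need to be stored: only the current pick and a tie counter are kept while iterating. Called with, for example, a `std::mt19937` generator, the function returns one of the top-scoring candidates, each with the same probability.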

This uses the alignment scores rather than the NAM scores, which means we need to compute the alignments. We compute the alignments anyway, but this may slow things down in one case: I had to restrict the optimization that allowed us to stop computing alignments once we have found two exact matches (i.e., edit distance zero). Now we need to keep computing alignments until we find a non-exact one, because we need to pick among all the exact ones.
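The effect on the amount of work can be sketched as follows. This is only one possible reading of the description above, not the PR's actual logic: the input is assumed to be the edit distances of the alignments in the order the NAMs are visited, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Previous rule as described above: stop as soon as two exact matches
// (edit distance zero) have been found.
// Returns how many alignments would be computed.
std::size_t alignments_computed_before(const std::vector<int>& edit_distances) {
    std::size_t computed = 0;
    int n_exact = 0;
    for (int ed : edit_distances) {
        assert(ed >= 0);
        ++computed;
        if (ed == 0 && ++n_exact == 2) {
            break;
        }
    }
    return computed;
}

// Restricted rule (assumed reading): once an exact match has been seen,
// keep going until the first non-exact alignment, so that all exact
// matches up to that point are available to choose from.
std::size_t alignments_computed_after(const std::vector<int>& edit_distances) {
    std::size_t computed = 0;
    bool seen_exact = false;
    for (int ed : edit_distances) {
        assert(ed >= 0);
        ++computed;
        if (ed == 0) {
            seen_exact = true;
        } else if (seen_exact) {
            break;
        }
    }
    return computed;
}
```

Under these assumptions, a read whose candidates have edit distances 0, 0, 0, 0, 3, 5 needs two alignments under the previous rule and five under the restricted one. Highly repetitive reads with many exact matches are therefore where a slowdown would be expected.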

I will need to measure how much this slows things down.

See #359

marcelm commented 7 months ago

The CI "compare" job fails due to #361. That runs on paired-end reads only anyway, so here are the results for running tests/compare-baseline.sh -s locally (-s switches it to single-end mode):

Before/after comparisons

       9643 reads were unmapped before and after
          0 reads became mapped
          0 reads became unmapped
      80139 reads were mapped to same locus before and after
*     10218 reads were multimapper before and after, same alignment score (AS)
          0 reads were multimapper before and after, better alignment score (AS)
          0 reads were multimapper before and after, worse alignment score (AS)
          0 reads changed in another way
     100000 total reads

That is exactly as expected: Many reads that are multimappers now get assigned to a different location, but all of them still have the same alignment score as before.

ksahlin commented 7 months ago

As discussed over email:

marcelm commented 7 months ago

Superseded by #364.