Optimized parameters lead to reduced mapping rate

This was being discussed in a closed PR, so here’s a proper issue for it as I am not sure what the conclusion is.

PR #345 introduced syncmer/strobemer generation parameters that maximize accuracy, but as can be seen in percentage_aligned_plot.pdf, this leads to a reduced mapping rate.

As @ksahlin writes in https://github.com/ksahlin/strobealign/issues/345#issuecomment-1739610142:

Interesting phenomenon. I had not anticipated this. While high precision is good, here is my take:

While I generally agree that accuracy (as measured by tot_correct_mapped/tot_reads) is the metric to primarily optimise for, there are situations (perfect repeats) where more mapped reads are good such as for coverage based applications. While I don't have data to support, it is likely that most of the loss in mapped reads was to repetitive regions (including perfect repeats).

If could be that much smaller seeds lead to many hits going beyond the hard threshold (1000?) and simply become unmapped (strobealign gives up). While it is okay to take some loss in percentage mapped, it is worth monitoring. The drop we saw is probably acceptable (borderline) given the increase in accuracy.

I am not sure why the decline is so big even for 100nt reads. I will try the next commit (after your optimization) and see the final damage in percentage mapped at that point.

Some thoughts:

If a higher mapping rate is beneficial in some cases (even at the cost of some accuracy), we should measure this somehow. How?
Do we need a strategy to deal with this in future optimization efforts? The parameter optimization script could be changed to also report mapping rate.
We should find out what is going on exactly. Look at a couple of reads that become unmapped with the new parameter settings and see what happens.
Can we distinguish multimappers from uniquely mapping reads when measuring accuracy? We would need a better ground truth than what we have. This is related to #341 as both require generating a ground truth.

Not sure I have great ideas at the moment, but here are some thoughts:

Point 1 and 4: Higher mapping rate (at similar accuracy) is likely only desired in the case of perfect repeats. How about flagging multimapping reads (Mapq=0) using either strobealign baseline run with -N 100 -M 100 or BWA-MEM reporting more than one alignment? You had a script doing a similar type of analysis I remember. If strobealign is run in the above suggestion, we could compare the alignment scores between the two versions and mark the reads as correct and multimapper if the score is identical.

Point 2: Accuracy should be the primary optimisation metric, but perhaps mapped reads (%) could be output as well?

Point 3: Agree that it would be important, specifically since we aim for larger reference databases.

ksahlin / strobealign

Optimized parameters lead to reduced mapping rate #354