ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
141 stars 17 forks source link

Optimize parameters – second attempt #376

Closed marcelm closed 9 months ago

marcelm commented 9 months ago

This is to discuss a new version of the parameter optimization PR #345, which was reverted because the mapping rate decreases, see #354.

I have updated the parameter optimization script such that it tries to maximize score-based accuracy. Score-based accuracy counts a read as mapped correctly if the score of the alignment is at least as good as the score that the alignment has at the true location.

Here are the suggested new parameters. This time, normal accuracy, score-based accuracy ("s.acc") and mapping rate ("maprate") are shown.

Numbers are not absolute, but differences to the baseline (unoptimized parameters). read length Current parameters suggested s.acc SE s.acc PE accuracy SE accuracy PE maprate SE maprate PE
50 (20, 16, -3, 2) (18, 14, -2, 1) +0.8532 +0.1314 +1.0367 +0.3163 +0.952 +0.048
75 (20, 16, -3, 2) nothing found
100 (20, 16, -2, 2) nothing found
150 (20, 16, 1, 7) nothing found
200 (20, 16, 4, 13) (22, 18, 2, 12) +0.0162 +0.0083 +0.0105 +0.0094 +0.002 +0
300 (22, 18, 2, 12) (24, 20, 5, 13) +0.0807 +0.0234 +0.0438 +0.0315 0 0
500 (23, 17, 2, 12) (24, 18, 6, 13) +0.0576 +0.0233 +0.0451 +0.0248 0 0
ksahlin commented 9 months ago

Nice gain for the 50nt reads! Do all the columns need to be positive to switch to a better setting?

How do you explore the parameter space?

marcelm commented 9 months ago

Only s.acc SE and s.acc SE need to be positive, I just report the others (we "got lucky" that they also are positive).

The parameter space for read length 50 is generated thus:

The script then tests all valid combinations on the dataset that is fastest to run (single-end drosophila) and discards those combinations that lead to worse results. With the combinations that are left, it goes to the next bigger dataset and does the same thing, and so on until only a few parameter combinations are left and we reach paired-end rye. The final combinations are then sorted by weighted average accuracy (PE + 0.25 * SE) and printed. And as you see above, for read lengths 75, 100, 150, only the original parameter combination was left. (It looks like the same is going to happen for 200.)

ksahlin commented 9 months ago

I see, makes sense!

Not to be a PITA but it seems optimum for 50nt is found at boundary, how about also trying k=17?

marcelm commented 9 months ago

Good point! I added k=17 now. For single-end reads, (17, 13, -2, 2) is actually even better at +1.1250 score-based accuracy, +1.4751 normal accuracy, +1.2543 mapping rate. For paired-end reads (and for the weighted average), the parameters I wrote in the table above are still best.

marcelm commented 9 months ago

Optimization is now finished for all read lengths. Read length 50 benefits most and the gains for the others are less than 0.1 percentage points if any.

ksahlin commented 9 months ago

Great. How about implementing the new parameter settings for the 50nt and 200nt read lengths? As for 300nt and 500nt, I am a little bit concerned raising l too much if we want to find shorter split read alignment. (We talked about it before).

marcelm commented 9 months ago

Implemented in #379