Optimize parameters (again)

marcelm commented 3 months ago

Here are suggested new indexing parameters for all read lengths.

This supersedes #397.

I ran the optimization script for both v0.12.0 (commit 6fd4c5d) and multi-context seeds (commit c4a7f61).

Differences to #397:

"accuracy slack" is set to 0.1: The accuracy of a single dataset may drop 0.1 percentage points below the baseline without being excluded from further consideration. This is intended to avoid running into local maxima.
Optimization criterion is regular accuracy, not score-based accuracy.
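
The accuracy-slack rule described above can be sketched roughly as follows. This is only an illustration, not the code in search.py: the helper name `within_slack`, the dataset names, and the accuracy values are made up for the example; only the 0.1 percentage point threshold comes from the description.

```python
def within_slack(candidate, baseline, slack=0.1):
    """Return True if no dataset's accuracy drops more than `slack`
    percentage points below the baseline (hypothetical helper)."""
    return all(candidate[name] >= baseline[name] - slack for name in baseline)


# Made-up per-dataset accuracies in percent (not measured values)
baseline = {"dataset_a": 76.67, "dataset_b": 86.16}
slightly_worse = {"dataset_a": 76.60, "dataset_b": 86.50}  # drops 0.07 on dataset_a
much_worse = {"dataset_a": 76.40, "dataset_b": 87.00}  # drops 0.27 on dataset_a

print(within_slack(slightly_worse, baseline))  # stays under consideration
print(within_slack(much_worse, baseline))  # excluded
```

With a slack of 0, the first candidate would already be dropped even though it improves the second dataset, which is the situation the slack is meant to tolerate.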

Command used:

./search.py -c ${commit} -x --accuracy-slack 0.1 --mapping-rate-slack 1 -r ${read_length}
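
For reference, the full set of runs can be enumerated as below. This is a sketch that only builds the command lines from the template above; it assumes one invocation per commit and read length, and takes the read lengths from the results tables in this comment (125 was not measured).

```python
import shlex

# Commits and read lengths as listed in this comment
commits = {"v0.12.0": "6fd4c5d", "multi-context seeds": "c4a7f61"}
read_lengths = [50, 75, 100, 150, 200, 300, 500]


def search_command(commit, read_length):
    """Build the argument list for one optimization run."""
    return [
        "./search.py", "-c", commit, "-x",
        "--accuracy-slack", "0.1",
        "--mapping-rate-slack", "1",
        "-r", str(read_length),
    ]


commands = [search_command(c, r) for c in commits.values() for r in read_lengths]
for cmd in commands:
    print(shlex.join(cmd))
```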

Suggested changes

Parameters are given as a tuple $(k, s, l, u)$.

I did not mechanically pick the settings that optimize mapping-only accuracy, but made sure that they also work well for extension alignment mode. Many parameter settings are found that are essentially equally good, so it was possible for me to find settings that work equally well for v0.12.0 and multi-context seeds, except for read lengths 100 and 150.

Read length	Before	Suggestion	Comment
50	(18, 14, -2, 1)	(16, 12, -2, 0)
75	(20, 16, -3, 2)	(20, 16, -3, -1)	alternative: (21, 17, -3, -1)
100	(20, 16, -2, 2)	(16, 12, 1, 3)	for v0.12. Alternative (17, 13, 1, 3) is very similar
100	(20, 16, -2, 2)	(18, 14, 1, 3)	for multi-context seeds
125	(20, 16, -1, 4)	-	not measured
150	(20, 16, 1, 7)	(20, 16, 2, 5)	for v0.12. Reduces ext. alignment SE accuracy slightly; alternative (20, 16, 2, 8) would not (but improve mapping-only PE accuracy much less)
150	(20, 16, 1, 7)	(22, 18, 3, 5)	for multi-context seeds. Reduces ext. alignment SE accuracy slightly; alternative (23, 19, 2, 7) would not (but improve mapping-only PE accuracy a bit less)
200	(22, 18, 2, 12)	(24, 20, 4, 12)
300	(22, 18, 2, 12)	(24, 20, 5, 13)
500	(23, 17, 2, 12)	(25, 19, 7, 13)

We only have canonical read length 250. Using the interpolated parameters (24, 20, 5, 12) or (24, 20, 4, 12) gives ok results for read lengths 200 and 300.
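
As a worked example of where such values can come from, here is the element-wise midpoint of the suggestions for read lengths 200 and 300. This assumes a plain midpoint; the comment does not say how the interpolation was actually done.

```python
def midpoint(p, q):
    """Element-wise midpoint of two (k, s, l, u) tuples."""
    return tuple((a + b) / 2 for a, b in zip(p, q))


params_200 = (24, 20, 4, 12)  # suggestion for read length 200
params_300 = (24, 20, 5, 13)  # suggestion for read length 300

print(midpoint(params_200, params_300))  # (24.0, 20.0, 4.5, 12.5)
```

Since the parameters are integers, l = 4.5 has to be rounded to 4 or 5 and u = 12.5 to 12 or 13; the two candidates mentioned above both use u = 12 and differ only in l.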

The script was run in a mode where it optimizes mapping-only accuracy. I am currently running it to optimize extension-alignment accuracy. In theory, the results could be different. So far, for the read lengths that are finished (currently 50, 75, 100), they are not.

Details for v0.12

This shows how mapping-only and extension-alignment accuracy change for the suggested parameters.

Read length	(k, s, l, u)	mapping-only SE	mapping-only PE	ext. align SE	ext. align PE
50	(16, 12, -2, 0)	+0.7657	+1.1146	+0.9158	+0.2027
75	(20, 16, -3, -1)	-0.0090	+0.1043	+0.0296	+0.0170
75	(21, 17, -3, -1)	+0.0397	+0.0744	-0.0139	+0.0229
100	(16, 12, 1, 3)	+0.6626	+0.4397	+0.2958	+0.1274
100	(17, 13, 1, 3)	+0.6701	+0.4101	+0.2421	+0.1311
150	(20, 16, 2, 5)	-0.0016	+0.0917	-0.0119	+0.0357
150	(20, 16, 2, 8)	+0.1089	+0.0357	+0.0204	+0.0241
200	(24, 20, 4, 12)	+0.0516	+0.0533	+0.0041	+0.0295
200	(24, 20, 5, 12)	+0.0150	+0.0496
300	(24, 20, 4, 12)	+0.1591	+0.0674
300	(24, 20, 5, 12)	+0.1725	+0.0729
300	(24, 20, 5, 13)	+0.2264	+0.0809	+0.0438	+0.0315
500	(25, 19, 7, 13)	+0.2737	+0.1441	+0.0520	+0.0306
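
The changes in this table are read here as "candidate minus baseline" in percentage points. A minimal check for read length 50 (mapping-only), using the rounded accuracies from the detailed v0.12.0 output further down:

```python
# Rounded accuracies (percent) copied from the read length 50 mapping-only table
baseline = {"acc_se": 62.2715, "acc_pe": 73.5222}  # (18, 14, -2, 1), current setting
candidate = {"acc_se": 63.0373, "acc_pe": 74.6368}  # (16, 12, -2, 0), suggestion

diffs = {key: candidate[key] - baseline[key] for key in baseline}
for key, diff in diffs.items():
    print(f"{key}: {diff:+.4f}")
```

The SE value recomputed this way differs from the table in the last digit (0.7658 vs. 0.7657), presumably because the script subtracts unrounded accuracies.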

More details

Details have been shortened because GitHub’s maximum comment size was reached.
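
Some rows in the tables below are marked "pareto". For the mapping-only tables, the marks are consistent with a Pareto front over (acc_se, acc_pe): a setting is marked if no other setting is at least as good on both and strictly better on one. The sketch below reproduces the marks for v0.12.0, read length 75; it is a reconstruction under that assumption, not the script's actual code, and the extension-alignment tables may use additional criteria.

```python
def pareto_front(rows):
    """Return the parameters of all rows not dominated by another row.

    A row is dominated if another row is at least as good in both
    accuracies and strictly better in at least one of them.
    """
    front = []
    for params, se, pe in rows:
        dominated = any(
            o_se >= se and o_pe >= pe and (o_se > se or o_pe > pe)
            for _, o_se, o_pe in rows
        )
        if not dominated:
            front.append(params)
    return front


# (parameters, acc_se, acc_pe) from the v0.12.0 read length 75 mapping-only table
rows = [
    ((20, 16, -3, -1), 71.5283, 82.4506),
    ((21, 17, -3, -1), 71.5770, 82.4207),
    ((20, 16, -3, 0), 71.5360, 82.3733),
    ((20, 16, -3, 2), 71.5373, 82.3463),
    ((20, 16, -3, 3), 71.5373, 82.3463),
    ((20, 16, -3, 1), 71.5378, 82.3412),
]
print(pareto_front(rows))
```

This prints the two settings that carry the "pareto" mark in that table.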

# v0.12.0

## Read length 50: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(16, 12, -2, 0) 63.0373 74.6368 +0.7657 +1.1146 96.585  96.573  +0.457  +0.443      pareto
(16, 12, -2, 1) 62.9708 74.5278 +0.6992 +1.0056 96.693  96.680  +0.564  +0.551      
(16, 12, -2, 2) 62.9749 74.5250 +0.7033 +1.0028 96.694  96.682  +0.565  +0.552      
(17, 13, -2, 0) 62.7352 74.2179 +0.4637 +0.6957 96.297  96.293  +0.169  +0.163      
(17, 13, -2, 1) 62.6388 74.0388 +0.3673 +0.5166 96.426  96.421  +0.297  +0.291      
(17, 13, -2, 2) 62.6360 74.0215 +0.3645 +0.4993 96.430  96.426  +0.302  +0.296      
(18, 14, -2, 0) 62.4612 73.7873 +0.1897 +0.2651 95.983  95.983  -0.145  -0.146      
(18, 14, -2, 1) 62.2715 73.5222 -0.0000 +0.0000 96.128  96.130  +0.000  +0.000  *****   
(18, 14, -2, 2) 62.2643 73.4951 -0.0072 -0.0270 96.137  96.139  +0.009  +0.009      

## Read length 50: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(16, 12, -2, 0) 92.8490 97.3130 +0.4439 -0.0220 66.0610 80.8919 +0.9158 +0.2027 96.585  99.455  +0.457  -0.056      pareto
(18, 14, -2, 1) 92.4051 97.3350 +0.0000 -0.0000 65.1452 80.6892 +0.0000 +0.0000 96.128  99.511  +0.000  +0.000

## Read length 75: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(20, 16, -3, -1)    71.5283 82.4506 -0.0090 +0.1043 98.817  98.823  -0.182  -0.181      pareto
(21, 17, -3, -1)    71.5770 82.4207 +0.0397 +0.0744 98.747  98.750  -0.252  -0.254      pareto
(20, 16, -3, 0) 71.5360 82.3733 -0.0013 +0.0269 98.979  98.984  -0.020  -0.020      
(20, 16, -3, 2) 71.5373 82.3463 -0.0000 +0.0000 98.999  99.004  +0.000  +0.000  *****   
(20, 16, -3, 3) 71.5373 82.3463 -0.0000 +0.0000 98.999  99.004  +0.000  +0.000      
(20, 16, -3, 1) 71.5378 82.3412 +0.0004 -0.0051 98.999  99.004  -0.000  -0.000      

## Read length 75: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(20, 16, -3, -1)    95.6523 98.4026 -0.1678 -0.0217 74.8279 86.7332 +0.0296 +0.0170 98.817  99.761  -0.182  -0.029      pareto
(21, 17, -3, -1)    95.6500 98.4203 -0.1701 -0.0040 74.7843 86.7391 -0.0139 +0.0229 98.747  99.754  -0.252  -0.036      pareto
(20, 16, -3, 2) 95.8202 98.4242 +0.0000 +0.0000 74.7982 86.7161 +0.0000 +0.0000 98.999  99.790  +0.000  +0.000  *****   

## Read length 100: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(16, 12, 1, 3)  77.3307 86.5965 +0.6626 +0.4397 99.181  99.180  -0.301  -0.301      pareto
(17, 13, 1, 3)  77.3382 86.5670 +0.6701 +0.4101 99.118  99.120  -0.364  -0.361      pareto
(18, 14, 1, 3)  77.3075 86.5272 +0.6393 +0.3704 99.042  99.040  -0.440  -0.442      
(17, 13, 0, 3)  76.9034 86.3236 +0.2352 +0.1668 99.387  99.387  -0.095  -0.094      
(18, 14, 0, 3)  76.9109 86.3185 +0.2428 +0.1617 99.344  99.342  -0.138  -0.140      
(16, 12, 0, 3)  76.8668 86.3232 +0.1986 +0.1664 99.446  99.446  -0.036  -0.035      
(18, 14, 0, 2)  76.8087 86.3316 +0.1405 +0.1747 99.250  99.250  -0.232  -0.232      
(17, 13, 0, 2)  76.7686 86.3275 +0.1004 +0.1707 99.292  99.293  -0.190  -0.188      
(19, 15, 0, 3)  76.8939 86.2880 +0.2257 +0.1311 99.283  99.285  -0.199  -0.196      
(19, 15, 0, 2)  76.7953 86.2902 +0.1272 +0.1333 99.202  99.201  -0.280  -0.280      
(16, 12, 0, 2)  76.7280 86.3065 +0.0598 +0.1496 99.347  99.347  -0.135  -0.134      
(20, 16, -2, 2) 76.6682 86.1568 +0.0000 +0.0000 99.482  99.481  +0.000  +0.000  *****   
(20, 16, -2, 3) 76.6916 86.1485 +0.0234 -0.0083 99.491  99.490  +0.009  +0.008      
(21, 17, -2, 1) 76.5986 86.1625 -0.0696 +0.0057 99.393  99.392  -0.089  -0.089      

## Read length 100: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(16, 12, 1, 3)  96.8270 98.8985 +0.0900 +0.0661 80.2273 89.7221 +0.2958 +0.1274 99.181  99.770  -0.301  -0.036      pareto
(17, 13, 1, 3)  96.8201 98.9200 +0.0831 +0.0876 80.1736 89.7258 +0.2421 +0.1311 99.118  99.769  -0.364  -0.036      pareto
(20, 16, -2, 2) 96.7371 98.8324 +0.0000 +0.0000 79.9315 89.5947 +0.0000 +0.0000 99.482  99.805  +0.000  +0.000  *****   

## Read length 150: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(21, 17, 2, 6)  83.8433 90.4057 +0.0664 +0.0783 99.662  99.662  -0.065  -0.064      pareto
(20, 16, 2, 5)  83.7752 90.4192 -0.0016 +0.0917 99.666  99.668  -0.061  -0.059      pareto
(20, 16, 2, 6)  83.8460 90.3985 +0.0692 +0.0711 99.685  99.686  -0.042  -0.041      pareto
(21, 17, 2, 5)  83.7807 90.4081 +0.0038 +0.0806 99.645  99.643  -0.082  -0.084      pareto
(20, 16, 3, 6)  83.8200 90.3907 +0.0432 +0.0633 99.629  99.627  -0.098  -0.100      
(21, 17, 2, 7)  83.8663 90.3741 +0.0895 +0.0466 99.672  99.672  -0.055  -0.055      pareto
(20, 16, 2, 8)  83.8858 90.3632 +0.1089 +0.0357 99.699  99.698  -0.028  -0.029      pareto
(20, 16, 2, 7)  83.8592 90.3694 +0.0824 +0.0420 99.695  99.695  -0.032  -0.032      
(20, 16, 3, 5)  83.7589 90.3940 -0.0179 +0.0666 99.584  99.582  -0.143  -0.145      
(19, 15, 3, 7)  83.7862 90.3850 +0.0093 +0.0575 99.709  99.708  -0.018  -0.019      
(22, 18, 2, 7)  83.8621 90.3577 +0.0852 +0.0302 99.644  99.641  -0.083  -0.086      
(21, 17, 2, 8)  83.8798 90.3507 +0.1030 +0.0232 99.682  99.680  -0.045  -0.046      
(20, 16, 3, 7)  83.8378 90.3611 +0.0609 +0.0337 99.648  99.647  -0.079  -0.080      
(19, 15, 4, 7)  83.7849 90.3693 +0.0081 +0.0418 99.649  99.648  -0.078  -0.079      
(19, 15, 4, 8)  83.8282 90.3539 +0.0513 +0.0264 99.671  99.670  -0.056  -0.056      
(19, 15, 3, 8)  83.8379 90.3502 +0.0610 +0.0228 99.717  99.716  -0.010  -0.011      
(20, 16, 3, 8)  83.8600 90.3425 +0.0832 +0.0150 99.661  99.659  -0.067  -0.068      
(19, 15, 4, 6)  83.7073 90.3794 -0.0696 +0.0520 99.611  99.610  -0.116  -0.117      
(21, 17, 1, 7)  83.7966 90.3424 +0.0197 +0.0149 99.712  99.711  -0.015  -0.015      
(18, 14, 4, 8)  83.7598 90.3319 -0.0170 +0.0044 99.693  99.690  -0.034  -0.036      
(20, 16, 1, 7)  83.7768 90.3275 +0.0000 +0.0000 99.727  99.727  +0.000  +0.000  *****   
(20, 16, 1, 8)  83.8208 90.3148 +0.0440 -0.0127 99.730  99.729  +0.003  +0.002      
(21, 17, 1, 8)  83.8313 90.3107 +0.0545 -0.0167 99.715  99.714  -0.012  -0.013      
(18, 14, 3, 8)  83.7611 90.3143 -0.0157 -0.0131 99.728  99.727  +0.001  +0.000      
(22, 18, 1, 6)  83.7280 90.3213 -0.0488 -0.0061 99.681  99.680  -0.046  -0.046      
(22, 18, 1, 7)  83.7775 90.3080 +0.0007 -0.0194 99.686  99.684  -0.041  -0.043      
(19, 15, 2, 8)  83.7611 90.3010 -0.0158 -0.0264 99.741  99.740  +0.014  +0.013      
(18, 14, 5, 8)  83.6952 90.2884 -0.0816 -0.0391 99.633  99.630  -0.094  -0.096      

## Read length 150: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(21, 17, 2, 5)  97.8650 99.2698 -0.0823 +0.0274 86.1799 92.3607 -0.0529 +0.0468 99.645  99.780  -0.082  -0.001      pareto
(20, 16, 2, 5)  97.8731 99.2638 -0.0742 +0.0214 86.2209 92.3496 -0.0119 +0.0357 99.666  99.780  -0.061  -0.001      pareto
(21, 17, 2, 7)  97.9767 99.2781 +0.0294 +0.0357 86.2365 92.3424 +0.0037 +0.0285 99.672  99.781  -0.055  -0.000      pareto
(20, 16, 2, 8)  98.0105 99.2648 +0.0632 +0.0225 86.2533 92.3380 +0.0204 +0.0241 99.699  99.781  -0.028  -0.000      pareto
(21, 17, 2, 6)  97.9248 99.2729 -0.0225 +0.0305 86.2055 92.3472 -0.0273 +0.0333 99.662  99.781  -0.065  -0.000      
(20, 16, 2, 6)  97.9302 99.2642 -0.0171 +0.0218 86.2404 92.3337 +0.0076 +0.0198 99.685  99.781  -0.042  -0.000      
(20, 16, 1, 7)  97.9473 99.2424 -0.0000 -0.0000 86.2328 92.3139 +0.0000 +0.0000 99.727  99.781  +0.000  +0.000  *****   

## Read length 200: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 4, 11) 87.5395 91.8639 +0.0401 +0.0623 99.720  99.717  -0.025  -0.025      pareto
(24, 20, 3, 12) 87.5548 91.8545 +0.0554 +0.0528 99.732  99.730  -0.013  -0.013      pareto
(24, 20, 4, 12) 87.5510 91.8549 +0.0516 +0.0533 99.721  99.718  -0.024  -0.024      pareto
(24, 20, 4, 10) 87.4988 91.8656 -0.0006 +0.0640 99.719  99.716  -0.026  -0.027      pareto
(24, 20, 3, 10) 87.5058 91.8626 +0.0065 +0.0610 99.730  99.729  -0.014  -0.014      
(24, 20, 3, 11) 87.5212 91.8571 +0.0218 +0.0555 99.732  99.730  -0.013  -0.013      
(24, 20, 3, 13) 87.5738 91.8424 +0.0744 +0.0408 99.733  99.731  -0.012  -0.012      pareto
(23, 19, 3, 12) 87.5488 91.8472 +0.0494 +0.0455 99.737  99.735  -0.008  -0.008      
(23, 19, 3, 10) 87.4930 91.8580 -0.0064 +0.0563 99.736  99.734  -0.009  -0.009      
...
(22, 18, 2, 12) 87.4994 91.8016 +0.0000 +0.0000 99.745  99.743  +0.000  +0.000  *****   

## Read length 200: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 4, 12) 98.4782 99.3803 +0.0502 +0.0442 89.4550 93.1223 +0.0041 +0.0295 99.721  99.750  -0.024  +0.000      pareto
(24, 20, 3, 12) 98.4726 99.3639 +0.0446 +0.0279 89.4582 93.1198 +0.0074 +0.0270 99.732  99.750  -0.013  +0.000      pareto
(24, 20, 4, 11) 98.4497 99.3765 +0.0216 +0.0404 89.4347 93.1223 -0.0161 +0.0295 99.720  99.750  -0.025  +0.000      
(24, 20, 3, 13) 98.4901 99.3666 +0.0620 +0.0306 89.4578 93.1104 +0.0070 +0.0176 99.733  99.750  -0.012  +0.000      
(24, 20, 4, 10) 98.4217 99.3768 -0.0063 +0.0407 89.4191 93.1175 -0.0318 +0.0246 99.719  99.750  -0.026  +0.000      
(22, 18, 2, 12) 98.4280 99.3361 +0.0000 -0.0000 89.4508 93.0928 +0.0000 +0.0000 99.745  99.750  +0.000  +0.000  *****   

## Read length 300: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 6, 13) 91.0242 94.7012 +0.2182 +0.0960 99.692  99.690  -0.001  -0.001      pareto
(24, 20, 7, 13) 91.0230 94.6993 +0.2171 +0.0940 99.691  99.690  -0.001  -0.001      
(24, 20, 8, 13) 90.9953 94.7001 +0.1893 +0.0948 99.690  99.689  -0.002  -0.002      
(24, 20, 5, 13) 91.0323 94.6862 +0.2264 +0.0809 99.692  99.691  -0.000  -0.000      pareto
(23, 19, 6, 13) 91.0143 94.6817 +0.2084 +0.0764 99.692  99.690  -0.000  -0.000      
(24, 20, 6, 12) 90.9737 94.6912 +0.1678 +0.0859 99.691  99.690  -0.001  -0.001      
...
(22, 18, 2, 12) 90.8059 94.6053 +0.0000 +0.0000 99.692  99.691  +0.000  +0.000  *****   

## Read length 300: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 5, 13) 98.7497 99.4555 +0.0807 +0.0234 92.4811 95.6001 +0.0438 +0.0315 99.692  99.691  -0.000  +0.000      pareto
(24, 20, 6, 13) 98.7494 99.4608 +0.0804 +0.0287 92.4793 95.6004 +0.0419 +0.0319 99.692  99.691  -0.001  -0.000      pareto
(22, 18, 2, 12) 98.6690 99.4322 +0.0000 +0.0000 92.4373 95.5686 -0.0000 +0.0000 99.692  99.691  +0.000  +0.000  *****   

## Read length 500: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(25, 19, 8, 13) 93.5670 95.5009 +0.2708 +0.1493 99.578  99.574  -0.000  -0.000      pareto
(25, 19, 7, 13) 93.5699 95.4957 +0.2737 +0.1441 99.578  99.574  -0.000  -0.000      pareto
(25, 19, 6, 13) 93.5695 95.4906 +0.2733 +0.1390 99.578  99.574  -0.000  +0.000      
(25, 19, 7, 12) 93.5153 95.4898 +0.2191 +0.1382 99.578  99.574  -0.000  -0.000      
...
(23, 17, 2, 12) 93.2962 95.3516 +0.0000 +0.0000 99.578  99.574  +0.000  +0.000  *****   

## Read length 500: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(25, 19, 8, 13) 98.9691 99.3465 +0.0594 +0.0285 94.7018 96.0529 +0.0562 +0.0347 99.578  99.574  -0.000  +0.000      pareto
(25, 19, 7, 13) 98.9757 99.3490 +0.0660 +0.0310 94.6976 96.0488 +0.0520 +0.0306 99.578  99.574  -0.000  -0.000      
(23, 17, 2, 12) 98.9097 99.3180 +0.0000 +0.0000 94.6456 96.0182 +0.0000 +0.0000 99.578  99.574  +0.000  +0.000  *****   

# Multi-context seeds (c4a7f61)

## Read length 50: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(16, 12, -2, 0) 63.2349 74.9216 +0.3466 +0.6877 97.123  97.112  +0.146  +0.136      pareto
(16, 12, -2, 1) 63.2007 74.8490 +0.3124 +0.6150 97.252  97.244  +0.275  +0.268      
(16, 12, -2, 2) 63.1981 74.8449 +0.3097 +0.6110 97.254  97.246  +0.277  +0.270      
(17, 13, -2, 0) 63.1308 74.6860 +0.2425 +0.4520 96.977  96.969  -0.000  -0.007      
(17, 13, -2, 1) 63.0462 74.5605 +0.1578 +0.3265 97.151  97.143  +0.174  +0.168      
(17, 13, -2, 2) 63.0538 74.5557 +0.1654 +0.3218 97.158  97.150  +0.181  +0.174      
(18, 14, -2, 0) 62.9883 74.4344 +0.0999 +0.2004 96.756  96.754  -0.220  -0.222      
(18, 14, -2, 1) 62.8884 74.2339 +0.0000 +0.0000 96.977  96.976  +0.000  +0.000  *****   
(18, 14, -2, 2) 62.8892 74.2215 +0.0008 -0.0124 96.992  96.992  +0.015  +0.016      

## Read length 50: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(16, 12, -2, 0) 93.1696 97.3440 +0.0728 -0.0574 66.3488 80.9302 +0.5248 +0.1651 97.123  99.487  +0.146  -0.077      pareto
(18, 14, -2, 1) 93.0968 97.4014 +0.0000 +0.0000 65.8240 80.7651 +0.0000 +0.0000 96.977  99.565  +0.000  +0.000  *****   

## Read length 75: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(21, 17, -3, -1)    71.7299 82.5870 +0.0342 +0.0823 98.887  98.888  -0.258  -0.260      pareto
(20, 16, -3, 0) 71.6998 82.5136 +0.0042 +0.0089 99.119  99.123  -0.025  -0.025      
(20, 16, -3, 1) 71.6971 82.5044 +0.0015 -0.0003 99.144  99.148  -0.000  -0.000      
(20, 16, -3, 2) 71.6957 82.5047 +0.0000 -0.0000 99.145  99.149  +0.000  +0.000  *****   
(20, 16, -3, 3) 71.6957 82.5047 +0.0000 -0.0000 99.145  99.149  +0.000  +0.000      

## Read length 75: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(21, 17, -3, -1)    95.7962 98.4268 -0.1689 -0.0038 74.9228 86.7618 -0.0016 +0.0307 98.887  99.756  -0.258  -0.036      pareto
(20, 16, -3, 2) 95.9650 98.4305 +0.0000 +0.0000 74.9245 86.7311 +0.0000 +0.0000 99.145  99.792  +0.000  -0.000  *****   pareto

## Read length 100: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(18, 14, 1, 3)  77.4674 86.7038 +0.6795 +0.4186 99.235  99.238  -0.367  -0.363      pareto
(17, 13, 1, 3)  77.4442 86.6853 +0.6563 +0.4001 99.266  99.267  -0.335  -0.335      
(19, 15, 1, 3)  77.4468 86.6761 +0.6589 +0.3909 99.221  99.220  -0.380  -0.382      
(16, 12, 1, 3)  77.3820 86.6520 +0.5941 +0.3668 99.289  99.288  -0.312  -0.314      
(20, 16, 0, 3)  77.4297 86.6071 +0.6418 +0.3219 99.199  99.195  -0.402  -0.407      
(20, 16, 0, 2)  77.4049 86.6042 +0.6170 +0.3190 99.164  99.159  -0.437  -0.442      
(19, 15, 0, 3)  77.0658 86.4753 +0.2779 +0.1900 99.471  99.472  -0.131  -0.129      
(21, 17, -1, 1) 77.0216 86.4653 +0.2337 +0.1801 99.312  99.308  -0.289  -0.293      
(20, 16, -1, 3) 77.1043 86.4446 +0.3164 +0.1594 99.451  99.452  -0.151  -0.149      
(20, 16, -1, 1) 76.9934 86.4722 +0.2056 +0.1870 99.345  99.349  -0.256  -0.253      
(20, 16, -1, 2) 77.0752 86.4512 +0.2874 +0.1660 99.431  99.434  -0.171  -0.168      
(18, 14, 0, 3)  77.0269 86.4572 +0.2390 +0.1720 99.484  99.484  -0.118  -0.118      
(19, 15, 0, 2)  76.9685 86.4657 +0.1807 +0.1805 99.389  99.387  -0.213  -0.215      
(18, 14, 0, 2)  76.9113 86.4563 +0.1234 +0.1711 99.393  99.394  -0.208  -0.208      
(17, 13, 0, 3)  76.9784 86.4225 +0.1905 +0.1373 99.495  99.497  -0.107  -0.105      
(17, 13, 0, 2)  76.8298 86.4074 +0.0419 +0.1222 99.402  99.404  -0.199  -0.198      
(16, 12, 0, 3)  76.9166 86.3725 +0.1287 +0.0873 99.525  99.525  -0.076  -0.076      
(22, 18, -2, 3) 76.8749 86.3023 +0.0871 +0.0171 99.561  99.563  -0.040  -0.039      
(22, 18, -2, 1) 76.7769 86.3194 -0.0110 +0.0342 99.512  99.512  -0.089  -0.090      
(22, 18, -2, 2) 76.8542 86.2984 +0.0663 +0.0132 99.553  99.554  -0.048  -0.048      
(21, 17, -2, 2) 76.8275 86.3046 +0.0397 +0.0194 99.580  99.579  -0.022  -0.023      
(21, 17, -2, 3) 76.8509 86.2957 +0.0631 +0.0105 99.588  99.587  -0.013  -0.014      
(21, 17, -2, 1) 76.7558 86.3089 -0.0320 +0.0237 99.536  99.534  -0.065  -0.068      
(20, 16, -2, 3) 76.8415 86.2799 +0.0536 -0.0053 99.611  99.611  +0.010  +0.009      
(20, 16, -2, 2) 76.7879 86.2852 +0.0000 +0.0000 99.601  99.602  +0.000  +0.000  *****   

## Read length 100: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(18, 14, 1, 3)  96.9435 98.9435 +0.0754 +0.1069 80.2780 89.7660 +0.2204 +0.1587 99.235  99.773  -0.367  -0.033      pareto
(20, 16, -2, 2) 96.8680 98.8366 +0.0000 +0.0000 80.0575 89.6073 +0.0000 +0.0000 99.601  99.806  +0.000  +0.000  *****   

## Read length 150: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(23, 19, 3, 6)  83.9188 90.4801 +0.1618 +0.1330 99.690  99.689  -0.072  -0.072      pareto
(23, 19, 3, 5)  83.8571 90.4898 +0.1000 +0.1427 99.653  99.653  -0.109  -0.109      pareto
(22, 18, 3, 5)  83.8277 90.4916 +0.0707 +0.1445 99.662  99.664  -0.099  -0.098      pareto
(22, 18, 3, 6)  83.8811 90.4733 +0.1240 +0.1263 99.700  99.700  -0.062  -0.061      
(23, 19, 2, 5)  83.8464 90.4809 +0.0894 +0.1338 99.710  99.709  -0.052  -0.053      
(23, 19, 3, 7)  83.9240 90.4585 +0.1670 +0.1115 99.710  99.709  -0.052  -0.053      pareto
(21, 17, 3, 5)  83.8062 90.4857 +0.0491 +0.1387 99.669  99.669  -0.093  -0.092      
(23, 19, 2, 6)  83.8871 90.4654 +0.1301 +0.1184 99.725  99.725  -0.037  -0.037      
(23, 19, 2, 7)  83.9296 90.4526 +0.1725 +0.1055 99.734  99.734  -0.028  -0.028      pareto
(22, 18, 2, 5)  83.8371 90.4757 +0.0800 +0.1286 99.714  99.712  -0.048  -0.049      
(21, 17, 2, 6)  83.8632 90.4686 +0.1062 +0.1215 99.734  99.734  -0.027  -0.028      
(22, 18, 2, 6)  83.8924 90.4575 +0.1354 +0.1104 99.731  99.730  -0.031  -0.032      
(21, 17, 3, 6)  83.8703 90.4606 +0.1133 +0.1135 99.711  99.709  -0.051  -0.052      
(21, 17, 3, 7)  83.8881 90.4484 +0.1311 +0.1014 99.726  99.725  -0.036  -0.037      
(22, 18, 2, 7)  83.9069 90.4414 +0.1498 +0.0943 99.740  99.740  -0.022  -0.022      
(20, 16, 3, 6)  83.8398 90.4549 +0.0828 +0.1079 99.717  99.716  -0.045  -0.046      
(22, 18, 3, 7)  83.9017 90.4377 +0.1446 +0.0907 99.720  99.720  -0.042  -0.042      
(23, 19, 2, 8)  83.9332 90.4283 +0.1761 +0.0813 99.740  99.739  -0.022  -0.022      pareto
(23, 19, 3, 8)  83.9324 90.4264 +0.1754 +0.0793 99.721  99.720  -0.041  -0.042      
...
(20, 16, 1, 7)  83.7570 90.3470 +0.0000 +0.0000 99.762  99.762  -0.000  +0.000  *****   

## Read length 150: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(22, 18, 3, 5)  97.9170 99.3012 -0.0674 +0.0578 86.2434 92.3854 -0.0286 +0.0711 99.662  99.779  -0.099  -0.002      pareto
(23, 19, 3, 5)  97.9233 99.3057 -0.0610 +0.0623 86.2171 92.3813 -0.0549 +0.0670 99.653  99.779  -0.109  -0.002      
(23, 19, 3, 6)  97.9852 99.3068 +0.0008 +0.0634 86.2630 92.3627 -0.0091 +0.0484 99.690  99.780  -0.072  -0.001      pareto
(23, 19, 3, 7)  98.0339 99.3086 +0.0496 +0.0652 86.2616 92.3630 -0.0105 +0.0487 99.710  99.781  -0.052  -0.000      pareto
(23, 19, 2, 7)  98.0662 99.2976 +0.0819 +0.0542 86.2846 92.3562 +0.0126 +0.0419 99.734  99.781  -0.028  -0.000      pareto
(23, 19, 2, 8)  98.1032 99.2958 +0.1189 +0.0524 86.3059 92.3469 +0.0339 +0.0326 99.740  99.781  -0.022  -0.000      pareto
(20, 16, 1, 7)  97.9843 99.2434 +0.0000 +0.0000 86.2720 92.3143 +0.0000 +0.0000 99.762  99.781  -0.000  +0.000  *****   

## Read length 200: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 4, 12) 87.4547 91.8541 +0.0733 +0.0944 99.747  99.745  -0.003  -0.003      pareto
(24, 20, 5, 10) 87.4016 91.8602 +0.0203 +0.1006 99.741  99.739  -0.009  -0.009      pareto
(24, 20, 5, 11) 87.4243 91.8544 +0.0430 +0.0947 99.743  99.741  -0.007  -0.007      pareto
(24, 20, 4, 11) 87.4371 91.8494 +0.0557 +0.0897 99.746  99.744  -0.004  -0.004      
(24, 20, 4, 10) 87.4239 91.8470 +0.0426 +0.0874 99.745  99.744  -0.005  -0.005      
(24, 20, 5, 12) 87.4302 91.8424 +0.0489 +0.0827 99.744  99.742  -0.006  -0.006      
(24, 20, 4, 13) 87.4852 91.8282 +0.1038 +0.0685 99.748  99.746  -0.002  -0.002      pareto
(24, 20, 5, 13) 87.4570 91.8312 +0.0757 +0.0716 99.745  99.743  -0.005  -0.005      pareto
(24, 20, 3, 13) 87.4898 91.8222 +0.1084 +0.0626 99.749  99.747  -0.001  -0.001      pareto
(24, 20, 3, 10) 87.4065 91.8406 +0.0252 +0.0809 99.747  99.746  -0.003  -0.003      
...
(22, 18, 2, 12) 87.3813 91.7597 +0.0000 +0.0000 99.750  99.748  +0.000  +0.000  *****   

## Read length 200: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 4, 12) 98.5003 99.3799 +0.0720 +0.0447 89.4840 93.1277 +0.0491 +0.0333 99.747  99.750  -0.003  +0.000      pareto
(24, 20, 5, 13) 98.5112 99.3841 +0.0828 +0.0489 89.4901 93.1235 +0.0551 +0.0291 99.745  99.750  -0.005  +0.000      pareto
(24, 20, 5, 11) 98.4640 99.3801 +0.0356 +0.0449 89.4556 93.1223 +0.0206 +0.0279 99.743  99.750  -0.007  +0.000      
(24, 20, 4, 13) 98.5208 99.3765 +0.0924 +0.0413 89.4810 93.1150 +0.0461 +0.0207 99.748  99.750  -0.002  +0.000      
(24, 20, 5, 10) 98.4288 99.3803 +0.0004 +0.0451 89.4413 93.1245 +0.0064 +0.0302 99.741  99.750  -0.009  +0.000      
(24, 20, 3, 13) 98.5043 99.3654 +0.0759 +0.0302 89.4898 93.1114 +0.0548 +0.0170 99.749  99.750  -0.001  +0.000      
(22, 18, 2, 12) 98.4284 99.3352 +0.0000 +0.0000 89.4350 93.0943 +0.0000 +0.0000 99.750  99.750  +0.000  +0.000  *****   

## Read length 300: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 6, 13) 90.7976 94.5880 +0.1863 +0.0741 99.692  99.691  -0.000  -0.000      pareto
(24, 20, 5, 13) 90.8041 94.5827 +0.1927 +0.0688 99.692  99.691  +0.000  +0.000      pareto
(24, 20, 7, 13) 90.7785 94.5875 +0.1671 +0.0735 99.692  99.691  -0.000  -0.000      
(24, 20, 6, 12) 90.7392 94.5910 +0.1279 +0.0770 99.692  99.690  -0.000  -0.000      pareto
(24, 20, 4, 13) 90.7948 94.5756 +0.1835 +0.0616 99.692  99.691  +0.000  +0.000      
(24, 20, 4, 12) 90.7514 94.5846 +0.1400 +0.0707 99.692  99.691  -0.000  -0.000      
(24, 20, 5, 12) 90.7636 94.5773 +0.1523 +0.0633 99.692  99.691  +0.000  -0.000      
(24, 20, 3, 13) 90.7737 94.5729 +0.1624 +0.0590 99.692  99.691  +0.000  +0.000      
...
(22, 18, 2, 12) 90.6113 94.5139 +0.0000 +0.0000 99.692  99.691  +0.000  +0.000  *****   

## Read length 300: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(24, 20, 6, 13) 98.7377 99.4604 +0.0772 +0.0287 92.4746 95.6028 +0.0431 +0.0364 99.692  99.691  -0.000  -0.000      pareto
(24, 20, 5, 13) 98.7374 99.4547 +0.0769 +0.0229 92.4670 95.5950 +0.0355 +0.0286 99.692  99.691  +0.000  +0.000      
(24, 20, 6, 12) 98.7202 99.4553 +0.0596 +0.0235 92.4535 95.5976 +0.0220 +0.0312 99.692  99.691  -0.000  -0.000      
(22, 18, 2, 12) 98.6606 99.4318 +0.0000 +0.0000 92.4316 95.5664 +0.0000 +0.0000 99.692  99.691  +0.000  +0.000  *****   

## Read length 500: Weighted SE/PE results - mapping-only

parameters  acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(25, 19, 7, 13) 93.2384 95.3643 +0.1550 +0.1071 99.578  99.574  -0.000  -0.000      pareto
(25, 19, 6, 13) 93.2549 95.3558 +0.1714 +0.0986 99.578  99.574  -0.000  +0.000      pareto
(25, 19, 5, 13) 93.2581 95.3468 +0.1747 +0.0896 99.578  99.574  +0.000  +0.000      pareto
(25, 19, 7, 12) 93.1965 95.3606 +0.1131 +0.1034 99.578  99.574  -0.000  -0.000      
(25, 19, 6, 12) 93.2142 95.3496 +0.1307 +0.0924 99.578  99.574  -0.000  +0.000      
(25, 19, 4, 13) 93.2463 95.3391 +0.1628 +0.0819 99.578  99.574  +0.000  +0.000      
...
(23, 17, 2, 12) 93.0834 95.2572 +0.0000 +0.0000 99.578  99.574  +0.000  +0.000  *****   

## Read length 500: Weighted SE/PE results - with extension alignment

parameters  sacc_se sacc_pe diff_se diff_pe acc_se  acc_pe  diff_se diff_pe mprt_se mprt_pe diff_se diff_pe
(25, 19, 7, 13) 98.9583 99.3693 +0.0592 +0.0267 94.6867 96.0819 +0.0416 +0.0251 99.578  99.574  -0.000  +0.000      pareto
(25, 19, 5, 13) 98.9473 99.3601 +0.0482 +0.0175 94.6765 96.0713 +0.0313 +0.0145 99.578  99.574  +0.000  +0.000      
(25, 19, 6, 13) 98.9524 99.3623 +0.0533 +0.0197 94.6751 96.0669 +0.0299 +0.0101 99.578  99.574  -0.000  +0.000      
(23, 17, 2, 12) 98.8991 99.3426 +0.0000 +0.0000 94.6451 96.0568 +0.0000 +0.0000 99.578  99.574  +0.000  +0.000  *****

ksahlin commented 3 months ago

Great, I think we could go with your suggested parameter changes in this issue for a benchmark between current hashing and multi-context hashing.

It is interesting that many of the read lengths have the same parameter combination; I am not sure if this is a sign of something bad (e.g., overfitting the design to data, underutilization of partial hits, or underevaluation). Regardless, I think it serves its purpose for now. We are thinking about asymmetrical seeds, which are more important now and may alter things slightly.

(Note: in further evaluations of the new hashing scheme, we should probably log how many times we successfully used a 'partial hit' rather than the full hit. Here, 'successfully' is a bit vague and could have several meanings, such as simply finding a partial hit, or a partial hit being used to build a higher-scoring NAM or pair of NAMs.)

marcelm commented 3 months ago

I have added two branches to the repository, each with a single new commit that switches to the optimized parameters:

v0.12.0-optimized-parameters is on top of v0.12.0.
mcs-optimized-parameters is on top of Ivan’s multi-context-seeds branch.

For completeness, I picked (20, 16, 1, 4) for canonical read length 125 for both branches, but this should not be relevant as the test datasets don’t include that read length.

I also noticed that v0.12.0 still has canonical read length 300, so I left it that way and did not use the interpolated parameters as I had originally suggested.

It would be possible to apply these changes on top of v0.13.0, but since I benchmarked v0.12.0 and there have been very few changes since then that affect accuracy, I suggest we stick to v0.12.0.

ksahlin commented 3 months ago

I have started a benchmark of the two commits.

> For completeness, I picked (20, 16, 1, 4) for canonical read length 125 for both branches, but this should not be relevant as the test datasets don’t include that read length.

The evaluation does include read length 125 as well as read lengths ["50", "75", "91", "100", "111", "125", "136", "150", "176", "200", "250", "300", "500"] to test 'worst case' for some of the parameter ranges.

> It would be possible to apply these changes on top of v0.13.0, but since I benchmarked v0.12.0 and there have been very few changes since then that affect accuracy, I suggest we stick to v0.12.0.

It's great to compare these two commits as a checkpoint to see where we are. However, I am afraid this might not be the last benchmark I do between the two seeding variants. The larger goal before an eventual merge of mcs would be to get rid of the redundant NAMs that cause redundant extension calls (particularly visible in the mcs branch). Ivan is now exploring the asymmetrical version of mcs to check whether my comment at https://github.com/ksahlin/strobealign/pull/405#issuecomment-2001902240 is true. If my guess is correct, it would be nice to benchmark two asymmetrical versions against each other.

ksahlin commented 3 months ago

Evaluation is ready (see attached plots). All results are for PE alignment, symmetric seeds. Main points:

Accuracy

Extension-based accuracy is nearly identical between the two seeds.
Mapping-only accuracy is slightly better for mcs for short reads (see particularly drosophila and CHM13), and slightly worse for longer reads. Notable here is the dip at read length 111 for our current seeds. Another notable issue is that mcs are strictly worse for longer seeds. I do not expect (or accept) this.

Percent mapped

mcs beats current seeds in almost all cases and with quite a big margin, which is nice to see.

Runtime

It seems mcs are faster more often than not for short reads - nice! Possibly because of less rescue extension.
mcs are consistently and substantially slower than current seeds for the longest reads. Ivan and I believe that this is because more mapping sites are tried with extension due to more matches (coming from partial matches). If 'chaining'/scoring of NAMs is implemented well, I do not see a reason for accepting this. Using asymmetric seeds would lead to better NAM merging, hence better scoring, and would take care of this (according to @marcelm's analysis).

Overall:

mcs offer some clear advantages in mapping (in mapped percentage, accuracy, and time) for short reads, but are currently held back somewhat by NAM scoring/chaining, leading to lower accuracy and slower runtime on longer reads. It will be interesting to see whether this can be solved with asymmetric seeds. If this last issue is ironed out, I think we have a strong case for using mcs as the new strategy.
The evaluation does not include SE alignment, but all evidence points to mcs being even better (relatively) on SE data.

@Itolstoganov

accuracy_plot_cut_at_80.pdf percentage_aligned_plot.pdf time_plot.pdf

ksahlin / strobealign

Optimize parameters (again) #407
