ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
141 stars 17 forks source link

Penalize far apart pairs not more than as if they were two single ends #332

Closed marcelm closed 1 year ago

marcelm commented 1 year ago

Closes #321.

The accuracy increases indeed a little bit.

Comparing accuracy

dataset main (4893484) this PR (d723bb4) difference
drosophila-50 90.189 90.1903 +0.0013
drosophila-75 91.6441 91.64485 +0.0007
drosophila-100 92.38785 92.38855 +0.0007
drosophila-150 93.2112 93.21115 -0.0001
drosophila-200 93.52085 93.52115 +0.0003
drosophila-300 95.36085 95.36125 +0.0004
drosophila-500 95.6936 95.6938 +0.0002
maize-50 71.4703 71.47195 +0.0016
maize-75 82.1255 82.12625 +0.0007
maize-100 87.13175 87.1324 +0.0007
maize-150 91.6731 91.67345 +0.0003
maize-200 92.92045 92.9206 +0.0002
maize-300 96.70835 96.70855 +0.0002
maize-500 97.2899 97.29025 +0.0003
CHM13-50 90.635 90.63505 +0.0001
CHM13-75 92.5158 92.5161 +0.0003
CHM13-100 93.21985 93.22015 +0.0003
CHM13-150 94.14035 94.1402 -0.0002
CHM13-200 94.43405 94.43425 +0.0002
CHM13-300 95.62665 95.62645 -0.0002
CHM13-500 95.9505 95.9505 +0.0000
rye-50 69.1402 69.14095 +0.0007
rye-75 80.53445 80.5352 +0.0007
rye-100 85.6483 85.6488 +0.0005
rye-150 90.20375 90.2038 +0.0001
rye-200 91.4773 91.4774 +0.0001
rye-300 94.5574 94.55795 +0.0005
rye-500 95.1326 95.1331 +0.0005

Average difference: +0.0004

I tried a couple of other things. In the formula you suggested in #321, you had an epsilon_pair term. I tested with epsilon_pair=2 and epsilon_pair=5 and while individual accuracies change a little bit for each dataset, the average difference remains at +0.0004 in both cases. I guess the term would have an effect for other datasets, so if you want, we can include it, but we canot measure its effect with the data we have.

Also, as you suggested, I tried changing the three occurrences of the hard-coded -20 in the function to either -10 or -30 (this was in addition to introducing the std::max()). This made the average difference worse (-0.1860 and -0.0056, respectively).

ksahlin commented 1 year ago

Great to have this fix!

if you want, we can include it, but we canot measure its effect with the data we have.

With epsilon_pair I meant that it should be the deciding factor only in cases where 'all else equal'. An epsilon_pair 2 or 5 sounds a bit high and may interfere with disjoint alignments that have better alignment score. How about epsilon_pair=0.001? I would be ok with adding such term without having a proper SV dataset.

marcelm commented 1 year ago

Alright, I added an epsilon of 0.001. I also think it’s fine to have this without more extensive testing.

ksahlin commented 1 year ago

Ok, approved