Comparing accuracy

dataset	main (4893484)	this PR (d723bb4)	difference
drosophila-50	90.189	90.1903	+0.0013
drosophila-75	91.6441	91.64485	+0.0007
drosophila-100	92.38785	92.38855	+0.0007
drosophila-150	93.2112	93.21115	-0.0001
drosophila-200	93.52085	93.52115	+0.0003
drosophila-300	95.36085	95.36125	+0.0004
drosophila-500	95.6936	95.6938	+0.0002
maize-50	71.4703	71.47195	+0.0016
maize-75	82.1255	82.12625	+0.0007
maize-100	87.13175	87.1324	+0.0007
maize-150	91.6731	91.67345	+0.0003
maize-200	92.92045	92.9206	+0.0002
maize-300	96.70835	96.70855	+0.0002
maize-500	97.2899	97.29025	+0.0003
CHM13-50	90.635	90.63505	+0.0001
CHM13-75	92.5158	92.5161	+0.0003
CHM13-100	93.21985	93.22015	+0.0003
CHM13-150	94.14035	94.1402	-0.0002
CHM13-200	94.43405	94.43425	+0.0002
CHM13-300	95.62665	95.62645	-0.0002
CHM13-500	95.9505	95.9505	+0.0000
rye-50	69.1402	69.14095	+0.0007
rye-75	80.53445	80.5352	+0.0007
rye-100	85.6483	85.6488	+0.0005
rye-150	90.20375	90.2038	+0.0001
rye-200	91.4773	91.4774	+0.0001
rye-300	94.5574	94.55795	+0.0005
rye-500	95.1326	95.1331	+0.0005

Average difference: +0.0004

I tried a couple of other things. In the formula you suggested in #321, you had an epsilon_pair term. I tested with epsilon_pair=2 and epsilon_pair=5 and while individual accuracies change a little bit for each dataset, the average difference remains at +0.0004 in both cases. I guess the term would have an effect for other datasets, so if you want, we can include it, but we canot measure its effect with the data we have.

Also, as you suggested, I tried changing the three occurrences of the hard-coded -20 in the function to either -10 or -30 (this was in addition to introducing the std::max()). This made the average difference worse (-0.1860 and -0.0056, respectively).

ksahlin commented 1 year ago

Great to have this fix!

if you want, we can include it, but we canot measure its effect with the data we have.

With epsilon_pair I meant that it should be the deciding factor only in cases where 'all else equal'. An epsilon_pair 2 or 5 sounds a bit high and may interfere with disjoint alignments that have better alignment score. How about epsilon_pair=0.001? I would be ok with adding such term without having a proper SV dataset.

marcelm commented 1 year ago

Alright, I added an epsilon of 0.001. I also think it’s fine to have this without more extensive testing.

ksahlin commented 1 year ago

Ok, approved

ksahlin / strobealign

Penalize far apart pairs not more than as if they were two single ends #332

Comparing accuracy