@brockho Can you explain to me what is strange in these results?
The stars usually indicate statistically significant differences in ERT (with the better algorithm getting the star). Since all three algorithms have an infinite ERT here, it simply seems like a bug that RANDOMSEARCH is reported as better than the other two (see also the ERTs for the easier targets 1e1 and 1e0, for which RANDOMSEARCH is worse than fmincon and OQNLP).
I checked the code, and here is the condition for adding the star: `nbtests * significance_versus_others[target_index][1] < 0.05`, where `nbtests = 24`.
The value of `significance_versus_others[target_index][1]` is in this case equal to `3.0666297754500249e-06`, so the condition above is easily satisfied.
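For context, with `nbtests = 24` this check amounts to a Bonferroni correction across the 24 benchmark functions. A minimal sketch of that logic, with hypothetical names rather than the actual post-processing code:

```python
# Sketch of the star condition; names are hypothetical, the real check
# lives in the COCO/bbob post-processing.
NB_TESTS = 24  # one test per benchmark function, hence a Bonferroni factor of 24

def deserves_star(p_value, alpha=0.05, nb_tests=NB_TESTS):
    """True if the difference stays significant after Bonferroni correction."""
    return nb_tests * p_value < alpha

print(deserves_star(3.0666297754500249e-06))  # True: 24 * 3.07e-06 < 0.05
```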
Shouldn't the more significantly different algorithms have higher values stored in `significance_versus_others`?
From what I can see, `significance_versus_others` holds a so-called p-value, that is, the probability of observing a difference at least as large as the one in the data, assuming the two algorithms actually perform the same. Smaller values therefore indicate *stronger* significance, so no: the more significantly different algorithms have lower, not higher, values stored there.
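A small illustration of this, using SciPy's rank-sum test (which is, to my understanding, the kind of test behind these tables): the more clearly two samples differ, the smaller the p-value gets.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 30)          # reference sample
b_similar = rng.normal(0.1, 1.0, 30)  # barely different from a
b_better = rng.normal(3.0, 1.0, 30)   # clearly different from a

# Smaller p-value = stronger evidence of a difference:
print(ranksums(a, b_similar).pvalue)  # comparatively large p-value
print(ranksums(a, b_better).pvalue)   # p-value near zero
```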
Re the original question: the significance test does not test or consider the ERT. It considers the ordering of the algorithms w.r.t. where they cut a horizontal-vertical line in the convergence plot. When an algorithm never cuts the horizontal part (i.e., never reaches the target precision), the ERT cannot be computed, but significance still can.
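In other words, runs can be ranked by where they cross that line: by runtime if they reach the target (the horizontal part), and otherwise by the best f-value reached within the budget (the vertical part). A minimal sketch of this idea, with hypothetical helpers and not the actual COCO implementation (ties are ignored for brevity):

```python
from scipy.stats import ranksums

def crossing_key(runtime_to_target, best_f_value):
    """Smaller keys are better: runs that hit the target (finite runtime)
    cut the horizontal part and sort before runs that only cut the
    vertical part, which are ordered by their best achieved f-value."""
    if runtime_to_target is not None:
        return (0, runtime_to_target)
    return (1, best_f_value)

def compare(runs1, runs2):
    """Rank all runs jointly by their crossing point, then apply a
    rank-sum test to the resulting ranks."""
    keys = [crossing_key(*r) for r in runs1 + runs2]
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    rank_of = [0.0] * len(keys)
    for rank, i in enumerate(order):
        rank_of[i] = float(rank)
    return ranksums(rank_of[:len(runs1)], rank_of[len(runs1):])

# (runtime_to_target or None, best f-value reached): neither algorithm ever
# reaches the target, so both ERTs are infinite, yet the test still applies.
alg1 = [(None, 1e-3), (None, 5e-3), (None, 2e-3), (None, 4e-3), (None, 3e-3)]
alg2 = [(None, 1e-1), (None, 3e-1), (None, 2e-1), (None, 5e-1), (None, 4e-1)]
print(compare(alg1, alg2).pvalue)  # small p-value: alg1 ranks consistently better
```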
Okay, then the data table above does not look wrong anymore, and we can close the issue?
When comparing, e.g., RANDOMSEARCH, fmincon, and OQNLP, some strange results of the statistical tests are displayed in the tables. Look, for example, at f3: