numbbo / coco

Numerical Black-Box Optimization Benchmarking Framework
https://numbbo.github.io/coco

Postprocessing shows strange test results when no finite ERT is present #125

Closed brockho closed 8 years ago

brockho commented 9 years ago

When comparing e.g. RANDOMSEARCH, fmincon, and OQNLP, some strange results of the statistical tests are displayed in the tables. Look for example at f3:

Δ fopt 1e1 1e0 1e-1 1e-2 1e-3 1e-5 1e-7 #succ
f3 716 1622 1637 1642 1646 1650 1654 15/15
RANDOMSEARCH 6763 (4744) ∞★2 ∞★2 ∞★2 ∞★2 ∞5e6★2 0/15
fmincon 5.2 (6) 122 (95) ∞6e4 0/15
OQNLP 8.6 (12) 41 (27) ∞1e4 0/15

dtusar commented 8 years ago

@brockho Can you explain to me what is strange in these results?

brockho commented 8 years ago

The stars usually indicate statistically significant differences in the ERT (with the better algorithm having the star). Since all three algorithms have an infinite ERT here, it simply looks like a bug that RANDOMSEARCH is marked as better than the other two (see also the ERT for the easier targets 1e1 and 1e0, for which RANDOMSEARCH is worse than fmincon and OQNLP).

dtusar commented 8 years ago

I checked the code and here is the condition for adding the star:

nbtests * significance_versus_others[target_index][1] < 0.05, where nbtests = 24

The value of the significance_versus_others[target_index][1] is in this case equal to 3.0666297754500249e-06, so the condition above is easily satisfied.
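
To make the arithmetic concrete, here is a minimal sketch of the check quoted above. Multiplying the p-value by the number of tests looks like a Bonferroni-style correction. The function name `gets_star` and the constant names are hypothetical and not taken from the COCO code; only the numbers 24, 0.05 and the p-value come from this thread.

```python
NBTESTS = 24   # number of tests, as quoted above
ALPHA = 0.05   # significance level, as quoted above


def gets_star(p_value, nbtests=NBTESTS, alpha=ALPHA):
    """Return True if the corrected p-value is below the significance level.

    Hypothetical helper: it only mirrors the quoted condition
    nbtests * p_value < 0.05.
    """
    return nbtests * p_value < alpha


p = 3.0666297754500249e-06   # value reported for f3
corrected = NBTESTS * p      # roughly 7.36e-05, far below 0.05

print(corrected, gets_star(p))
print(gets_star(0.01))       # 24 * 0.01 = 0.24, so no star
```

Smaller stored values make the condition easier to satisfy, so a smaller value means a more significant difference.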

Shouldn't the more significantly different algorithms have higher values stored in the significance_versus_others?

nikohansen commented 8 years ago

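From what I see, significance_versus_others is a so-called p-value, that is, the probability of observing a difference at least this large (in this case, in the better direction) if the two algorithms actually performed the same. Smaller values therefore indicate more significant differences.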

Re the original question: the significance test does not test or consider ERT. It considers the ordering of the algorithms w.r.t. where they cut a horizontal-vertical line in the convergence plot. When an algorithm never cuts the horizontal part, ERT cannot be computed, but significance still can.
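
A simplified illustration of this ordering, assuming the horizontal part is the target f-value and the vertical part is a common evaluation budget. This is not COCO's actual implementation. The helper names (`ert`, `run_key`, `u_statistic`) and all run data are made up for the example.

```python
import math


def ert(evals, successes):
    """Total evaluations divided by the number of successful runs."""
    n_succ = sum(successes)
    return sum(evals) / n_succ if n_succ else math.inf


def run_key(evals, f_at_budget, success):
    """Sort key for one run (smaller is better).

    Runs that cut the horizontal line (target reached) come first and are
    ordered by evaluations. The remaining runs cut the vertical line and
    are ordered by the f-distance they reached at the budget.
    """
    return (0, evals) if success else (1, f_at_budget)


def u_statistic(keys_a, keys_b):
    """Count pairs in which a run of A beats a run of B (ties count 0.5)."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0
               for a in keys_a for b in keys_b)


# Made-up data: no run of either algorithm reaches the target.
# Each tuple is (evaluations, f-distance at the budget, success).
runs_a = [(5e6, 2.0, False), (5e6, 2.5, False), (5e6, 3.0, False)]
runs_b = [(1e4, 8.0, False), (1e4, 9.0, False), (1e4, 7.5, False)]

ert_a = ert([r[0] for r in runs_a], [r[2] for r in runs_a])
ert_b = ert([r[0] for r in runs_b], [r[2] for r in runs_b])
u = u_statistic([run_key(*r) for r in runs_a],
                [run_key(*r) for r in runs_b])

print(ert_a, ert_b)  # both infinite
print(u)             # A wins all 9 pairwise comparisons
```

In this toy data both ERT values are infinite, yet every run of A ranks ahead of every run of B, so a rank-based test can still report a significant difference.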

brockho commented 8 years ago

Okay, then the data table above does not look wrong anymore and we can close the issue?