@brockho Can you explain to me what is strange in these results?
The stars usually indicate statistically significant differences in ERT (with the better algorithm getting the star). Since all three algorithms have an infinite ERT here, it simply seems like a bug that RANDOMSEARCH is reported as better than the other two (see also the ERTs for the easier targets 1e1 and 1e0, for which RANDOMSEARCH is worse than fmincon and OQNLP).
I checked the code, and here is the condition for adding the star: `nbtests * significance_versus_others[target_index][1] < 0.05`, where `nbtests = 24`.
The value of `significance_versus_others[target_index][1]` is in this case equal to `3.0666297754500249e-06`, so the condition above is easily satisfied.
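For context, with `nbtests = 24` this check amounts to a Bonferroni correction across the 24 benchmark functions. A minimal sketch of that logic, with hypothetical names rather than the actual post-processing code:

```python
# Sketch of the star condition; names are hypothetical, the real check
# lives in the COCO/bbob post-processing.
NB_TESTS = 24  # one test per benchmark function, hence a Bonferroni factor of 24

def deserves_star(p_value, alpha=0.05, nb_tests=NB_TESTS):
    """True if the difference stays significant after Bonferroni correction."""
    return nb_tests * p_value < alpha

print(deserves_star(3.0666297754500249e-06))  # True: 24 * 3.07e-06 < 0.05
```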
Shouldn't the more significantly different algorithms have higher values stored in `significance_versus_others`?
From what I can see, `significance_versus_others` holds a so-called p-value, that is, the probability of observing a difference at least as large as the one in the data, assuming the two algorithms actually perform the same. Smaller values therefore indicate *stronger* significance, so no: the more significantly different algorithms have lower, not higher, values stored there.
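A small illustration of this, using SciPy's rank-sum test (which is, to my understanding, the kind of test behind these tables): the more clearly two samples differ, the smaller the p-value gets.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 30)          # reference sample
b_similar = rng.normal(0.1, 1.0, 30)  # barely different from a
b_better = rng.normal(3.0, 1.0, 30)   # clearly different from a

# Smaller p-value = stronger evidence of a difference:
print(ranksums(a, b_similar).pvalue)  # comparatively large p-value
print(ranksums(a, b_better).pvalue)   # p-value near zero
```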
Re the original question: the significance test does not test or consider the ERT. It considers the ordering of the algorithms w.r.t. where they cut a horizontal-vertical line in the convergence plot. When an algorithm never cuts the horizontal part (i.e., never reaches the target precision), the ERT cannot be computed, but significance still can.
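In other words, runs can be ranked by where they cross that line: by runtime if they reach the target (the horizontal part), and otherwise by the best f-value reached within the budget (the vertical part). A minimal sketch of this idea, with hypothetical helpers and not the actual COCO implementation (ties are ignored for brevity):

```python
from scipy.stats import ranksums

def crossing_key(runtime_to_target, best_f_value):
    """Smaller keys are better: runs that hit the target (finite runtime)
    cut the horizontal part and sort before runs that only cut the
    vertical part, which are ordered by their best achieved f-value."""
    if runtime_to_target is not None:
        return (0, runtime_to_target)
    return (1, best_f_value)

def compare(runs1, runs2):
    """Rank all runs jointly by their crossing point, then apply a
    rank-sum test to the resulting ranks."""
    keys = [crossing_key(*r) for r in runs1 + runs2]
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    rank_of = [0.0] * len(keys)
    for rank, i in enumerate(order):
        rank_of[i] = float(rank)
    return ranksums(rank_of[:len(runs1)], rank_of[len(runs1):])

# (runtime_to_target or None, best f-value reached): neither algorithm ever
# reaches the target, so both ERTs are infinite, yet the test still applies.
alg1 = [(None, 1e-3), (None, 5e-3), (None, 2e-3), (None, 4e-3), (None, 3e-3)]
alg2 = [(None, 1e-1), (None, 3e-1), (None, 2e-1), (None, 5e-1), (None, 4e-1)]
print(compare(alg1, alg2).pvalue)  # small p-value: alg1 ranks consistently better
```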
Okay, then the data table above does not look wrong anymore, and we can close the issue?
When comparing, e.g., RANDOMSEARCH, fmincon, and OQNLP, some strange results of the statistical tests are displayed in the tables. Look, for example, at f3: