Benchmark results and Overfitting

joaquinvanschoren commented 6 years ago

Maybe I'm thinking too far ahead, but there are a few obvious criticisms that we may get from reviewers that are related to how this benchmark is going to be used (assuming we want this to be the reference benchmarks for the field).

2 important issues here that are typical for benchmarking studies:

Cheating: people can look at the training sets and just publish the correct predictions. This requires some hacking of the OpenML APIs but it's not impossible. How about an algorithm that queries OpenML for the best flow for each task and just runs that? I guess we need a more explicit and visible way to report cheating/issues on the result page and offer a switch to only show results without issues? If only to discourage people from doing this. Do we want people to give their real names when they submit results?

Maybe we can - to some extent - run the flows on the server and try to reproduce the results, and then add a special label to those runs.

Overfitting: on a single task, it is quite easy to submit results many results until - by chance - they overfit on the entire 10-fold CV. For the benchmark, the results are aggregated over multiple tasks, so overfitting is less likely, but not impossible.

Some ways to alleviate this problem:

Use 10x10fold CV instead of normal 10fold CV tasks, maybe as an OpenML-CC18x10 benchmark that people can choose to use instead (assuming that the results here are more authoritative). Maybe even an OpenML-CC18x100.
Have an 'evolving' benchmark: as new datasets are added in CC19, CC20 etc., we can show how results/rankings change over time. Overfitted flows on CC18 will likely perform worse in CC19 etc.
Other ideas? Differential privacy?

In addition, we should also show an aggregated view of the score on the individual tasks (e.g. violin plots?) and do statistical tests? We could do the typical Friedman-Nemenyi test, but not sure if that will work that well on 'only' 80-something datasets.

We could of course wave our hands and say 'yes, but we are only solving the problem of non-standardized benchmark tests and these issues apply to any benchmarking study' but in a way these issues are connected...

janvanrijn commented 6 years ago

I like the idea of an evolving benchmark a lot. If I remember correctly you also had this in your thesis, a plot showing how the performance of several simple algorithms decreased over time as datasets became more complex.

I'm not so concerned about the cheating part. Machine Learning research is vulnerable to many forms of (accidental and actual) cheating, e.g., tuning the random seed, cherry picking tasks. In this sense, I think OpenML has the potential to solve more of these issues (e.g., with a fn 'rerun flow locally') than it creates.

mfeurer commented 6 years ago

Another way of overfitting to a single task: Using the score calculated by OpenML to do Bayesian optimization.

joaquinvanschoren commented 6 years ago

Heh, yeah. But you could do the same by looking at your test sets. Hence, cheating? I would definitely flag a flow that does that?

On Mon, 19 Mar 2018 at 10:04, Matthias Feurer notifications@github.com wrote:

Another way of overfitting to a single task: Using the score calculated by OpenML to do Bayesian optimization.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/openml/OpenMLFirstBenchmarkSuite/issues/31#issuecomment-374143477, or mute the thread https://github.com/notifications/unsubscribe-auth/ABpQV7yjKPb2DjL_MpGYSZo2fWofFWOOks5tf3ShgaJpZM4StcnP .

-- Thank you, Joaquin

mfeurer commented 6 years ago

But you wouldn't see this on the website as the user would run BO locally. And it wouldn't be an issue of the flow, but the run. You could tamper with the connector to do this completely local of course, too.

openml / benchmark-suites

Benchmark results and Overfitting #31