joaquinvanschoren opened this issue 6 years ago
I like the idea of an evolving benchmark a lot. If I remember correctly, you also had this in your thesis: a plot showing how the performance of several simple algorithms decreased over time as datasets became more complex.
I'm not so concerned about the cheating part. Machine learning research is vulnerable to many forms of (accidental and actual) cheating, e.g., tuning the random seed or cherry-picking tasks. In this sense, I think OpenML has the potential to solve more of these issues (e.g., with a 'rerun flow locally' function) than it creates.
Another way of overfitting to a single task: Using the score calculated by OpenML to do Bayesian optimization.
Heh, yeah. But you could do the same by looking at your test sets. Hence, cheating? I would definitely flag a flow that does that?
But you wouldn't see this on the website, as the user would run the Bayesian optimization locally. And it wouldn't be an issue of the flow, but of the run. You could of course also tamper with the connector to do this completely locally.
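To make the concern concrete, here is a minimal sketch of the anti-pattern being discussed: treating the score computed on the task's test split (which is effectively what the server reports) as the objective of a hyperparameter search, versus selecting hyperparameters with cross-validation on the training split only. This uses plain scikit-learn with synthetic data for illustration, not the OpenML connector; the names and candidate grid are made up. A Bayesian optimization loop fed with the server score would overfit the test split in exactly the same way, just more efficiently.

```python
# Sketch of the anti-pattern: picking hyperparameters by repeatedly querying the
# score on the task's test split, i.e. the score the server would report.
# Synthetic data and a simple grid stand in for a real task and a BO loop.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
# The task's predefined split: the test part is what gets scored "on the server".
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidate_C = [0.01, 0.1, 1, 10, 100]

# Cheating: choose C by maximizing the test-split score directly.
test_scores = [SVC(C=C).fit(X_train, y_train).score(X_test, y_test)
               for C in candidate_C]
C_cheat = candidate_C[int(np.argmax(test_scores))]

# Legitimate: choose C by cross-validation on the training split only,
# then touch the test split exactly once for the final report.
cv_scores = [cross_val_score(SVC(C=C), X_train, y_train, cv=5).mean()
             for C in candidate_C]
C_fair = candidate_C[int(np.argmax(cv_scores))]

print("C tuned on test scores:", C_cheat, "-> optimistically biased estimate")
print("C tuned by CV on train:", C_fair, "-> reported once:",
      SVC(C=C_fair).fit(X_train, y_train).score(X_test, y_test))
```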
Maybe I'm thinking too far ahead, but there are a few obvious criticisms we may get from reviewers related to how this benchmark is going to be used (assuming we want this to be the reference benchmark suite for the field).
Two important issues here that are typical of benchmarking studies:
Maybe we can, to some extent, run the flows on the server and try to reproduce the results, and then add a special label to those runs.
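Something like the following, as a rough sketch of the labelling logic only; no such server-side verification API exists, so the OpenML-specific parts (fetching a run, re-executing its flow) are abstracted into arguments, and the tolerance value and label names are arbitrary placeholders.

```python
# Hypothetical sketch of a server-side reproduction check. Only the comparison
# and labelling logic is shown; re-executing the flow is passed in as a callable.
from typing import Callable

TOLERANCE = 1e-3  # arbitrary: how close a re-run must be to count as reproduced


def verify_run(reported_score: float, rerun_flow: Callable[[], float]) -> str:
    """Re-execute a run's flow and return the label the server would attach."""
    rerun_score = rerun_flow()
    if abs(rerun_score - reported_score) <= TOLERANCE:
        return "verified"
    return "not-reproducible"


# Example with a dummy re-run standing in for the actual flow execution:
print(verify_run(0.874, lambda: 0.8739))  # -> "verified"
```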
Some ways to alleviate this problem:
In addition, we should also show an aggregated view of the scores on the individual tasks (e.g. violin plots?) and do statistical tests. We could do the typical Friedman-Nemenyi test, but I'm not sure how well that will work on 'only' 80-something datasets.
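For reference, a quick sketch of how that could look on a (datasets x flows) results matrix. It assumes the scikit-posthocs package for the Nemenyi post-hoc step, and the scores here are random noise just to make it executable; the flow names are illustrative.

```python
# Sketch: Friedman test across ~80 datasets, followed by a Nemenyi post-hoc test.
# Random scores stand in for the real (dataset x flow) results matrix.
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # assumes the scikit-posthocs package is installed

rng = np.random.default_rng(0)
flows = ["random_forest", "svm", "xgboost", "knn"]          # illustrative names
scores = pd.DataFrame(rng.uniform(0.6, 0.9, size=(80, len(flows))),
                      columns=flows)                         # 80 datasets x 4 flows

# Friedman test: are the flows' rank distributions distinguishable at all?
stat, p = friedmanchisquare(*[scores[f] for f in flows])
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")

# Nemenyi post-hoc: pairwise comparisons, only meaningful if Friedman rejects.
print(sp.posthoc_nemenyi_friedman(scores))
```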
We could of course wave our hands and say 'yes, but we are only solving the problem of non-standardized benchmarks, and these issues apply to any benchmarking study', but in a way these issues are connected...