openml / openml-r

R package to interface with OpenML
http://openml.github.io/openml-r/

Some of the OpenML100 disappeared #409

Closed PhilippPro closed 6 years ago

PhilippPro commented 6 years ago
> task.ids = listOMLTasks(tag = "OpenML100", estimation.procedure = "10-fold Crossvalidation")$task.id
> length(task.ids)
[1] 92

This broke my complete analysis and is disappointing.

joaquinvanschoren commented 6 years ago

Hi Philipp, they are all still there: https://www.openml.org/search?q=tags.tag%3AOpenML100%2520status%3Aall&type=data

A few of them were fixed, however. For example, binary features that had been annotated as numeric are now properly annotated. We created new versions and deactivated the old ones to avoid confusion. You can still get the originals if you also ask for the deactivated ones.

> task.ids = listOMLTasks(tag = "OpenML100", estimation.procedure = "10-fold Crossvalidation")$task.id
> length(task.ids)
[1] 92
> task.ids = listOMLTasks(tag = "OpenML100", status='deactivated', estimation.procedure = "10-fold Crossvalidation")$task.id
> length(task.ids)
[1] 8

It would be nice if listOMLTasks had an option to return all tasks regardless of status.
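
Until such an option exists, the two listings can be merged on the client side. The following is only a minimal sketch: the helper name listTaskIdsAnyStatus and its lister argument are made up for illustration, and it assumes that "active" is the status listOMLTasks uses by default.

```r
# Sketch: collect task IDs for several statuses and merge them.
# `lister` defaults to OpenML::listOMLTasks; any function with the same
# arguments can be passed instead (e.g. a stub for offline checks).
# Assumption: "active" is the default status of listOMLTasks.
listTaskIdsAnyStatus = function(tag, estimation.procedure,
  statuses = c("active", "deactivated"),
  lister = OpenML::listOMLTasks) {
  ids = lapply(statuses, function(s) {
    lister(tag = tag, status = s,
      estimation.procedure = estimation.procedure)$task.id
  })
  # Drop duplicates and return a sorted integer vector
  sort(unique(as.integer(unlist(ids))))
}

# Usage (needs the OpenML package and network access):
# task.ids = listTaskIdsAnyStatus("OpenML100", "10-fold Crossvalidation")
# length(task.ids)
```

Passing the listing function as an argument keeps the merge logic independent of the server, so it can be checked without an OpenML connection.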

Hope that helps!

giuseppec commented 6 years ago

"It would be nice if listOMLTasks had an option to return all tasks regardless of status."

Probably yes. But does the API allow this?

giuseppec commented 6 years ago

@PhilippPro you can still use this: getOMLStudy(study = "OpenML100"). With studies you can store all information regarding an experiment. You probably also want to create your own study, like the ones listed here: https://www.openml.org/search?type=study
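
As a rough sketch of how the task IDs could be pulled out of the study object: the helper studyTaskIds and its getter argument are hypothetical, and the code assumes the returned study stores its tasks in a data frame under $tasks with a task.id column.

```r
# Sketch: fetch a study and return the sorted, unique IDs of its tasks.
# `getter` defaults to OpenML::getOMLStudy; a stub can be passed for testing.
# Assumption: the study object has a `tasks` element with a `task.id` column.
studyTaskIds = function(study, getter = OpenML::getOMLStudy) {
  s = getter(study = study)
  sort(unique(as.integer(s$tasks$task.id)))
}

# Usage (needs the OpenML package and network access):
# task.ids = studyTaskIds("OpenML100")
# length(task.ids)
```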

There is a mini tutorial at https://www.openml.org/guide/benchmark, but the following may be a better description:

a) To create a benchmark suite, we need to use tasks (not datasets). That is, if there is no task for a dataset yet, you first have to create one from it (see https://www.openml.org/new/task; this is currently only possible through the web interface).
b) Create a study at https://www.openml.org/new/study (I think this is currently also only possible through the web interface) and note the study ID once the study exists; you will need it for step c). If you set an alias string when creating the study, it can also be used to retrieve the benchmark suite (alternatively, use the study ID, see step d).
c) Add a tag called "study_X", where X is your study ID, to the tasks (and datasets). This should be possible through the clients (e.g. R) or through the web interface.
d) Now you have your benchmark suite. In R, you can get the information using getOMLStudy(IDofStudy) or getOMLStudy("your-alias-string"). Study information can be found online at https://www.openml.org/s/IDofStudy
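
Step c) could look roughly like this from R. This is a sketch under assumptions: it assumes the package's tagOMLObject() accepts id, object and tags arguments, the helper names studyTag and tagTasksForStudy are made up, and the IDs in the usage lines are placeholders.

```r
# Sketch of step c): tag every task with "study_X", X being the study ID.
studyTag = function(study.id) paste0("study_", study.id)

# `tagger` defaults to OpenML::tagOMLObject (assumed arguments: id, object,
# tags); a stub can be passed to check the logic without touching the server.
tagTasksForStudy = function(task.ids, study.id, tagger = OpenML::tagOMLObject) {
  for (id in task.ids) {
    tagger(id = id, object = "task", tags = studyTag(study.id))
  }
  invisible(task.ids)
}

# Usage (placeholder IDs; needs the OpenML package, network access and an
# API key that is allowed to write):
# tagTasksForStudy(task.ids = c(3L, 31L), study.id = 123L)
# Step d) afterwards: getOMLStudy(123L)
```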

Let me know if you need more information.

joaquinvanschoren commented 6 years ago

I committed an update to allow status/all: https://github.com/openml/OpenML/commit/5179e125c2d9ad31841d29d1300f50eef1b559c0

Jan still needs to review it.

PhilippPro commented 6 years ago

Thanks a lot for the help. I have to submit the paper today, so I will still use all 39 datasets, although I know that for some of them the performance is not really tunable because, e.g., random forest in the default setting already provides perfect prediction. As the OpenML100 paper is still on arXiv, you should probably not delete these datasets from the benchmark suite. Maybe the current solution is the best one, but it is a bit confusing.

giuseppec commented 6 years ago

You can still use all OpenML100 datasets if you already did the study based on them. We will never remove datasets from benchmark suites. Datasets can only get deactivated (they will still be available on OpenML, as will all experiments).