openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Internal Task Format #913

Open janvanrijn opened 5 years ago

janvanrijn commented 5 years ago

Looking at the current OpenML internal task implementation, I encountered several problems.

This is the current database table schema: task_inputs

I would propose the following changes to the structure:

The last change in particular will have serious consequences. First, all current task inputs need to be converted to integer format (i.e., a key in another table). Second, some of them may not be convertible, for example the custom_holdoutset. I never liked this feature anyway: it is not well tested and barely used. The following query shows the tasks that do make use of this feature, almost all of which belong to (deactivated) datasets.

SELECT task_id, COUNT(*)
FROM run
WHERE task_id IN (
    SELECT task_id FROM `task_inputs` WHERE input = "custom_testset"
)
GROUP BY task_id
ORDER BY COUNT(*);
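
To see which task inputs would block a conversion to integer keys, a check along the following lines could be run first. This is only a sketch against an in-memory SQLite database with toy rows. The `value` column, the helper names, and the example values are assumptions made for illustration and are not taken from the production schema.

```python
import re
import sqlite3


def build_toy_db():
    """Create an in-memory stand-in for task_inputs (hypothetical columns and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_inputs (task_id INTEGER, input TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO task_inputs VALUES (?, ?, ?)",
        [
            (1, "source_data", "61"),
            (1, "estimation_procedure", "1"),
            (2, "source_data", "40"),
            # Illustrative free-form value that cannot become a foreign key.
            (2, "custom_testset", "[0, 5, 9]"),
        ],
    )
    return conn


def non_integer_inputs(conn):
    """Return the (task_id, input, value) rows whose value is not a plain non-negative integer."""
    rows = conn.execute(
        "SELECT task_id, input, value FROM task_inputs ORDER BY task_id, input"
    ).fetchall()
    return [row for row in rows if not re.fullmatch(r"[0-9]+", str(row[2]))]


if __name__ == "__main__":
    for row in non_integer_inputs(build_toy_db()):
        print(row)
```

Every row this returns would need either a mapping to a key in another table or a decision to drop the input.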

joaquinvanschoren commented 5 years ago

I completely agree with simplifying the task structure.

However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now. There are very good reasons to have custom holdouts (e.g., medical data with special cross-validation splits, or benchmark datasets for which a test set has been agreed on), and they are also required whenever someone wants to upload existing tasks and experiments. I do agree that the current way to create them is not so practical. Maybe users should upload a split file instead?

janvanrijn commented 5 years ago

> However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now.

Fair, but so far we don't have any use cases of people doing so, except the QSAR project, and I don't understand why they do it the current way, as we have met with them at workshops several times and agreed on the OpenML task format.

> Maybe users should upload a split file instead?

I am slightly against this, as it (i) requires a different type of check (ARFF fields, values) and (ii) requires us to store an additional type of entity (currently we don't really store split files; we cache them, and they may all be removed at any time).
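
To make point (i) concrete, the kind of check an uploaded split file would need might look roughly like the sketch below. It assumes the file has already been parsed into `(type, rowid, repeat, fold)` tuples. That field layout and the function name `check_split_rows` are assumptions for illustration, not an existing API.

```python
def check_split_rows(rows, n_instances):
    """Validate parsed split rows and return a list of error messages.

    rows: iterable of (type, rowid, repeat, fold) tuples (assumed layout).
    n_instances: number of rows in the dataset the split refers to.
    """
    errors = []
    seen = {}  # (repeat, fold) -> {"TRAIN": set of rowids, "TEST": set of rowids}
    for i, (kind, rowid, repeat, fold) in enumerate(rows):
        if kind not in ("TRAIN", "TEST"):
            errors.append(f"row {i}: unknown type {kind!r}")
            continue
        if not (isinstance(rowid, int) and 0 <= rowid < n_instances):
            errors.append(f"row {i}: rowid {rowid!r} out of range")
            continue
        parts = seen.setdefault((repeat, fold), {"TRAIN": set(), "TEST": set()})
        parts[kind].add(rowid)

    # Per (repeat, fold): train and test must be disjoint, and test must not be empty.
    for (repeat, fold), parts in sorted(seen.items()):
        overlap = parts["TRAIN"] & parts["TEST"]
        if overlap:
            errors.append(
                f"repeat {repeat}, fold {fold}: rows {sorted(overlap)} in both TRAIN and TEST"
            )
        if not parts["TEST"]:
            errors.append(f"repeat {repeat}, fold {fold}: empty test set")
    return errors


if __name__ == "__main__":
    example = [("TRAIN", 0, 0, 0), ("TEST", 0, 0, 0), ("TEST", 7, 0, 0)]
    for message in check_split_rows(example, n_instances=3):
        print(message)
```

A check like this would sit alongside the existing dataset validation rather than reuse it, which is the extra maintenance cost I mean.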

I am open to other solutions, but since over the course of 7 years we haven't had a single use case that genuinely needed it, I am reluctant to keep supporting this feature at the expense of other features (the maintenance burden on many parts of the system is already quite high, and this feature adds to it).

mfeurer commented 5 years ago

> However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now.

I agree that it would be good to have the original splits for such prominent datasets as MNIST.

> Fair, but so far we don't have any use-cases of people doing so.

And I agree with this too. If it helps to maintain the platform in the current way, I am totally in favor of @janvanrijn's proposal.

> Maybe users should upload a split file instead?

Would your proposed layout change allow this to be added in the future? Maybe you could already take precautions at the database level so that it can easily be added later?