Open janvanrijn opened 5 years ago
I completely agree with simplifying the task structure.
However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now. There are very good reasons to have custom holdouts (e.g. medical data with special cross-validation splits, benchmark datasets where a test set has been agreed upon, ...), and they are also required whenever someone wants to upload existing tasks and experiments. I do agree that the current way to create them is not very practical. Maybe users should upload a split file instead?
> However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now.
Fair, but so far we don't have any use-cases of people doing so, except the QSAR project, and I don't understand why they do it the current way, as we have met with them several times at workshops and agreed on the OpenML task format.
> Maybe users should upload a split file instead?
I am slightly against this, as it (i) requires a different type of check (ARFF fields, values) and (ii) requires us to store an additional type of entity (currently we don't really store split files, we only cache them, so they may all be removed at any time).
I am open to other solutions, but since over the course of 7 years we haven't had a single use-case that genuinely needed it, I feel reluctant to keep supporting this feature at the expense of other features (the maintenance burden on many parts of the system is already quite high, and this feature adds to it).
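To make point (i) concrete, here is a rough sketch of the kind of check an uploaded split file would need. It assumes each record of the split has already been parsed into a `(type, rowid, repeat, fold)` tuple; that layout, the function name `validate_split` and the specific rules are illustrative assumptions, not the server's actual validation.

```python
def validate_split(rows, n_instances):
    """Return a list of problems found in a parsed split file.

    `rows` is an iterable of (type, rowid, repeat, fold) tuples. This layout
    is assumed for illustration; it is not taken from the OpenML code base.
    """
    errors = []
    seen = {}  # (repeat, fold, rowid) -> "TRAIN" or "TEST"
    for i, (kind, rowid, repeat, fold) in enumerate(rows):
        # Field check: only two split types are accepted in this sketch.
        if kind not in ("TRAIN", "TEST"):
            errors.append(f"row {i}: unknown type {kind!r}")
            continue
        # Value check: the row id must refer to an existing instance.
        if not 0 <= rowid < n_instances:
            errors.append(f"row {i}: rowid {rowid} out of range")
            continue
        # Consistency check: no instance in both sets of the same fold.
        key = (repeat, fold, rowid)
        if key in seen and seen[key] != kind:
            errors.append(
                f"row {i}: rowid {rowid} is both TRAIN and TEST "
                f"in repeat {repeat}, fold {fold}"
            )
        seen[key] = kind
    return errors
```

An empty list means the split passed these particular checks; a real implementation would also have to validate the ARFF header itself.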
> However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now.
I agree that it would be good to have the original splits for such prominent datasets as MNIST.
> Fair, but so far we don't have any use-cases of people doing so.
And I agree with this too. If it helps to maintain the platform in the current way I am totally in favor of @janvanrijn's proposal.
> Maybe users should upload a split file instead?

Would your proposed layout change allow this to be added in the future? Maybe you could already take precautions at the database level so that it can easily be added later?
Looking at the current OpenML internal task implementation, I encountered several problems.
- […] `value` field in the database is a TEXT value of at most 64k characters)

This is the current database table scheme:

- `task_inputs` […] `task` […])
- `task_io_types` […])
- […] `task_id` and `input` […])

I would propose the following changes to the structure:

- […] `input` a true foreign key to the table `task_io_types`
- […] (and insert an int reference instead of textual reference)

Especially the last will have serious consequences. First of all, all current task inputs need to be converted to integer format (i.e., a key in another table). Second, some of them may not be convertible, for example the custom_holdoutset. I never liked this feature anyway: it's not well-tested and barely used. The following query shows the tasks that do make use of this feature, which are almost all (deactivated) datasets.
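As a rough illustration of the conversion problem (not the query referred to above), the sketch below scans a toy `task_inputs` table for values that cannot be turned into an integer key. It assumes one reading of the proposal, namely that every `value` should become an integer reference. The in-memory SQLite table, its rows and the function names are hypothetical stand-ins for the real database.

```python
import sqlite3


def find_unconvertible_inputs(conn):
    """Return (task_id, input) pairs whose value is not a plain integer."""
    rows = conn.execute(
        "SELECT task_id, input, value FROM task_inputs "
        "ORDER BY task_id, input"
    ).fetchall()
    bad = []
    for task_id, name, value in rows:
        try:
            int(value)  # would this value fit an integer key column?
        except (TypeError, ValueError):
            bad.append((task_id, name))
    return bad


def build_demo_db():
    """Create a toy table; the schema and rows are made up for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_inputs ("
        " task_id INTEGER, input TEXT, value TEXT,"
        " PRIMARY KEY (task_id, input))"
    )
    conn.executemany(
        "INSERT INTO task_inputs VALUES (?, ?, ?)",
        [
            (1, "source_data", "61"),
            (1, "estimation_procedure", "1"),
            (2, "source_data", "554"),
            # A list of row ids cannot become a single integer reference.
            (2, "custom_holdoutset", "[60000, 60001, 60002]"),
        ],
    )
    return conn
```

On this toy data, only the custom holdout row of task 2 is reported, which is the kind of input that would need separate handling or removal before the column type can change.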