Internal Task Format

janvanrijn commented 5 years ago

Looking at the current OpenML internal task implementation, I encountered several problems.

The database model is sub-optimal:
Because a single table now stores the inputs of all task types with all possible values, there are currently no database-level integrity checks on the data (in fact, the value field in the database is a TEXT column of at most 64k characters).
This makes indexing complicated.
Data integrity relies completely on my PHP checks, which are pretty ad hoc because of the almost free-for-all structure.
I have already found some inconsistencies at the data level (hopefully/presumably these were introduced before I implemented the integrity checks).

This is the current schema of the database table task_inputs:

task_id (int, foreign key to table task)
input (varchar, not a real foreign key, but loosely related to table task_io_types)
value (text, free for all)
Primary key: fields task_id and input
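
To illustrate the problem, here is a minimal sketch of the table above. It assumes an in-memory SQLite database as a stand-in for the real database. The column width and the sample input names and values are hypothetical, not taken from the actual OpenML data.

```python
import sqlite3

# In-memory SQLite as a stand-in for the production database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (task_id INTEGER PRIMARY KEY);
CREATE TABLE task_inputs (
    task_id INTEGER NOT NULL REFERENCES task(task_id),
    input   VARCHAR(64) NOT NULL,  -- plain text, not a real foreign key
    value   TEXT NOT NULL,         -- free for all
    PRIMARY KEY (task_id, input)
);
""")
conn.execute("INSERT INTO task (task_id) VALUES (1)")

# Hypothetical rows: one sensible, one with a malformed value,
# and one whose input name matches no known input type.
rows = [
    (1, "source_data", "61"),
    (1, "estimation_procedure", "not-a-number"),
    (1, "no_such_input", "anything"),
]
conn.executemany(
    "INSERT INTO task_inputs (task_id, input, value) VALUES (?, ?, ?)", rows
)

# The database accepts every row; only application code could reject them.
accepted = conn.execute("SELECT COUNT(*) FROM task_inputs").fetchone()[0]
print(accepted)
```

All three rows are stored, which is why any validation currently has to live in the application layer.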

I would propose the following changes to the structure:

Make the input field a true foreign key to the table task_io_types (storing an integer reference instead of a textual one).
Make the value field an integer (unfortunately, we can't allow foreign keys here without exploding the number of tables).
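
A rough sketch of what the proposed structure could look like, again using in-memory SQLite as a stand-in. The surrogate key column io_type_id, the sample input names, and the CHECK constraint used to force integer values are assumptions made for illustration; the real migration might look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE task (task_id INTEGER PRIMARY KEY);
CREATE TABLE task_io_types (
    io_type_id INTEGER PRIMARY KEY,   -- hypothetical surrogate key
    name       TEXT NOT NULL UNIQUE
);
CREATE TABLE task_inputs (
    task_id INTEGER NOT NULL REFERENCES task(task_id),
    input   INTEGER NOT NULL REFERENCES task_io_types(io_type_id),
    value   INTEGER NOT NULL CHECK (typeof(value) = 'integer'),
    PRIMARY KEY (task_id, input)
);
""")
conn.execute("INSERT INTO task (task_id) VALUES (1)")
conn.executemany(
    "INSERT INTO task_io_types (io_type_id, name) VALUES (?, ?)",
    [(1, "source_data"), (2, "estimation_procedure")],  # hypothetical names
)


def try_insert(task_id, input_id, value):
    """Return True if the database accepts the row, False if it rejects it."""
    try:
        conn.execute(
            "INSERT INTO task_inputs (task_id, input, value) VALUES (?, ?, ?)",
            (task_id, input_id, value),
        )
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert(1, 1, 61)                      # valid row
bad_input = try_insert(1, 99, 61)              # unknown input type
bad_value = try_insert(1, 2, "not-a-number")   # non-integer value
print(ok, bad_input, bad_value)
```

With this layout the database itself rejects unknown input types and non-integer values, so the PHP checks would no longer be the only line of defence.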

The last change in particular will have serious consequences. First of all, all current task inputs need to be converted to integer format (i.e., a key in another table). Second, some of them may not be convertible, for example the custom_holdoutset. I never liked this feature anyway: it's not well tested and barely used. The following query shows the tasks that do make use of this feature, together with their number of runs; almost all of them are on (deactivated) datasets.

SELECT task_id, COUNT(*)
FROM run
WHERE task_id IN (SELECT task_id FROM `task_inputs` WHERE input = "custom_testset")
GROUP BY task_id
ORDER BY COUNT(*);
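
As a first step towards the conversion described above, the existing rows could be split into values that already are plain integers and values that would need a lookup key (or cannot be converted at all). The sketch below is only an illustration: the function name split_convertible and the sample rows are hypothetical.

```python
def split_convertible(rows):
    """Partition (task_id, input, value) rows by whether value is a plain integer.

    Returns (convertible, rejected). Convertible rows carry the value as int;
    rejected rows keep the original text for manual inspection.
    """
    convertible, rejected = [], []
    for task_id, input_name, value in rows:
        text = value.strip()
        # Accept only non-negative ASCII digit strings such as "61".
        if text.isascii() and text.isdigit():
            convertible.append((task_id, input_name, int(text)))
        else:
            rejected.append((task_id, input_name, value))
    return convertible, rejected


# Hypothetical sample rows, not real OpenML data.
sample = [
    (1, "source_data", "61"),
    (1, "target_feature", "class"),
    (2, "custom_testset", "[1, 5, 9]"),
]
convertible, rejected = split_convertible(sample)
print(convertible)
print(rejected)
```

Rows that end up in the rejected list are the ones that would need either a new lookup table or a decision to drop the feature.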

joaquinvanschoren commented 5 years ago

I completely agree with simplifying the task structure.

However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now. There are very good reasons to have custom holdouts (e.g. medical data with special cross-validation splits, benchmark datasets where a test set has been agreed on, ...), and they are also required whenever someone wants to upload existing tasks and experiments. I do agree that the current way to create them is not very practical. Maybe users should upload a split file instead?

janvanrijn commented 5 years ago

> However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now.

Fair, but so far we don't have any use cases of people doing so, except the QSAR project. I don't understand why they do it the current way, as we have met with them several times at workshops and agreed on the OpenML task format.

> Maybe users should upload a split file instead?

I am slightly against this, as it (i) requires a different type of check (ARFF fields, values) and (ii) requires us to store an additional type of entity (currently we don't really store split files; we merely cache them, so they may all be removed at any time).

I am open to other solutions, but since over the course of 7 years we haven't had a single use case that genuinely needed it, I feel reluctant to keep supporting this feature at the expense of other features (the maintenance burden on many parts of the system is already quite high, and this feature adds to it).

mfeurer commented 5 years ago

> However, I would really like to have a way to allow custom holdouts, even if we do it in a completely different way than we do now.

I agree that it would be good to have the original splits for such prominent datasets as MNIST.

> Fair, but so far we don't have any use-cases of people doing so.

And I agree with this too. If it helps to maintain the platform in the current way, I am totally in favor of @janvanrijn's proposal.

> Maybe users should upload a split file instead?

Would your proposed layout change allow this to be added in the future? Maybe you could already take precautions at the database level so that it can easily be added later?

openml / OpenML

Internal Task Format #913