openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Simplify data splits for classification/regression tasks #1110

Open PGijsbers opened 3 years ago

PGijsbers commented 3 years ago

Classification and Regression tasks feature estimation procedures: (ordered) holdout, R-repeated K-fold cross-validation, and test on training data. Currently the split files are organized as follows (ARFF notation):

@attribute type {TRAIN,TEST}
@attribute rowid numeric
@attribute repeat numeric
@attribute fold numeric
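
For concreteness, with this layout a single repeat of 3-fold CV over three rows would be stored as something like the following (rowids and fold assignments are hypothetical):

```
TRAIN,0,0,0
TRAIN,1,0,0
TEST,2,0,0
TEST,0,0,1
TRAIN,1,0,1
TRAIN,2,0,1
TRAIN,0,0,2
TEST,1,0,2
TRAIN,2,0,2
```

Each row of the dataset appears once per fold, so even a single repeat already has K records per dataset row.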

I think the type column is not necessary for the supported evaluation strategies and produces needless duplication (and hence server strain and bandwidth use). Looking at the split file, we see that every row in the dataset introduces R*K rows in the split file for R-repeated K-fold cross-validation. If we drop the type column and simply indicate which fold each sample belongs to in each repeat, the split file should be a factor of K smaller (plus the benefit of dropping the TRAIN/TEST column data). For holdout tasks it would not be obvious which fold is train and which is test, but we could either adopt a convention here (0 is train, 1 is test) or describe it explicitly in the task description XML. The same applies if we want to preserve the order in which folds in K-fold CV are evaluated.
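A back-of-the-envelope sketch of the two layouts (this is not OpenML code; all function and variable names here are made up) shows the factor-K reduction:

```python
# Compare the current split layout (type, rowid, repeat, fold) with the
# proposed one (rowid, repeat, fold-that-row-is-TEST-in). Illustrative only.
import random

def current_format(n_rows, n_repeats, n_folds, seed=0):
    """Current layout: one (type, rowid, repeat, fold) record per row,
    per fold, per repeat."""
    rng = random.Random(seed)
    records = []
    for r in range(n_repeats):
        fold_of = [i % n_folds for i in range(n_rows)]
        rng.shuffle(fold_of)
        for k in range(n_folds):
            for rowid in range(n_rows):
                kind = "TEST" if fold_of[rowid] == k else "TRAIN"
                records.append((kind, rowid, r, k))
    return records

def proposed_format(n_rows, n_repeats, n_folds, seed=0):
    """Proposed layout: one (rowid, repeat, fold) record per row per repeat,
    storing only the fold in which the row is a test sample; the row is
    implicitly a train sample in every other fold."""
    rng = random.Random(seed)
    records = []
    for r in range(n_repeats):
        fold_of = [i % n_folds for i in range(n_rows)]
        rng.shuffle(fold_of)
        records += [(rowid, r, fold_of[rowid]) for rowid in range(n_rows)]
    return records

old = current_format(n_rows=1000, n_repeats=2, n_folds=10)
new = proposed_format(n_rows=1000, n_repeats=2, n_folds=10)
print(len(old), len(new))  # 20000 2000: a factor K (= 10) fewer records
```

The proposed layout is lossless: expanding each record back over the K folds reproduces the current TRAIN/TEST records exactly.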

Having different split file formats for different task types is already the norm: learning curve tasks introduce the sample column, and other types (e.g. clustering) don't have split files at all. I don't see an issue with changing the split file for this case (other than adapting the openml packages to the changes and having users refresh their caches).

PGijsbers commented 3 years ago

As icing on the cake we should probably also make these Parquet files, of course :) And perhaps consider saving the files after they are generated, so they don't need to be generated on the fly (edit: looks like this is already done). Perhaps the new implementation would be fast enough for this not to matter, but with the current implementation, generating task splits for large datasets takes longer than downloading the dataset itself.