PGijsbers opened 3 years ago
As icing on the cake we should probably also make these parquet files, of course :) And perhaps consider saving the files after they are generated, so they don't need to be generated on the fly (edit: looks like this is already done). Perhaps the new implementation will be fast enough for it not to matter, but with the current implementation, generating the task splits for a large dataset takes longer than even downloading the dataset itself.
Classification and regression tasks support the following estimation procedures: (ordered) holdout, R-repeated K-fold cross-validation, and test on the training data. Currently the split files are organized as follows (ARFF notation):
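The original issue showed the split file layout here; below is a sketch of what the current format looks like, with the attribute names inferred from the columns discussed in this issue (the relation name and exact attribute order are assumptions, not a verbatim copy of an OpenML split file):

```arff
@RELATION openml_task_splits

@ATTRIBUTE type {TRAIN,TEST}
@ATTRIBUTE rowid NUMERIC
@ATTRIBUTE repeat NUMERIC
@ATTRIBUTE fold NUMERIC

@DATA
TRAIN,0,0,0
TRAIN,1,0,0
TEST,2,0,0
```

Note that every dataset row appears once per (repeat, fold) combination, marked either `TRAIN` or `TEST`.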
I think the `type` column is not necessary for the supported evaluation strategies and produces needless duplication (and hence server strain and bandwidth). Looking at the split file, we see that every row in the dataset introduces R*K rows in the split file for R-repeated K-fold cross-validation. If we drop the `type` column and simply indicate which fold each sample belongs to for each repeat, the split data should be a factor K smaller (plus the benefit of not storing the `TRAIN`/`TEST` column data at all).

For holdout tasks it would no longer be obvious which fold is train and which is test. However, we could either adopt a convention here (0 is train, 1 is test) or allow this to be described explicitly in the task description XML. The same applies if we want to preserve the order in which the folds of k-fold CV are evaluated.

Having different split file formats for different task types is already the norm: learning curve tasks introduce the `sample` column, and other task types (e.g. clustering) don't have split files at all. I don't see an issue with changing the split file in this case (other than making the OpenML packages adapt to the changes and having users refresh their cache).
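To illustrate the factor-K saving, here is a minimal sketch that emulates the current long format (one `TRAIN`/`TEST` record per row per repeat per fold) and the proposed compact format (one fold index per row per repeat). The function names and record layout are hypothetical, not the actual OpenML schema:

```python
import itertools
import random


def make_current_splits(n_rows, n_repeats, n_folds, seed=0):
    """Emulate the current long format: one (type, rowid, repeat, fold)
    record for every row in every fold of every repeat."""
    rng = random.Random(seed)
    records = []
    for repeat in range(n_repeats):
        rows = list(range(n_rows))
        rng.shuffle(rows)
        # Assign each row to exactly one test fold for this repeat.
        test_fold_of = {row: i % n_folds for i, row in enumerate(rows)}
        for fold, row in itertools.product(range(n_folds), range(n_rows)):
            split_type = "TEST" if test_fold_of[row] == fold else "TRAIN"
            records.append((split_type, row, repeat, fold))
    return records


def compact(records):
    """Proposed format: per (rowid, repeat) store only the test fold index;
    TRAIN membership is implied (a row trains in every other fold)."""
    return {(row, repeat): fold
            for split_type, row, repeat, fold in records
            if split_type == "TEST"}


current = make_current_splits(n_rows=100, n_repeats=2, n_folds=10)
compacted = compact(current)
print(len(current))    # 100 rows * 2 repeats * 10 folds = 2000 records
print(len(compacted))  # 100 rows * 2 repeats = 200 entries, a factor K smaller
```

The compact mapping is lossless for plain K-fold CV because `TRAIN`/`TEST` is fully determined by the fold index; only holdout (and fold ordering) needs the extra convention or task-XML field described above.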