openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Simplify data splits for classification/regression tasks #1110

Open PGijsbers opened 3 years ago

PGijsbers commented 3 years ago

Classification and Regression tasks feature estimation procedures: (ordered) holdout, R-repeated K-fold cross-validation, and test on training data. Currently the split files are organized as follows (ARFF notation):

@attribute type {TRAIN,TEST}
@attribute rowid numeric
@attribute repeat numeric
@attribute fold numeric
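
For concreteness, with this layout a single repeat of 3-fold CV over three rows would be stored as something like the following (rowids and fold assignments are hypothetical):

```
TRAIN,0,0,0
TRAIN,1,0,0
TEST,2,0,0
TEST,0,0,1
TRAIN,1,0,1
TRAIN,2,0,1
TRAIN,0,0,2
TEST,1,0,2
TRAIN,2,0,2
```

Each row of the dataset appears once per fold, so even a single repeat already has K records per dataset row.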

I think the type column is not necessary for the supported evaluation strategies and produces needless duplication (and hence server strain and bandwidth use). Looking at the split file, we see that every row in the dataset introduces R*K rows in the split file for R-repeated K-fold cross-validation. If we drop the type column and simply indicate which fold each sample belongs to in each repeat, the split file should be a factor of K smaller (plus the benefit of dropping the TRAIN/TEST column data). For holdout tasks it would not be obvious which fold is train and which is test, but we could either adopt a convention here (0 is train, 1 is test) or describe it explicitly in the task description XML. The same applies if we want to preserve the order in which folds in K-fold CV are evaluated.
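A back-of-the-envelope sketch of the two layouts (this is not OpenML code; all function and variable names here are made up) shows the factor-K reduction:

```python
# Compare the current split layout (type, rowid, repeat, fold) with the
# proposed one (rowid, repeat, fold-that-row-is-TEST-in). Illustrative only.
import random

def current_format(n_rows, n_repeats, n_folds, seed=0):
    """Current layout: one (type, rowid, repeat, fold) record per row,
    per fold, per repeat."""
    rng = random.Random(seed)
    records = []
    for r in range(n_repeats):
        fold_of = [i % n_folds for i in range(n_rows)]
        rng.shuffle(fold_of)
        for k in range(n_folds):
            for rowid in range(n_rows):
                kind = "TEST" if fold_of[rowid] == k else "TRAIN"
                records.append((kind, rowid, r, k))
    return records

def proposed_format(n_rows, n_repeats, n_folds, seed=0):
    """Proposed layout: one (rowid, repeat, fold) record per row per repeat,
    storing only the fold in which the row is a test sample; the row is
    implicitly a train sample in every other fold."""
    rng = random.Random(seed)
    records = []
    for r in range(n_repeats):
        fold_of = [i % n_folds for i in range(n_rows)]
        rng.shuffle(fold_of)
        records += [(rowid, r, fold_of[rowid]) for rowid in range(n_rows)]
    return records

old = current_format(n_rows=1000, n_repeats=2, n_folds=10)
new = proposed_format(n_rows=1000, n_repeats=2, n_folds=10)
print(len(old), len(new))  # 20000 2000: a factor K (= 10) fewer records
```

The proposed layout is lossless: expanding each record back over the K folds reproduces the current TRAIN/TEST records exactly.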

Having different split file formats for different task types is already the norm: learning curve tasks introduce the sample column, and other types (e.g. clustering) don't have split files at all. I don't see an issue with changing the split file for this case (other than adapting the openml packages to the changes and having users refresh their caches).

PGijsbers commented 3 years ago

As icing on the cake we should probably also make these Parquet files, of course :) And perhaps consider saving the files after they are generated, so they don't need to be generated on the fly (edit: looks like this is already done). Perhaps the new implementation would be fast enough for this not to matter, but with the current implementation, generating task splits for large datasets takes longer than downloading the dataset itself.