Adjust validation type based on dataset

pplonski commented 3 years ago

Adjust cross-validation type based on the dataset

pplonski commented 3 years ago

We adjust the validation type based on the number of cells in the data.

cells = rows * cols

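Sketch of the rule in Python (the function name and return labels are illustrative, not mljar-supervised identifiers):

def choose_validation(rows, cols):
    # Pick the validation scheme from the dataset size, measured in cells.
    cells = rows * cols
    if cells > 100e6:
        return "split"    # very large data: single train/test split
    elif cells > 50e6:
        return "5-fold"   # large data: 5-fold cross-validation
    else:
        return "10-fold"  # everything else: 10-fold cross-validation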

pplonski commented 3 years ago

I've changed the approach for setting the validation. It is now based on the training time of the Decision Tree algorithm on a 0.9/0.1 train/test split of the data. If mode=Compete, we first train a Decision Tree. We then assume that each of the other models will take about 5x the Decision Tree training time, and that we would like to have at least 10 models. From total_train_limit and these assumptions we compute a rough number of folds. If 5 < folds < 15 we use 5-fold CV; if folds > 15 we use 10-fold CV. Otherwise, we continue with the 0.9/0.1 train/test split.
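
For readers who want to see the time-based rule spelled out, here is a minimal Python sketch. The comment above does not give the exact formula for the rough number of folds, so the one used here (the time budget divided by the estimated cost of training 10 models once) is an assumption. The function name, parameter names, and return labels are hypothetical and are not taken from the mljar-supervised code; see the linked commit for the actual implementation.

```python
def pick_validation_by_time(dt_train_time, total_train_limit,
                            model_time_factor=5.0, min_models=10):
    """Choose a validation scheme from the Decision Tree training time.

    dt_train_time     -- seconds to train a Decision Tree on a 0.9/0.1 split
    total_train_limit -- total time budget in seconds
    model_time_factor -- other models are assumed to take ~5x the tree time
    min_models        -- we would like to train at least this many models
    """
    if dt_train_time <= 0:
        raise ValueError("dt_train_time must be positive")

    # ASSUMPTION: the thread does not state this formula. We estimate how
    # many times the whole set of models could be trained within the budget
    # and treat that as the rough number of folds we can afford.
    time_per_model = model_time_factor * dt_train_time
    folds = total_train_limit / (min_models * time_per_model)

    # Thresholds as stated in the comment above.
    if 5 < folds < 15:
        return "5-fold"
    if folds > 15:
        return "10-fold"
    # folds <= 5 (and exactly 15, read literally) fall through to the split.
    return "split"


if __name__ == "__main__":
    # One-hour budget, Decision Tree trained in 6 seconds.
    print(pick_validation_by_time(dt_train_time=6, total_train_limit=3600))
```

Under this assumed formula, a slow Decision Tree relative to the time budget keeps the cheap 0.9/0.1 split, while a fast one leaves room for 5-fold or 10-fold CV.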

pplonski commented 3 years ago

https://github.com/mljar/mljar-supervised/commit/f7105cdf057ecb0ba68814b3a16a4f25a24ad876

mljar / mljar-supervised

Adjust validation type based on dataset #249