mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Adjust validation type based on dataset #249

Closed pplonski closed 3 years ago

pplonski commented 3 years ago

Adjust cross-validation type based on the dataset

pplonski commented 3 years ago

We adjust the validation type based on the number of cells in the data:

cells = rows * cols

Pseudocode to adjust validation:

if cells > 100e6:
  validation with a train/test split
elif cells > 50e6:
  validation with 5-fold CV
else:
  validation with 10-fold CV
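
The pseudocode above could be written as a small Python function. This is only a sketch: the function name `choose_validation_by_cells` and the shape of the returned dictionary are made up for illustration and are not taken from the package code.

```python
def choose_validation_by_cells(rows: int, cols: int) -> dict:
    """Pick a validation scheme from the dataset size in cells.

    Hypothetical helper: the name and the returned dict layout are
    illustrative assumptions, not the package's actual API.
    """
    cells = rows * cols
    if cells > 100e6:
        # More than 100 million cells: single train/test split.
        return {"validation_type": "split"}
    if cells > 50e6:
        # Between 50 and 100 million cells: 5-fold cross-validation.
        return {"validation_type": "kfold", "k_folds": 5}
    # 50 million cells or fewer: 10-fold cross-validation.
    return {"validation_type": "kfold", "k_folds": 10}
```

For example, a dataset with 1,000,000 rows and 60 columns has 60e6 cells, so this sketch selects 5-fold CV.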

pplonski commented 3 years ago

I've changed the approach to setting the validation. It is now based on the training time of the Decision Tree algorithm on a 0.9/0.1 train/test split of the data. If mode=Compete, we first train a Decision Tree. We then assume that each of the other models will take about 5x the Decision Tree time to train, and that we would like to have at least 10 models. From total_train_limit and these assumptions we compute a rough number of folds. If 5 < folds < 15 we use 5-fold CV; if folds > 15 we use 10-fold CV. Otherwise, we continue with the 0.9/0.1 train/test split.
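
A rough sketch of this time-based heuristic is below. It is not the code from the commit. The function name, the parameter names other than `total_train_limit`, and the returned dictionary are hypothetical. The formula for the fold estimate (time budget divided by the cost of 10 models at 5x the Decision Tree time each) is my reading of the description, and treating exactly 15 folds as 10-fold CV is also an assumption, since the comment does not cover that boundary.

```python
def choose_validation_by_time(
    tree_train_time: float,
    total_train_limit: float,
    model_time_factor: float = 5.0,
    min_models: int = 10,
) -> dict:
    """Pick a validation scheme from the measured Decision Tree training time.

    Hypothetical helper; both time arguments are in seconds.
    tree_train_time is the time to fit a Decision Tree on a 0.9/0.1 split.
    """
    if tree_train_time <= 0:
        raise ValueError("tree_train_time must be positive")

    # Assumed cost of one typical model on a single fold (about 5x the tree).
    model_time = tree_train_time * model_time_factor

    # Assumed estimate: how many folds fit into the budget for min_models models.
    folds = total_train_limit / (model_time * min_models)

    if folds >= 15:  # boundary at exactly 15 is an assumption
        return {"validation_type": "kfold", "k_folds": 10}
    if folds > 5:
        return {"validation_type": "kfold", "k_folds": 5}
    # Not enough time for cross-validation: keep the 0.9/0.1 split.
    return {"validation_type": "split", "train_ratio": 0.9}
```

With a one hour limit (3600 seconds), a Decision Tree that trains in 10 seconds gives an estimate of 3600 / (10 * 5 * 10) = 7.2 folds, so this sketch selects 5-fold CV.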

pplonski commented 3 years ago

https://github.com/mljar/mljar-supervised/commit/f7105cdf057ecb0ba68814b3a16a4f25a24ad876