mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Minor class in multi-class classification with too few samples to stratify #215

Closed by pplonski 3 years ago

pplonski commented 3 years ago

When a class has too few samples to perform stratification, an error is thrown:

The least populated class in y has only 2 members, which is less than n_splits=5.

Maybe we can detect such situations and upsample the minor classes? This is certainly related to https://github.com/mljar/mljar-supervised/issues/157. However, this issue needs a quick fix, whereas #157 requires a broader treatment of unbalanced datasets.
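A minimal sketch of the detection step, assuming the labels are available as a plain Python sequence. The helper name `classes_too_small_to_stratify` and its `n_splits` parameter are hypothetical and not part of mljar-supervised; the sketch only counts samples per class and reports the classes that have fewer samples than the number of folds.

```python
from collections import Counter


def classes_too_small_to_stratify(y, n_splits=5):
    """Return {label: count} for every class with fewer than n_splits samples.

    Hypothetical helper for illustration only; not the mljar-supervised API.
    """
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n < n_splits}


# Example: class "c" has only 2 members, fewer than n_splits=5.
y = ["a"] * 10 + ["b"] * 8 + ["c"] * 2
too_small = classes_too_small_to_stratify(y, n_splits=5)
print(too_small)
```

An empty result means every class has at least `n_splits` samples, so stratified splitting can proceed as usual.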

shahules786 commented 3 years ago

Should AutoML upsample without involving the user? In any case, the model isn't going to learn anything for that class. Isn't it better to warn the user that stratification is not possible? @pplonski
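The warning-based alternative could look roughly like this. It is a sketch under assumptions: the function name `warn_if_not_stratifiable` and the message wording are made up for illustration, and it uses only the standard `warnings` module rather than anything from mljar-supervised.

```python
import warnings
from collections import Counter


def warn_if_not_stratifiable(y, n_splits=5):
    """Warn and return False if the rarest class has fewer than n_splits samples.

    Hypothetical helper for illustration only; not the mljar-supervised API.
    """
    counts = Counter(y)
    if not counts:
        return True  # nothing to check for an empty label list
    # Find the least populated class and its size.
    label, n = min(counts.items(), key=lambda item: item[1])
    if n < n_splits:
        warnings.warn(
            f"Class {label!r} has only {n} samples, which is less than "
            f"n_splits={n_splits}; stratified splitting is not possible.",
            UserWarning,
        )
        return False
    return True
```

With this approach the caller decides what to do next, for example falling back to a non-stratified split or dropping the rare class.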

pplonski commented 3 years ago

I've added a function _handle_drastic_imbalance() that ensures there are always at least 20 samples per class (or k_folds, if set).
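The body of _handle_drastic_imbalance() is not shown in this thread, so the following is only a sketch of one way such a guarantee could be met: randomly duplicating rows of under-represented classes until each reaches a minimum count. The function name `upsample_minor_classes`, its parameters, and the choice of sampling with replacement are all assumptions, not the library's actual implementation.

```python
import random
from collections import defaultdict


def upsample_minor_classes(X, y, min_samples=20, seed=0):
    """Duplicate random rows so every class has at least min_samples rows.

    Hypothetical sketch; the real _handle_drastic_imbalance() may differ.
    X and y are plain sequences of equal length.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    indices_by_class = defaultdict(list)
    for i, label in enumerate(y):
        indices_by_class[label].append(i)

    # For each class below the threshold, draw the missing rows with replacement.
    extra = []
    for label, idx in indices_by_class.items():
        missing = min_samples - len(idx)
        if missing > 0:
            extra.extend(rng.choices(idx, k=missing))

    # Original rows come first, duplicated rows are appended at the end.
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


# Example: class "b" has 2 rows and is padded up to 20.
X = list(range(32))
y = ["a"] * 30 + ["b"] * 2
X_up, y_up = upsample_minor_classes(X, y, min_samples=20)
```

Duplicated rows carry no new information, so this only makes stratified splitting possible; it does not help the model learn the rare class, which is the concern raised earlier in the thread.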