scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[DOC] User warning over sampling methods #1101

Open lcrmorin opened 1 month ago

lcrmorin commented 1 month ago

Describe the issue linked to the documentation

There is some ongoing discussion about the usefulness of some (if not all) of the over-/under-sampling methods implemented in the imbalanced-learn package.

Typically there is some doubt about the usefulness of SMOTE:

Basically it seems that:

I think it is a problem that these discussions are not more visible to newcomers (and that more experienced people have to deal with this on a weekly basis).

Suggest a potential alternative/fix

It would be nice to have

1) a clearer demonstration in the doc, because at the moment only the usage is described:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build an imbalanced binary dataset (roughly a 1:9 class ratio).
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           flip_y=0, n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))

# Oversample the minority class with SMOTE.
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

It shows that it oversampled, but not whether it actually helps, either in terms of ranking (ROC AUC) or probability calibration (ECE / calibration curves).

Could the doc be upgraded with a better example, something along the lines of the sketch below?
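For illustration, here is a rough sketch of the kind of comparison I have in mind (logistic regression and the Brier score are arbitrary choices; any classifier and calibration metric would do). Resampling is done inside an imblearn Pipeline so that it only touches the training folds:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           flip_y=0, n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)

# Same model with and without SMOTE; the pipeline resamples training folds only.
baseline = LogisticRegression(max_iter=1000)
smoted = Pipeline([('smote', SMOTE(random_state=42)),
                   ('clf', LogisticRegression(max_iter=1000))])

# ROC AUC measures ranking; the Brier score is a proxy for calibration quality.
for name, model in [('baseline', baseline), ('with SMOTE', smoted)]:
    cv = cross_validate(model, X, y, cv=5,
                        scoring=['roc_auc', 'neg_brier_score'])
    print('%-11s AUC: %.3f  Brier: %.3f' % (
        name, cv['test_roc_auc'].mean(), -cv['test_neg_brier_score'].mean()))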

2) a visible user warning regarding the discussions on the usefulness of these methods.

While (at least one of) the authors has changed his mind about the usefulness of these methods, a younger crowd still seems very eager to jump on these shiny methods. I think it would help the DS community to take a clearer stance.

I would suggest at least a very visible warning in the doc, like a red banner ('There is some discussion about the usefulness of these methods. See: XXX. Use with caution.').

This could be expanded with a UserWarning... maybe a bit brutal, but it could prevent a lot of trouble.
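To make that concrete, a hypothetical sketch of such a warning (not imbalanced-learn's actual code; the message and placement are purely illustrative):

import warnings

# Hypothetical: a warning the samplers could emit when they are used.
warnings.warn(
    "There is ongoing discussion about whether over-sampling methods such as "
    "SMOTE improve ranking (AUC) or probability calibration. Use with caution.",
    UserWarning,
    stacklevel=2,
)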

Edit: not sure why it added the 'good first issue' label automatically... but I'll take it.

glemaitre commented 1 month ago

Basically, we are also working on this topic in scikit-learn. As a milestone, we want to have an example that shows the effect of sample_weight and class_weight in scikit-learn, and then I would like to revamp the documentation of imbalanced-learn.
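For reference, a minimal sketch of the two native scikit-learn mechanisms in question (the planned example will be more thorough; logistic regression is just a placeholder here):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=0)

# Option 1: reweight classes at construction time.
clf_cw = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Option 2: pass equivalent per-sample weights at fit time.
sw = compute_sample_weight(class_weight='balanced', y=y)
clf_sw = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sw)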

lcrmorin commented 1 month ago

Thanks for the answer. Implementation and documentation within sklearn seem to be the way to go in the long run. Maybe in the short term this ongoing work should be documented a bit more visibly... a lot of newcomers are still pushing SMOTE and the like.