scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[SO] SMOTEENN generates imbalanced dataset #1063

Closed · jsgounot closed this issue 8 months ago

jsgounot commented 9 months ago

Hi everyone,

I'm fairly new to the machine learning field, so my apologies if the question seems very simple. I'm trying to do some classification on several datasets, some of which are not well separated. I observed that the SMOTEENN output can sometimes result in an even more imbalanced dataset. A small example:

import pandas as pd
import numpy as np

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=10, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.2, 0.8],
                           class_sep=0.8, random_state=0)

y = pd.Series(y)
y.value_counts()

# Result
# 1    796
# 0    204

You can already see with this dataset that SMOTEENN does not return a balanced output:

from imblearn.combine import SMOTEENN

se = SMOTEENN(random_state=0)
X_se, y_se = se.fit_resample(X, y)
y_se.value_counts()

# Result
# 1    745
# 0    559

When running the two components separately, we can see where the issue comes from:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

s = SMOTE(random_state=0)
X_s, y_s = s.fit_resample(X, y)

# SMOTEENN applies ENN with sampling_strategy="all" after SMOTE
enn = EditedNearestNeighbours(sampling_strategy="all")
X_enn, y_enn = enn.fit_resample(X_s, y_s)

y_s.value_counts()
# 1    796
# 0    796

y_enn.value_counts()
# 1    745
# 0    559

So clearly, the ENN step undersampled far more samples from one class than from the other, since it treats both classes equally after the SMOTE step. I assume the reason is that there is less variation within the over-sampled class than within the dominant one. On some real datasets I have even observed the complete disappearance of one of the classes. While I think I understand the reason behind it, I wonder whether this is an issue some users might not be aware of, since the behavior is completely silent in most cases when used as a pipeline. Is this the intended behavior? Thanks!
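
To make the mechanism concrete, here is a minimal sketch of the editing rule I believe ENN applies (a hypothetical enn_keep_mask helper, not the imblearn implementation): a sample is kept only if all of its k nearest neighbours share its label, so samples of both classes are removed wherever the classes overlap, and the class with more points sitting in the overlap loses more.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_keep_mask(X, y, n_neighbors=3):
    # Each sample's nearest neighbours; the first neighbour is the sample itself.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]  # drop the self-neighbour
    # Keep a sample only when every neighbour shares its label (the strict
    # selection mirroring EditedNearestNeighbours' kind_sel="all" default).
    return (neighbour_labels == y[:, None]).all(axis=1)

mask = enn_keep_mask(X_s, np.asarray(y_s))
print(pd.Series(np.asarray(y_s)[mask]).value_counts())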

chkoar commented 9 months ago

Is this the intended behavior?

Yes, it is. Please note that the intention is not to balance the dataset but to improve the classification performance.
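
For instance (a minimal sketch, with the classifier and scorer chosen only as placeholders), the usual way to judge the combination is by cross-validated classification performance rather than by the class counts it produces:

from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Resampling happens only on the training folds inside the pipeline, and the
# score reflects classification quality, not how balanced the resampled data is.
pipe = make_pipeline(SMOTEENN(random_state=0), LogisticRegression())
scores = cross_val_score(pipe, X, y, scoring="balanced_accuracy", cv=5)
print(scores.mean())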

I am referencing the original paper where the combination of SMOTE + Tomek links and SMOTE + ENN is proposed. Note that the quotation says TL, but in the next paragraph the authors explain why to use ENN instead of TL, so the same reasoning clearly carries over.

Thus, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed.

Having said that, you can pass your own configured ENN with a custom heuristic:

from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN

custom_enn = ...  # your own EditedNearestNeighbours configuration
smote_enn = SMOTEENN(enn=custom_enn)
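
As one example among many (just a sketch, not a recommendation), a less aggressive ENN that drops a sample only when the majority of its neighbours disagree with it, instead of the default unanimous criterion:

from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN

# kind_sel="mode" removes a sample only if most of its neighbours belong to
# another class; sampling_strategy="all" keeps SMOTEENN's default behaviour
# of cleaning both classes.
custom_enn = EditedNearestNeighbours(sampling_strategy="all", kind_sel="mode")
smote_enn = SMOTEENN(random_state=0, enn=custom_enn)
X_res, y_res = smote_enn.fit_resample(X, y)
print(pd.Series(y_res).value_counts())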

jsgounot commented 8 months ago

OK, thanks for taking the time to answer!