scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[BUG] NearMiss version 3 does not work well with sampling_strategy=dictionary #836

Closed miguelBra closed 1 year ago

miguelBra commented 3 years ago

Describe the bug

Undersampling with NearMiss version 3 does not work well with sampling_strategy=dictionary.

A possible explanation is that the first step of the algorithm already undersamples heavily, so the number of observations left for the second step is already lower than the number specified in the dictionary. As a consequence, the algorithm only seems to work if the desired number of samples is very low compared to the number of existing samples. The code examples below show that, for a class of 357 samples, NearMiss3 does not work if the desired number of samples is 300, but it does work if the desired number of samples is 50.
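To illustrate the explanation above, here is a minimal pure-Python sketch of how a first step of this kind can shrink the candidate pool. It assumes that step 1 keeps only those majority samples that are among the `m` nearest neighbours of at least one minority sample. The function `step1_candidates` and the toy data are hypothetical and are not taken from imbalanced-learn.

```python
from math import dist


def step1_candidates(majority, minority, m):
    """Return indices of majority points that are among the m nearest
    majority neighbours of at least one minority point (assumed step 1)."""
    kept = set()
    for p in minority:
        # Rank all majority points by their distance to this minority point.
        ranked = sorted(range(len(majority)), key=lambda i: dist(p, majority[i]))
        kept.update(ranked[:m])
    return sorted(kept)


# Toy data, made up for illustration: 10 majority points, 2 minority points.
majority = [(float(i), 0.0) for i in range(10)]
minority = [(0.2, 1.0), (1.1, 1.0)]

pool = step1_candidates(majority, minority, m=3)
requested = 8  # hypothetical target count for the majority class

# The pool can never exceed m * len(minority), and neighbour sets often overlap.
can_satisfy = requested <= len(pool)
print(f"{len(pool)} candidates survive step 1; requested {requested}")
```

In this toy case both minority points share the same three nearest majority neighbours, so only 3 of the 10 majority samples reach the second step and a request for 8 cannot be met. If the library behaves this way, it would account for a small target such as 50 succeeding while 300 fails.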

I don't think this is desirable behaviour, especially since the documentation presents all 3 versions of NearMiss as methods that let you specify the number of samples to keep in each class. In any case, it would at least be good to explain this in the documentation, to save time for people who run into this problem (I lost several hours trying to figure out what was happening).

Steps/Code to Reproduce

Example 1. Undersampling to 300 observations (this doesn't work):


import numpy as np  # needed for np.unique below
import pandas as pd
from sklearn.datasets import load_breast_cancer
from imblearn.under_sampling import NearMiss

data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)

# class 1 has clearly more than 300 observations
np.unique(data.target, return_counts = True)

X_smt, y_smt = NearMiss(version=3, sampling_strategy={1: 300}).fit_resample(X, data.target)

Example 2. Undersampling to 50 observations (this works well):


import numpy as np  # needed for np.unique below
import pandas as pd
from sklearn.datasets import load_breast_cancer
from imblearn.under_sampling import NearMiss

data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)

X_smt, y_smt = NearMiss(version=3, sampling_strategy={1: 50}).fit_resample(X, data.target)
np.unique(y_smt, return_counts = True) # it worked

Expected Results

In the first example, the resulting dataset (X_smt, y_smt) should have 300 samples for class 1. In the second example, class 1 should have 50 samples.

Actual Results

The code in Example 1 raises: "UserWarning: The number of the samples to be selected is larger than the number of samples available. The balancing ratio cannot be ensure and all samples will be returned."

The code in Example 2 works well.

Versions

Linux-5.10.15-200.fc33.x86_64-x86_64-with-glibc2.2.5
Python 3.8.6 (default, Nov 10 2011, 15:00:00) [GCC 10.2.0]
NumPy 1.19.5
SciPy 1.6.1
Scikit-Learn 0.24.1
Imbalanced-Learn 0.8.0

glemaitre commented 1 year ago

This is expected, but we should document it. Anyway, we are going to deprecate it because this method is actually not NearMiss 3.

glemaitre commented 1 year ago

See #980