scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[BUG] RandomUnderSampler does not keep original order #1043

Closed rgtzths closed 1 year ago

rgtzths commented 1 year ago

Describe the bug

RandomUnderSampler does not preserve the original order of the samples. This is a problem when the resampled data should stay as close to the original order as possible.

Steps/Code to Reproduce

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

seed = 42
x = np.array([[1,1,1], [2,2,2], [3,3,3], [4,4,4]])
y = np.array([1, 0, 0, 0])
rus = RandomUnderSampler(random_state=seed, sampling_strategy=0.5)

rus.fit_resample(x, y)

Expected Results

(array([[1, 1, 1], [2, 2, 2], [3, 3, 3]]), array([1, 0, 0]))

Actual Results

(array([[2, 2, 2], [3, 3, 3], [1, 1, 1]]), array([0, 0, 1]))

Versions

numpy=1.23.5 imblearn=0.0

glemaitre commented 1 year ago

I would not consider this a bug: you keep the index if you pass a pandas DataFrame, and the same information is available via the sample_indices_ attribute. Sorting would be an extra, costly step that is not useful for everyone.

So I would let the user sort once the sampling is done.

rgtzths commented 1 year ago

It would not be that costly, since it can be solved by sorting the selected indices before sampling instead of sorting the sampled dataset (which is more expensive).

The idea was to sort the indices before sampling, which could be done by replacing the current assignment self.sample_indices_ = idx_under with self.sample_indices_ = sorted(idx_under).

However, if you still consider it too expensive to perform, I will sort it after the sampling is performed.
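The sort-before-indexing idea from this comment can be illustrated with a small NumPy-only sketch. This is a simplified stand-in for binary undersampling, not imbalanced-learn's implementation: the helper undersample_keep_order and its parameters majority_label and n_majority are hypothetical.

```python
import numpy as np


def undersample_keep_order(x, y, majority_label, n_majority, seed=None):
    """Hypothetical helper: random undersampling that preserves row order."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    # Draw the majority rows to keep, without replacement.
    kept_majority = rng.choice(majority_idx, size=n_majority, replace=False)
    # Sort the selected positions before indexing so rows keep their input order.
    idx = np.sort(np.concatenate((other_idx, kept_majority)))
    return x[idx], y[idx], idx


x = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
y = np.array([1, 0, 0, 0])
x_res, y_res, idx = undersample_keep_order(x, y, majority_label=0, n_majority=2, seed=42)
```

Sorting k selected indices costs O(k log k), which is the overhead being debated in this thread.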