scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[ENH] Keep the index of the samples after undersampling #724

Closed qiiiibeau closed 1 year ago

qiiiibeau commented 4 years ago

Hello, I'm undersampling imbalanced data in which each sample has a unique name as its index. I don't want to lose the samples' index after undersampling because I'm working on a graph-based task where each sample represents a node, and I need to know where it is located in the graph.

To illustrate, my data is a dataframe that looks like this:

        feat_1  feat_2  feat_3  label
Thomas  0.5     2.2     3.0     1
Kelly   0.63    1.5     1.4     0
Peter   0.9     1.1     3.4     1
George  0.2     2.1     4       1
...     ...     ...     ...     ...

The undersampling methods in the current version of imblearn, e.g. RandomUnderSampler().fit_resample(), return a dataframe whose index is reset to integers running from 0 to the number of selected samples, such as:

   feat_1  feat_2  feat_3  label
0  0.5     2.2     3.0     1
1  0.2     2.1     4       1

where all the original index labels are lost. I need it to look like this:

        feat_1  feat_2  feat_3  label
Thomas  0.5     2.2     3.0     1
George  0.2     2.1     4       1

This improvement would help a lot for graph-based imbalanced learning, and possibly in other cases too.

Thank you.

glemaitre commented 3 years ago

We might want to add support for this feature for samplers that expose a fitted attribute sample_indices_ after fit. For the other samplers, the index is meaningless. However, this makes the behaviour differ from one sampler to another, whereas a user can easily reassign the index themselves, which would be less surprising:

df_res, y_res = sampler.fit_resample(df, y)
df_res.index = df.index[sampler.sample_indices_]

@chkoar do you have any thoughts on this?

glemaitre commented 1 year ago

This feature was added for the RandomUnderSampler and RandomOverSampler.