[ENH] Keep the index of the samples after undersampling

qiiiibeau commented 4 years ago

Hello, I'm undersampling some imbalanced data in which each sample has a unique name as its index. I don't want to lose the samples' index after undersampling because I'm working on a graph-based task where each sample represents a node, and I need to know where it is located in the graph.

To illustrate, my data is a dataframe that looks like:

	feat_1	feat_2	feat_3	label
Thomas	0.5	2.2	3.0	1
Kelly	0.63	1.5	1.4	0
Peter	0.9	1.1	3.4	1
George	0.2	2.1	4	1
...	...	...	...	...

The undersampling methods in the current version of imblearn, e.g. `RandomUnderSampler().fit_resample()`, return a dataframe whose index is reset to the range [0, number of selected samples), such as:

	feat_1	feat_2	feat_3	label
0	0.5	2.2	3.0	1
1	0.2	2.1	4	1

where the original index is lost. I need the result to look like this:

	feat_1	feat_2	feat_3	label
Thomas	0.5	2.2	3.0	1
George	0.2	2.1	4	1
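The requested behaviour can be sketched with plain pandas: if the undersampler yields row positions, selecting with `iloc` keeps the original labels. This is only an illustration under stated assumptions, not imblearn's implementation; `undersample_positions` is a hypothetical helper and the toy data mirrors the table above.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the example in this issue; the names form the index.
df = pd.DataFrame(
    {
        "feat_1": [0.5, 0.63, 0.9, 0.2],
        "feat_2": [2.2, 1.5, 1.1, 2.1],
        "feat_3": [3.0, 1.4, 3.4, 4.0],
        "label": [1, 0, 1, 1],
    },
    index=["Thomas", "Kelly", "Peter", "George"],
)


def undersample_positions(y, random_state=0):
    """Hypothetical helper: sorted row positions of a class-balanced subsample."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # every class is cut down to the minority size
    kept = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))


positions = undersample_positions(df["label"])
df_res = df.iloc[positions]  # positional selection keeps the row labels
print(df_res)
```

Because the rows are selected by position rather than rebuilt from arrays, each kept row still carries its name, so it can be mapped back to its node in the graph.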

This improvement would help a lot with graph-based imbalanced learning, and probably in other cases too.

Thank you.

glemaitre commented 3 years ago

We might want to add support for this feature for samplers that expose a fitted attribute `sample_indices_` after `fit`. For the other samplers, the index is meaningless. However, this would make the behaviour differ from one sampler to another, whereas a user can easily reassign the index themselves, which would be less surprising:

df_res, y_res = sampler.fit_resample(df, y)
df_res.index = df.index[sampler.sample_indices_]

@chkoar do you have any thoughts on this?

glemaitre commented 1 year ago

This feature was added for `RandomUnderSampler` and `RandomOverSampler`.

scikit-learn-contrib / imbalanced-learn

[ENH] Keep the index of the samples after undersampling #724