scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License
6.84k stars 1.28k forks

return index of resampled dataset #376

Closed shuuchen closed 6 years ago

shuuchen commented 6 years ago

Hi there, I am new here and am trying the following example:

from collections import Counter

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

from imblearn.datasets import make_imbalance
from imblearn.under_sampling import NearMiss
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced

print(__doc__)

RANDOM_STATE = 42

# Load the Iris dataset and make it imbalanced
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target, ratio={0: 25, 1: 50, 2: 50},
                      random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=RANDOM_STATE)

print('Training target statistics: {}'.format(Counter(y_train)))
print('Testing target statistics: {}'.format(Counter(y_test)))

# Create a pipeline
pipeline = make_pipeline(NearMiss(version=2, random_state=RANDOM_STATE),
                         LinearSVC(random_state=RANDOM_STATE))
pipeline.fit(X_train, y_train)

# Classify and report the results
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

Since make_imbalance returns the resampled X and y, what about their indices? Can the indices be returned as well?

Thanks !

glemaitre commented 6 years ago

Why would it be interesting to know those?

glemaitre commented 6 years ago

What I mean is that this utility function exists to create a synthetically imbalanced dataset, so there is no real interest in knowing which samples were dropped, is there?

shuuchen commented 6 years ago

Thanks for the reply. It is very important to know which data points were resampled and which were not, especially in real-life data analysis projects where the index carries meaning. For example, in time series data, the time when the data is generated is as important as the data itself. Keeping the index helps us understand the data better and develop more efficient algorithms, especially when working with Pandas.

glemaitre commented 6 years ago

I am still not convinced. When you want to understand your algorithm or make it more efficient, you are interested in the under-sampling or over-sampling methods themselves. For the under-samplers, we provide an option that returns the indices whenever this is possible.

make_imbalance is a toy utility to create an imbalanced dataset, so there is nothing interesting in knowing how or why the samples are generated. As it stands, this function is just random sampling anyway.

> For example, in time series data, the time when the data is generated is as important as the data itself.

So you probably don't want to use make_imbalance to drive your under-sampling before applying a sampler. Random under-sampling will have no meaning in this case.

shuuchen commented 6 years ago

Yes, you are right! We are concentrating on the under-sampled datasets, but we also need the indices for further analysis and other uses. If the index information were kept, it would be much more convenient for many use cases.

> For the under-samplers, we provide an option that returns the indices whenever this is possible.

So how do I specify that option?

> make_imbalance is a toy utility to create an imbalanced dataset, so there is nothing interesting in knowing how or why the samples are generated. As it stands, this function is just random sampling anyway.

I mean, the problem is not whether make_imbalance does random sampling; the problem is whether it can return the indices. This applies not only to make_imbalance but also to the other, non-random sampling methods. Sorry if random sampling was a poor example; I just used it to illustrate.

I would appreciate it very much if you could add index information to the return values of the resampling functions.

glemaitre commented 6 years ago
Can you check the under-sampling methods, and more precisely the parameter return_indices? It defaults to False and can be set to True, in which case the associated indices are returned, as you need, for most of the under-samplers.
shuuchen commented 6 years ago

Thank you very much! I will check it.

glemaitre commented 6 years ago

@shuuchen Can we close this issue?

shuuchen commented 6 years ago

OK! I have got the indices. Thanks very much!

glemaitre commented 6 years ago

Good! Closing then.

markdregan commented 2 years ago

An argument for returning the index for over-sampling:

I want to pass sample_weight to my classifier. It is not a feature of the dataset (X_train), nor part of y_train. Because SMOTENC does not return the index, I can't easily get the sample_weight column from the original dataset.

A workaround would be to add it as a feature of X_train and then extract it afterwards.