Closed: shuuchen closed this issue 6 years ago
Why would it be interesting to know those?
What I mean is that this utility function exists to create a synthetically imbalanced dataset. So there is no real interest in knowing which samples are missing, is there?
Thanks for the reply. It is very important to know which data are resampled and which are not, especially in real-life data analysis projects where the index carries meaning. For example, in time series data, the time when the data is generated is as important as the data itself. Keeping the index helps us understand the data better and develop more efficient algorithms, especially when working with Pandas.
I am still not convinced. When you want to understand your algorithms or make them more efficient, you are interested in the under-sampler or over-sampler methods. In the case of the under-samplers, we provide an option that returns the indices whenever this is possible.
make_imbalance is a toy utility to create an imbalanced dataset, so there is nothing interesting to learn about how or why the samples are generated. For now this function is just random sampling anyway.
> For example, in time series data, the time when the data is generated is as important as the data itself.
So you probably don't want to use make_imbalance if you want to control the under-sampling yourself before applying a sampler. Random under-sampling has no meaning in that case.
Yes, you are right! We concentrate on the under-sampled data sets. But we also need the index for further analysis and other uses. So if the index information is kept, it will be much more convenient for many use cases.
> In the case of the under-samplers, we provide an option that returns the indices whenever this is possible.
So how do I specify the option?
> make_imbalance is a toy utility to create an imbalanced dataset, so there is nothing interesting to learn about how or why the samples are generated. For now this function is just random sampling anyway.
I mean, the problem is not whether make_imbalance is random sampling; the problem is whether it can return the index. This applies not only to make_imbalance but also to other, non-random sampling methods. I am sorry if random sampling doesn't make sense here. I was just using it as an example.
I would appreciate it very much if you could add index information to the return value of the resampling functions.
Can you check the under-sampling methods, and more precisely the parameter return_indices, which is False by default and can be set to True? For most of the under-samplers it returns the indices of the selected samples, which is what you need.
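To make concrete what "returning the indices" buys you, here is a minimal NumPy sketch. It is not imbalanced-learn code: random_under_sample is a hypothetical helper written only for illustration, and the toy data and timestamps are made up. It mimics the idea behind return_indices=True, namely handing back the positions of the rows that were kept, so they can be mapped to any side information such as a time index. The exact way to obtain the indices depends on the installed imbalanced-learn version (later releases are understood to expose them as a fitted sample_indices_ attribute rather than a return_indices parameter), so check the documentation for your version.

```python
import numpy as np


def random_under_sample(X, y, seed=0):
    """Hypothetical helper (not imbalanced-learn API): randomly under-sample
    every class down to the minority-class size and return the kept row
    positions alongside the resampled arrays."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = []
    for label in classes:
        positions = np.flatnonzero(y == label)
        kept.append(rng.choice(positions, size=n_keep, replace=False))
    # Sort so the kept rows stay in their original (e.g. chronological) order.
    indices = np.sort(np.concatenate(kept))
    return X[indices], y[indices], indices


# Toy data: 10 rows, 7 of class 0 and 3 of class 1, one timestamp per row.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
timestamps = np.arange("2018-01-01", "2018-01-11", dtype="datetime64[D]")

X_res, y_res, idx = random_under_sample(X, y)
kept_timestamps = timestamps[idx]                # side information for kept rows
dropped = np.setdiff1d(np.arange(len(y)), idx)   # positions that were removed
```

With a pandas DataFrame the same positions can be used as df.iloc[idx] to recover the original index labels of the retained rows.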
Thank you very much! I will check it.
@shuuchen Can we close this issue?
OK! I have got the indices. Thanks very much!
Good! Closing then.
An argument for returning the index for over-sampling.
I want to pass sample_weight to my classifier. This is neither a feature of the dataset (X_train) nor part of y_train. Because SMOTENC does not return the index, I can't easily get the sample_weight column from the original dataset.
A workaround would be to add it as a feature of X_train and then extract it afterwards.
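The workaround above can be sketched as follows. This is only an illustration under stated assumptions: random_over_sample is a hypothetical stand-in for any over-sampler that returns just (X, y), and the arrays are made-up toy data. Because this stand-in only duplicates existing rows, the carried weights come back unchanged. With a sampler such as SMOTENC that synthesizes new rows, the temporary column would be treated as an ordinary feature, so it would influence the neighbor search and the synthetic rows would receive interpolated weights.

```python
import numpy as np


def random_over_sample(X, y, seed=0):
    """Hypothetical stand-in for an over-sampler that returns only (X, y):
    duplicates random rows of each class until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    picked = []
    for label, count in zip(classes, counts):
        positions = np.flatnonzero(y == label)
        # Draw the missing rows with replacement (zero draws for the majority).
        extra = rng.choice(positions, size=n_target - count, replace=True)
        picked.append(np.concatenate([positions, extra]))
    order = np.concatenate(picked)
    return X[order], y[order]


X_train = np.array([[0.1, 1.0], [0.2, 1.1], [0.3, 0.9], [0.4, 1.2], [5.0, 9.0]])
y_train = np.array([0, 0, 0, 0, 1])
sample_weight = np.array([1.0, 2.0, 1.0, 0.5, 3.0])

# Workaround: carry the weights through resampling as a temporary last column.
X_aug = np.column_stack([X_train, sample_weight])
X_aug_res, y_res = random_over_sample(X_aug, y_train)

# Split the temporary column back off before fitting a classifier.
X_res, weight_res = X_aug_res[:, :-1], X_aug_res[:, -1]
```

The resulting weight_res lines up row by row with X_res and y_res, so it can be passed as sample_weight to a classifier's fit method.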
Hi there, I am new here and am trying the following examples
Since make_imbalance returns the resampled X and y, what about their index? Can the index be returned as well?
Thanks!