scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Add method for regression #571

Open chjq201410695 opened 5 years ago

chjq201410695 commented 5 years ago

As the title says. I found a method for this implemented in R: https://github.com/paobranco/Pre-processingApproachesImbalanceRegression

The corresponding paper: https://www.semanticscholar.org/paper/SMOTE-for-Regression-Torgo-Ribeiro/43cda672b9ac0833086e19c90d42c2c0fbc361c6

glemaitre commented 5 years ago

I am not opposed to it.

glemaitre commented 5 years ago

closing in favor of #105

bwang482 commented 4 years ago

Hi @glemaitre am I right that currently only BalancedRandomForestClassifier from imblearn.ensemble can take real numbers as y for regression problems? Other ensemble models such as RUSBoostClassifier cannot do this? The oversampling strategies cannot do this either?

Thanks!

chkoar commented 4 years ago

Hi @glemaitre am I right that currently only BalancedRandomForestClassifier from imblearn.ensemble can take real numbers as y for regression problems? Other ensemble models such as RUSBoostClassifier cannot do this?

@bluemonk482 the names of the models you mentioned end with Classifier, which implies that they are applicable to classification tasks.

The oversampling strategies cannot do this either?

Currently no, but we are interested in including an implementation of such a method.

bwang482 commented 4 years ago

Thanks @chkoar !

I assume it is more complex than simply changing class BalancedRandomForestClassifier(RandomForestClassifier) to class BalancedRandomForestClassifier(RandomForestRegressor) in https://github.com/scikit-learn-contrib/imbalanced-learn/blob/c0aa81c40173bd28b863ccc1b82bbafcacb240c4/imblearn/ensemble/_forest.py?

glemaitre commented 4 years ago

Yes, because you need to understand and design a proper resampling strategy in the context of regression, which is not really straightforward, and there is almost no literature on this.

bwang482 commented 4 years ago

Understood. Thanks @glemaitre !

akatav commented 4 years ago

@glemaitre this thread is such a godsend for me! So, I understand that there is presently no way to generate synthetic data for regression problems, where the output variable y is a continuous value. Is that correct? Can the expert machine learners here suggest some way out of this sort of problem? More details are in my post: https://stats.stackexchange.com/questions/433740/regression-on-unevenly-distributed-high-dimensional-dataset

glemaitre commented 4 years ago

I am reopening this issue. We could make a generic tool that quantizes the target and allows applying any sampler. We could think about a meta-estimator to do the job. This would require what is called a relevance function.
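For illustration, here is a minimal sketch of what such a meta-sampler could look like. The class name RegressionSamplerWrapper, the uniform-width bins standing in for a real relevance function, and the trick of stacking y onto X are assumptions for the sketch, not existing imbalanced-learn API.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler


class RegressionSamplerWrapper:
    """Apply any classification sampler to data with a continuous target.

    The target is quantized into groups (a crude stand-in for a relevance
    function); the wrapped sampler balances those groups, and the continuous
    y values are stacked onto X so they travel through the resampling.
    """

    def __init__(self, sampler=None, binner=None, n_bins=5):
        self.sampler = sampler if sampler is not None else RandomOverSampler()
        self.binner = binner  # callable mapping y -> group labels
        self.n_bins = n_bins

    def fit_resample(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        if self.binner is not None:
            groups = self.binner(y)
        else:
            # Default quantization: uniform-width bins over the range of y.
            edges = np.linspace(y.min(), y.max(), self.n_bins + 1)[1:-1]
            groups = np.digitize(y, edges)
        # Resample X and y together so the continuous target is preserved.
        Xy = np.column_stack([X, y])
        Xy_res, _ = self.sampler.fit_resample(Xy, groups)
        return Xy_res[:, :-1], Xy_res[:, -1]
```

With RandomOverSampler the continuous targets are simply duplicated; with a SMOTE-like sampler they would be interpolated together with the features, which is only a rough approximation of SMOTER and breaks down when a bin contains fewer samples than the sampler's k_neighbors.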

ogencoglu commented 4 years ago

I believe these are relevant for this issue:

glevv commented 3 years ago

https://github.com/paobranco

She wrote several papers on the topic and has some of them implemented in R.

glevv commented 3 years ago

I think the simplest way to do it without adding new methods is to discretize the target (uniformly or with k-means; quantiles won't do, since quantile bins are equally populated by construction and leave nothing to rebalance), then fit an oversampler, and then apply an inverse transform (assign the midpoint of each bin instead of the bin number).

It should work through a Pipeline and something like scikit-learn's TransformedTargetRegressor.
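A rough sketch of that recipe (the helper name oversample_discretized and the choice of RandomOverSampler are illustrative, not part of imbalanced-learn):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from imblearn.over_sampling import RandomOverSampler


def oversample_discretized(X, y, n_bins=10, strategy="kmeans", random_state=None):
    """Discretize y, oversample the rare bins, then map bins back to midpoints."""
    # 1. Discretize the continuous target into ordinal bins (uniform or k-means).
    disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy=strategy)
    y_binned = disc.fit_transform(np.asarray(y).reshape(-1, 1)).ravel().astype(int)

    # 2. Treat the bins as classes and oversample the rare ones.
    sampler = RandomOverSampler(random_state=random_state)
    X_res, y_binned_res = sampler.fit_resample(X, y_binned)

    # 3. "Inverse transform": replace each bin label with its bin midpoint.
    edges = disc.bin_edges_[0]
    midpoints = (edges[:-1] + edges[1:]) / 2
    return X_res, midpoints[y_binned_res]
```

Note that with strategy="kmeans" some bins can end up very small, which is exactly the situation SMOTE-style samplers struggle with, so RandomOverSampler is the safer default here; the downside of the midpoint mapping is that all resampled targets collapse onto n_bins distinct values.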

pavelkomarov commented 2 years ago

I also vote for SMOTER. I don't want to have to download a different package https://pypi.org/project/smogn/ to do SMOTE with regression problems.