scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Which should I try first? #128

Closed. salamanders closed this issue 8 years ago.

salamanders commented 8 years ago

Is there a best method to try first? I'd guess oversampling, but I have no idea which specific method. Do any qualify as "not a bad first choice; of course your data matters and each method has strengths and weaknesses, but you might as well start with..."?

glemaitre commented 8 years ago

I would take a look at this.

In a nutshell, from what the authors say, over-sampling is better than under-sampling at high levels of imbalance. Otherwise, there is not much difference between the two.
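For reference, here is a minimal sketch of both strategies on a synthetic dataset. The 95/5 class weights are an arbitrary illustration, and the `fit_resample` call assumes a recent imbalanced-learn release (older versions exposed `fit_sample` instead):

```python
from collections import Counter

from sklearn.datasets import make_classification

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic binary problem with a 95/5 class imbalance (illustrative only).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Over-sampling duplicates minority samples; under-sampling drops majority ones.
for sampler in (RandomOverSampler(random_state=0),
                RandomUnderSampler(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```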

From my experience, NearMiss worked best on our problem.
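A sketch of NearMiss usage, reusing the same synthetic setup as above; `version=1` selects the NearMiss-1 heuristic, which keeps the majority samples whose average distance to their nearest minority neighbours is smallest:

```python
from collections import Counter

from sklearn.datasets import make_classification

from imblearn.under_sampling import NearMiss

# Same illustrative 95/5 imbalanced problem as above.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# NearMiss-1: retain majority samples closest on average to the minority class.
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)
print(Counter(y_res))
```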

dvro commented 8 years ago

@salamanders SMOTE (oversampling) is usually my go-to approach, followed by One-Sided Selection (undersampling). In my experience, the right choice depends on the amount of data and the imbalance ratio you have.

But keep in mind that it depends; there is no "killer" approach. I once used SMOTE and One-Sided Selection on a large dataset (~100K samples / 4K features), and they ended up achieving the same results as random over-sampling of the minority class and random under-sampling of the majority. When you're dealing with large datasets, running time is also an important consideration.
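For illustration, here is a minimal sketch comparing those four samplers side by side. The 90/10 class weights, the `RandomForestClassifier`, and the balanced-accuracy metric are assumptions for the example, not the setup from that project; note that only the training split is resampled, so the test set stays untouched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import OneSidedSelection, RandomUnderSampler

# Illustrative 90/10 imbalanced problem (assumption, not the original data).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample only the training split, then evaluate on the untouched test split.
for sampler in (SMOTE(random_state=0),
                OneSidedSelection(random_state=0),
                RandomOverSampler(random_state=0),
                RandomUnderSampler(random_state=0)):
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"{type(sampler).__name__}: {score:.3f}")
```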

Best of luck,

glemaitre commented 8 years ago

@salamanders Feel free to reopen the issue or ask on Gitter for further information.