I would take a look at this.
In a nutshell --- from what the authors say --- over-sampling is better than under-sampling at high levels of imbalance. Otherwise, there is not much difference between the two.
In my experience, NearMiss worked the best on our problem.
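For what it's worth, here is a minimal sketch of trying NearMiss with imbalanced-learn; the toy dataset is only for illustration, and `version=1` plus `fit_resample` follow the current API (older releases used `fit_sample`):

```python
# Minimal sketch: NearMiss under-sampling with imbalanced-learn.
# The synthetic dataset below is only illustrative.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

# Toy imbalanced dataset: roughly 5% minority class.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)
print("before:", Counter(y))

# NearMiss version 1 keeps the majority samples closest to the minority class.
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)
print("after: ", Counter(y_res))
```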
@salamanders SMOTE (oversampling) is usually my "go-to" approach, followed by One-Sided Selection (undersampling). In my experience, it depends on the amount of data and the imbalance ratio you have.
But keep in mind that it depends; there is no "killer" approach. I once used SMOTE and One-Sided Selection on a large dataset (~100K samples / 4K features), and they ended up achieving the same results as RandomOverSampling of the minority class and RandomUnderSampling of the majority. When you're dealing with large datasets, running time is also an important consideration.
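If anyone wants to reproduce that kind of comparison, a rough sketch along these lines works; the synthetic dataset and logistic regression are placeholders rather than the original setup, and the sampler names follow the imbalanced-learn API:

```python
# Rough comparison sketch: SMOTE / One-Sided Selection vs. plain random
# resampling. Dataset and classifier are placeholders; on large data the
# random samplers are often much faster for similar scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import OneSidedSelection, RandomUnderSampler

X, y = make_classification(
    n_samples=10000, n_features=40, weights=[0.9, 0.1], random_state=0
)

samplers = {
    "SMOTE": SMOTE(random_state=0),
    "OneSidedSelection": OneSidedSelection(random_state=0),
    "RandomOverSampler": RandomOverSampler(random_state=0),
    "RandomUnderSampler": RandomUnderSampler(random_state=0),
}

for name, sampler in samplers.items():
    # Resampling happens inside the pipeline, so it is applied only to
    # each training fold, never to the validation fold.
    pipe = Pipeline([("sampler", sampler), ("clf", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5)
    print(f"{name:>18}: AUC = {scores.mean():.3f}")
```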
Best of luck,
@salamanders Feel free to reopen the issue or to ask on gitter for further information.
Is there a best method to try first? I'd guess oversampling, but I have no idea which method. Does any of them qualify as "not a bad first choice; of course your data matters and each has its strengths and weaknesses, but you might as well start with..."?