ropensci / auunconf

repository for the Australian rOpenSci unconference 2016!

Classification for imbalanced data #39

Open ZhouFang928 opened 8 years ago

ZhouFang928 commented 8 years ago

Recently, I have been exploring methods for classification on imbalanced data. As far as I know, the most commonly used approach combines resampling (oversampling the minority class or subsampling the majority class) with a classification model such as boosted decision trees or random forests. Various online resources discuss this problem, for example the paper "Handling Imbalanced Data in Customer Churn Prediction Using Combined Sampling and Weighted Random", the blog post "8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset", and the GitHub resource https://github.com/topepo/ICHPS2015_Class_Imbalance/commit/master. One question that comes to mind is how imbalanced the data can get, and how we can make a model perform better when the proportion of the minority class drops to 10%, 1%, or even 0.1%. I hope this is an interesting topic for you as well.
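
To make the resampling step concrete, here is a minimal sketch of random undersampling of the majority class at the three imbalance levels mentioned above. It is only an illustration under stated assumptions: the labels are synthetic, the helper names `make_labels` and `undersample_indices` are hypothetical, and it uses only the Python standard library rather than any particular resampling package. A classifier would then be trained on the rows selected by the returned indices.

```python
import random
from collections import Counter


def make_labels(n, minority_rate, seed=0):
    """Build synthetic 0/1 labels (hypothetical data) with a fixed share of 1s."""
    n_pos = max(1, round(n * minority_rate))
    labels = [1] * n_pos + [0] * (n - n_pos)
    random.Random(seed).shuffle(labels)
    return labels


def undersample_indices(labels, ratio=1.0, seed=0):
    """Return row indices keeping every minority row and a random
    subset of majority rows.

    ratio is the number of majority rows kept per minority row,
    so ratio=1.0 gives a balanced sample.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)  # least frequent label
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    n_keep = min(len(maj_idx), int(ratio * len(min_idx)))
    kept = min_idx + rng.sample(maj_idx, n_keep)  # sample without replacement
    rng.shuffle(kept)
    return kept


# Compare class counts before and after undersampling at each imbalance level.
for rate in (0.10, 0.01, 0.001):
    y = make_labels(10_000, rate)
    idx = undersample_indices(y, ratio=1.0)
    y_balanced = [y[i] for i in idx]
    print(f"minority rate {rate}: {dict(Counter(y))} -> {dict(Counter(y_balanced))}")
```

With 10,000 rows and a 0.1% minority class, balanced undersampling leaves only 20 rows, which illustrates why the question gets harder as the minority proportion shrinks: a larger `ratio`, oversampling, or class weights in the model may be worth comparing at that level.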