Understanding how to work with unbalanced datasets

bkowshik commented 7 years ago

You get 90% accuracy from your model and think "awesome!", until you find out that 90% of the data belongs to a single class. This is called an "unbalanced dataset".
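To make the point concrete, here is a minimal sketch with made-up labels (not the osmcha data): a "model" that always predicts the majority class reaches 90% accuracy while never finding a single minority example.

```python
# Hypothetical labels: 90 "not problematic" (0) and 10 "problematic" (1).
y_true = [0] * 90 + [1] * 10

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true 1s that were predicted as 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Accuracy alone hides this failure, which is why per-class precision and recall are worth checking on unbalanced data.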

Experiments

I found the article, 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset, super interesting and wanted to try some of the ideas there. In the Jupyter notebook, I try the following:

- Try resampling your dataset
  - Undersample not-problematic edits
  - Oversample problematic edits
  - The ratio need not be 1:1 for binary classification problems
- Decision trees could perform well on imbalanced datasets.
- penalized-SVM: Imposes an additional cost on the model for making classification mistakes on the minority class during training.
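The resampling ideas above could look roughly like the sketch below. It uses synthetic data rather than the notebook's changesets, and the 3:1 target ratio and variable names are assumptions chosen only to show that the classes need not end up 1:1.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Synthetic stand-in data: 900 not-problematic (0) and 100 problematic (1) rows.
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)

X_major, X_minor = X[y == 0], X[y == 1]

# Undersample the majority class without replacement, down to 300 rows (3:1).
X_major_down = resample(X_major, replace=False, n_samples=300, random_state=0)
X_under = np.vstack([X_major_down, X_minor])
y_under = np.array([0] * len(X_major_down) + [1] * len(X_minor))

# Oversample the minority class with replacement, up to 300 rows (3:1).
X_minor_up = resample(X_minor, replace=True, n_samples=300, random_state=0)
X_over = np.vstack([X_major, X_minor_up])
y_over = np.array([0] * len(X_major) + [1] * len(X_minor_up))

print("undersampled class counts:", np.bincount(y_under))
print("oversampled class counts:", np.bincount(y_over))
```

Undersampling discards majority rows, while oversampling repeats minority rows, so the two trade off lost information against duplicated examples.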

I also found Tips on Practical Use on scikit-learn.org which has some interesting tips including the one below:

In SVC, if data for classification are unbalanced (e.g. many positive and few negative), set class_weight='balanced' and/or try different penalty parameters C.

Based on the tip above, I ran the following additional experiments:

Set class_weight='balanced'
Try different penalty parameters C
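A minimal sketch of these two experiments, assuming synthetic data from `make_classification` in place of the changeset features; the particular `C` values are arbitrary picks for illustration, not the ones used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with roughly a 9:1 class imbalance.
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Train one class-weighted SVC per penalty value and record minority recall.
results = {}
for C in (0.1, 1.0, 10.0):
    model = SVC(C=C, class_weight="balanced")
    model.fit(X_train, y_train)
    results[C] = recall_score(y_test, model.predict(X_test), pos_label=1)
    print(f"C={C}: minority recall = {results[C]:.2f}")
```

With `class_weight="balanced"`, mistakes on the rarer class are penalized more heavily during training, which is the same idea as the penalized-SVM tactic.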

Workflow

Prepare the dataset for training the model and normalize it.
Train the model on the normalized dataset.
Using the trained model, get predictions for the whole dataset.
Use the classification report as a measure of performance.
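The four steps above can be sketched as follows. This is an illustration on synthetic data, and it assumes that "normalize" means standardizing features with `StandardScaler`; the notebook may do this differently.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Prepare a (synthetic, stand-in) dataset and normalize it.
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# 2. Train the model on the normalized dataset.
model = SVC()
model.fit(X_scaled, y)

# 3. Get predictions for the whole dataset. As in the workflow described
#    above, this is the same data the model was trained on.
y_pred = model.predict(X_scaled)

# 4. Use the classification report as a measure of performance.
#    Label 0 is treated as "not problematic" and label 1 as "problematic".
report = classification_report(
    y, y_pred, target_names=["not problematic", "problematic"]
)
print(report)
```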

Results

After all this analysis, it was still the very first experiment, the vanilla SVC trained on all changesets, that had the best performance metrics.

                 precision    recall  f1-score   support

    problematic       0.98      0.03      0.06      5684
not problematic       0.91      1.00      0.95     53455

    avg / total       0.91      0.91      0.87     59139

Resources

@anandthakker, would love to get your :eyes: on the notebook and jump on a call to discuss more.

bkowshik commented 7 years ago

In the dataset from osmcha used in the analysis, there are 53,556 changesets that are good and 5,691 changesets that are harmful. The ratio of imbalance is 53556:5691 or approximately 9:1.
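As a quick check of the arithmetic, the sketch below computes the ratio from the counts quoted above. It also shows the per-class weights that scikit-learn's documented "balanced" heuristic, `n_samples / (n_classes * count)`, would give for those counts; the label names are only illustrative.

```python
# Changeset counts quoted in the comment above.
counts = {"good": 53556, "harmful": 5691}

n_samples = sum(counts.values())
n_classes = len(counts)

# Imbalance ratio of good to harmful changesets.
ratio = counts["good"] / counts["harmful"]

# "Balanced" weights: n_samples / (n_classes * count) for each class.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

print(f"ratio good:harmful = {ratio:.2f}:1")
for label, weight in weights.items():
    print(f"weight for {label}: {weight:.3f}")
```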

bkowshik commented 7 years ago

We have moved over to a common notebook at the link below:

https://github.com/mapbox/gabbar/blob/master/notebooks/workflow.ipynb
