mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning
MIT License

Understanding how to work with unbalanced datasets #17

Closed bkowshik closed 7 years ago

bkowshik commented 7 years ago

From Ian Vala on Quora,

You get 90% accuracy for your model and you are like "awesome!" until you find out, well 90% of the data was all on one class. This is called an "unbalanced dataset".
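To make the quote concrete, here is a minimal sketch of that trap. The labels are made up for illustration (90 "good", 10 "harmful"), and `majority_baseline_accuracy` is a hypothetical helper, not something from the notebook.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Made-up labels for illustration: 90% of the samples belong to one class.
labels = ["good"] * 90 + ["harmful"] * 10

baseline = majority_baseline_accuracy(labels)
print(f"Accuracy when always predicting the majority class: {baseline:.2f}")
```

Any classifier on such data has to beat this baseline before its accuracy means much, which is why the per-class precision and recall are more informative here.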

Experiments

I found the article, 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset, super interesting and wanted to try some of the ideas there. In the Jupyter notebook, I tried the following:

I also found Tips on Practical Use on scikit-learn.org which has some interesting tips including the one below:

In SVC, if data for classification are unbalanced (e.g. many positive and few negative), set class_weight='balanced' and/or try different penalty parameters C.
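As a rough illustration of what `class_weight='balanced'` does: scikit-learn documents the balanced weights as `n_samples / (n_classes * count)` for each class, so the rarer class gets a proportionally larger weight. The sketch below reproduces that arithmetic in plain Python; the helper name and the 900/100 counts are assumptions for illustration, not values from the notebook.

```python
def balanced_class_weights(class_counts):
    """Per-class weight n_samples / (n_classes * count), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical counts with a 9:1 imbalance.
counts = {"not problematic": 900, "problematic": 100}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts, errors on the minority class are weighted nine times more heavily than errors on the majority class, which is the effect of passing `class_weight='balanced'` to `SVC`.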

Based on the scikit-learn tip above, I ran the following additional experiments:

Workflow

  1. Prepare the dataset for training the model and normalize it.
  2. Train model on normalized dataset.
  3. Using trained model, get predictions for the whole dataset.
  4. Use the classification report as a measure of performance.
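The four steps above can be sketched roughly as follows. This is not the notebook's code: the data is synthetic (generated with `make_classification` at roughly 9:1), and `StandardScaler` is an assumed choice of normalization, since the issue does not say which one was used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the changeset features, with roughly 9:1 imbalance.
# Label 1 plays the role of "problematic", label 0 of "not problematic".
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=0,
)

# 1. Normalize the dataset (assumed here: zero mean, unit variance per feature).
X_scaled = StandardScaler().fit_transform(X)

# 2. Train a default ("vanilla") SVC on the normalized dataset.
model = SVC()
model.fit(X_scaled, y)

# 3. Get predictions for the whole dataset (the same rows used for training).
y_pred = model.predict(X_scaled)

# 4. Use the classification report as the measure of performance.
report = classification_report(
    y, y_pred, labels=[1, 0], target_names=["problematic", "not problematic"]
)
print(report)
```

Because step 3 predicts on the same rows the model was trained on, the report describes fit on seen data; a held-out split (for example with `train_test_split`) would give a more realistic estimate.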

Results

After all this analysis, the very first experiment, the vanilla SVC trained on all changesets, still had the best performance metrics.

                 precision    recall  f1-score   support

    problematic       0.98      0.03      0.06      5684
not problematic       0.91      1.00      0.95     53455

    avg / total       0.91      0.91      0.87     59139
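To read the "problematic" row: F1 is the harmonic mean of precision and recall, so a recall of 0.03 drags it down even though precision is 0.98. A small check using only the rounded values from the report (the helper name is made up for illustration):

```python
def f1_from(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values from the "problematic" row of the report.
precision, recall, support = 0.98, 0.03, 5684

f1_problematic = f1_from(precision, recall)
print(f"F1 for problematic: {f1_problematic:.2f}")

# Approximate number of problematic changesets the model actually flagged
# correctly; only a rough figure, because recall is rounded to two decimals.
approx_caught = round(recall * support)
print(f"Roughly {approx_caught} of {support} problematic changesets caught")
```

In other words, the model is almost always right when it flags a changeset, but it flags only on the order of 170 of the 5,684 problematic ones, so the 0.91 average mostly reflects the majority class.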

Resources


@anandthakker, would love to get your :eyes: on the notebook and jump on a call to discuss more.

bkowshik commented 7 years ago

In the dataset from osmcha used in the analysis, there are 53,556 changesets that are good and 5,691 changesets that are harmful. The ratio of imbalance is 53556:5691 or approximately 9:1.

bkowshik commented 7 years ago

We have moved over to a common notebook at the link below: