Increase training size for feature level classifier

mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning

MIT License


Increase training size for feature level classifier #45

Closed bkowshik closed 7 years ago

bkowshik commented 7 years ago

Ref https://github.com/mapbox/gabbar/issues/43

We currently use 5,269 changesets for training our feature level classifier.
Based on the changesets reviewed on osmcha that modify a single feature, it looks like we can potentially add up to 4,000 changesets.
A larger training dataset should in turn improve the model.

Next actions

[ ] Update dataset with the additional 4,000 changesets - @bkowshik

cc: @batpad @geohacker

bkowshik commented 7 years ago

Curious to see the effect that training size has on the model's metrics, we have the following:

[Image: index-2, plot of validation metrics against training size]

Notes / Questions

The gains in the metrics are diminishing, but the curves still have a significant positive slope.
If roc_auc score is 0.8 with 6,000 samples, what would it look like with 10,000 samples?
When do we know that we have enough samples?

cc: @anandthakker

bkowshik commented 7 years ago

Workflow

1. Set the number of samples to use for the current run
2. Use only this subset of samples from the labelled training data
3. Train a model on this subset of training data
4. Get predictions from the model for the entire validation dataset
5. Extract metrics on the validation dataset
6. Increase the number of samples to use for the next run and go again
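
The workflow above could be sketched roughly as below. This is only an illustration, not the gabbar code: `roc_auc`, `learning_curve`, `fit_centroid` and `make_samples` are hypothetical names, the centroid model is a stand-in for the real feature level classifier, and the data is synthetic.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid(samples):
    """Hypothetical stand-in model: score by relative distance to class centroids."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((i - j) ** 2 for i, j in zip(a, b))

    pos = mean([x for x, y in samples if y == 1])
    neg = mean([x for x, y in samples if y == 0])
    # Higher score means closer to the positive centroid than to the negative one.
    return lambda x: dist(x, neg) - dist(x, pos)


def learning_curve(train, validation, sizes, fit, seed=0):
    """Train on growing subsets of `train`, always scoring on all of `validation`."""
    shuffled = train[:]
    random.Random(seed).shuffle(shuffled)
    val_labels = [y for _, y in validation]
    curve = []
    for n in sizes:                      # steps 1 and 6: pick the size for this run
        subset = shuffled[:n]            # step 2: use only this subset
        predict = fit(subset)            # step 3: train on the subset
        scores = [predict(x) for x, _ in validation]   # step 4: predict on validation
        curve.append((n, roc_auc(val_labels, scores)))  # step 5: extract the metric
    return curve


# Synthetic one-feature data, purely for illustration.
rng = random.Random(42)


def make_samples(n):
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        samples.append(([rng.gauss(label, 1.0)], label))
    return samples


train, validation = make_samples(700), make_samples(300)
for n, auc in learning_curve(train, validation, [100, 300, 500, 700], fit_centroid):
    print(n, round(auc, 3))
```

Keeping the validation set fixed across runs means any change in the metric comes from the training size rather than from a different evaluation sample.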

bkowshik commented 7 years ago

Before, we had 8,620 labelled samples, of which 6,036 were used for training and 2,584 for validation. With the backfill done, we now have 10,165, of which we use 7,115 for training and 3,050 for validation.

In total we added 1,545 new changesets to the labelled dump. 🎉
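
As a quick cross-check of the counts quoted above, the snippet below recomputes the totals and the training share. It assumes the 7,115 figure refers to the training split, like the 6,036 figure before the backfill; the variable and function names are made up for this sketch.

```python
# Counts quoted in the comment above, before and after the backfill.
# Assumption: 7,115 is the training split, matching the earlier 6,036.
before = {"train": 6036, "validation": 2584}
after = {"train": 7115, "validation": 3050}


def summarize(split):
    """Return the total sample count and the fraction used for training."""
    total = split["train"] + split["validation"]
    return total, split["train"] / total


total_before, frac_before = summarize(before)
total_after, frac_after = summarize(after)
added = total_after - total_before

print(total_before, round(frac_before, 3))  # 8620 0.7
print(total_after, round(frac_after, 3))    # 10165 0.7
print(added)                                # 1545
```

Both splits come out at roughly 70% training and 30% validation, and the difference in totals matches the 1,545 newly added changesets.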

Interestingly, the nice upward graph has now become something like the one below. I don't understand why this is happening, though.

[Image: index-2, the same plot after the backfill]

We are 💯 to close here.