So, how does the performance fare when compared to manual reviews on osmcha?
- **471** changesets were predicted harmful, compared to **339** changesets reviewed harmful.
- **3,575** changesets were predicted not harmful, compared to **3,707** changesets reviewed not harmful.

| | Predicted harmful | Predicted not harmful |
| --- | --- | --- |
| Reviewed harmful | 258 | 81 |
| Reviewed not harmful | 213 | 3,494 |
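For reference, scikit-learn can reproduce this comparison directly. A minimal sketch, with toy stand-ins for the actual label arrays; precision and recall on the harmful class summarize the table (258 / 471 ≈ 0.55 and 258 / 339 ≈ 0.76 for the numbers above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy stand-ins: 1 = harmful, 0 = not harmful. In practice these come from
# the osmcha reviews (y_reviewed) and the model's predictions (y_predicted).
y_reviewed = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_predicted = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Rows: reviewed harmful / not harmful; columns: predicted harmful / not harmful.
print(confusion_matrix(y_reviewed, y_predicted, labels=[1, 0]))

# Precision: of the changesets predicted harmful, how many were reviewed harmful.
print(precision_score(y_reviewed, y_predicted))
# Recall: of the changesets reviewed harmful, how many the model caught.
print(recall_score(y_reviewed, y_predicted))
```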
This was super helpful to get the workflow into a notebook. Closing in favor of https://github.com/mapbox/gabbar/pull/24
Real changesets are amazing! :boom: They have both the new and old versions of all features in the changeset as JSON. I guess we could not have asked for anything more! Spectacular work @geohacker and @batpad. Thank you. :smiley:
In this PR, I explore how we could build a machine learning model using changeset data from real changesets and manual labels from osmcha on whether a changeset is harmful or not.
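To make that setup concrete, here is a minimal sketch of pairing changesets with their osmcha labels. The file names and JSON field names (`metadata`, `id`, the review columns) are assumptions for illustration, not the actual gabbar layout:

```python
import json

import pandas as pd

# Hypothetical inputs: real changesets as a JSON array, and an export of
# osmcha reviews with `changeset_id` and `harmful` columns.
with open('real-changesets.json') as f:
    changesets = json.load(f)
reviews = pd.read_csv('osmcha-reviews.csv')

# Map changeset id -> harmful label from the manual reviews.
labels = dict(zip(reviews['changeset_id'], reviews['harmful']))

# Keep only changesets that were manually reviewed on osmcha.
labelled = [
    (changeset, labels[int(changeset['metadata']['id'])])
    for changeset in changesets
    if int(changeset['metadata']['id']) in labels
]
```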
### Approach
- `sklearn.svm.SVC` as the standard classification algorithm.
- Started with **1,000** changesets, scaling up to **5,000** changesets.

### Features
I could extract **46 features** using a variety of data sources. It was a great learning experience to engineer these features, train a new model, and visualize how the performance parameters changed. The features fall into three categories (a sketch of the extraction step follows the list):

- User based
- Feature based
- Changeset based
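Since the real changesets carry both old and new versions of every feature, extraction can stay a pure function of the changeset JSON. A minimal sketch with one illustrative feature per category; the feature choices and field names here are assumptions, not the actual 46 features:

```python
def extract_features(changeset):
    """Return one illustrative feature per category (user, feature, changeset based)."""
    elements = changeset.get('elements', [])
    creates = sum(1 for e in elements if e.get('action') == 'create')
    modifies = sum(1 for e in elements if e.get('action') == 'modify')
    deletes = sum(1 for e in elements if e.get('action') == 'delete')
    return [
        # User based: e.g. number of changesets the user has made before.
        int(changeset['metadata'].get('changesets_count', 0)),
        # Feature based: e.g. how many existing features the changeset deletes.
        deletes,
        # Changeset based: e.g. the overall size of the edit.
        creates + modifies + deletes,
    ]
```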
### Data and ML Model
- **4,046** changesets: **339** harmful and **3,707** not harmful.
- Trained on **2,832** samples and tested on **1,214** samples.
- `GridSearchCV` for parameter tuning (a sketch of this setup follows the list).
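A minimal sketch of that training setup, assuming `X` holds the 46 features per changeset and `y` the harmful labels; the parameter grid is an assumption, and `test_size=0.3` reproduces the 2,832 / 1,214 split of 4,046 samples:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Placeholder data so the sketch runs standalone; in practice X and y come
# from the extracted features and the osmcha labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4046, 46))
y = rng.integers(0, 2, size=4046)

# 70 / 30 split: 2,832 training samples and 1,214 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Tune the SVC over a small, assumed parameter grid.
grid = GridSearchCV(
    SVC(),
    param_grid={'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.001]},
    cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # mean accuracy on the held-out samples
```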
### Results
Score: **0.85**.
cc: @anandthakker