mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning
MIT License

Prototype an anomaly detection model for highways #88

Open bkowshik opened 7 years ago

bkowshik commented 7 years ago

Ref: https://github.com/mapbox/gabbar/issues/80 and https://github.com/mapbox/gabbar/issues/69


We all know labelled data is gold in machine learning land. But in the context of OpenStreetMap and osmcha, there are two cases to consider:

1. Labelled harmful highways

On osmcha, labelling happens at the changeset level: a changeset is either good or harmful. But not every feature in a harmful changeset is necessarily harmful, so we should not assume that all features of a harmful changeset are harmful. In Gabbar, we therefore worked with changesets that touched exactly one feature: if the changeset was good, its only feature was good, and if the changeset was harmful, its only feature was harmful.

This worked OK for a generic classifier, but for the highway classifier the resulting dataset is too small. For example, the latest highway classifier was trained on 2217 good highways and a mere 55 harmful highways. With so few harmful examples, supervised learning algorithms might not be fed enough to grow strong and healthy.
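To put a number on the imbalance, here is a quick back-of-the-envelope calculation using only the two counts quoted above. The variable names are just for illustration.

```python
# Counts quoted above for the latest highway classifier's training set.
good, harmful = 2217, 55

total = good + harmful
harmful_share = harmful / total   # fraction of training samples that are harmful
ratio = good / harmful            # good samples per harmful sample

print(f"harmful share: {harmful_share:.1%}, about {ratio:.0f} good per harmful")
```

So the harmful class makes up only a couple of percent of the training data.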

2. Labelled good highways

But we have a (comparative) abundance of labelled highways that are good. The 2217 changesets from above are there, but there are even more. When a changeset is labelled good, it is safe to assume that all features in the changeset are good, including the highway features. Yay!

There are 50,000+ changesets labelled on osmcha. Assuming every changeset has at least one highway, since highways are among the frequently edited features on OpenStreetMap, we could potentially have around 50,000+ labelled good highways. This might be an interesting scenario to try anomaly detection models.
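The labelling rules from the two cases above can be sketched as a small function. This is only an illustration: `feature_labels` and the feature id strings are hypothetical and are not part of gabbar or osmcha.

```python
from typing import List, Optional


def feature_labels(changeset_label: str, feature_ids: List[str]) -> List[Optional[str]]:
    """Derive per-feature labels from a changeset-level label.

    Hypothetical helper for illustration, not part of gabbar or osmcha.
    """
    if changeset_label == "good":
        # A good changeset implies every feature in it is good.
        return ["good"] * len(feature_ids)
    if changeset_label == "harmful" and len(feature_ids) == 1:
        # Only a single-feature changeset pins the harm on a specific feature.
        return ["harmful"]
    # Harmful changeset with several features: we cannot tell which are harmful.
    return [None] * len(feature_ids)


print(feature_labels("good", ["way/1", "way/2"]))
print(feature_labels("harmful", ["way/1", "node/2"]))
```

Good labels propagate to every feature, while harmful labels only propagate when there is nothing else in the changeset to blame.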

From https://en.wikipedia.org/wiki/Anomaly_detection

anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.

Another potentially big advantage of anomaly detection models is that they flag when things are different from what is expected. This means we are not limited to the types of harmful edits we have seen or given the model for training, but are in a way ready for new and unknown types of anomalies. One important caveat: anomaly detection models don't tell you whether a changeset is good or bad; they tell you whether it is something expected or something different.
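To make the idea concrete, here is a minimal sketch of a detector that is fitted on good samples only and then flags anything that looks different. It is a toy z-score detector in plain Python, a stand-in and not the model used in this issue. The class name, the threshold and the two features (length in km, number of tags) are all assumptions made up for the example. The output follows the common convention of `1` for expected and `-1` for outlier.

```python
from statistics import mean, pstdev
from typing import List, Sequence


class ZScoreDetector:
    """Toy anomaly detector fitted on 'good' samples only (illustrative stand-in)."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold
        self.means: List[float] = []
        self.stds: List[float] = []

    def fit(self, good_samples: Sequence[Sequence[float]]) -> "ZScoreDetector":
        columns = list(zip(*good_samples))
        self.means = [mean(col) for col in columns]
        # Guard against zero spread so the division in predict() is always defined.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def predict(self, samples: Sequence[Sequence[float]]) -> List[int]:
        labels = []
        for sample in samples:
            # Largest per-feature distance from the 'good' mean, in standard deviations.
            z = max(abs(x - m) / s for x, m, s in zip(sample, self.means, self.stds))
            labels.append(-1 if z > self.threshold else 1)
        return labels


# Hypothetical features per highway edit: [length_km, number_of_tags].
good = [[1.0, 3], [1.2, 4], [0.9, 3], [1.1, 5], [1.0, 4]]
detector = ZScoreDetector().fit(good)
print(detector.predict([[1.05, 4], [40.0, 4]]))  # prints [1, -1]
```

Note that the detector never sees a harmful example: the 40 km highway is flagged only because it is unlike the good ones, which is not the same as it being harmful.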


cc: @anandthakker @geohacker @batpad

bkowshik commented 7 years ago

We have initial results from the anomaly detection model.

The following are results on the small validation dataset of 454 samples (55 labelled harmful and 399 labelled good, going by the row totals of the confusion matrix):

Confusion matrix

| | Predicted harmful | Predicted good |
|---|---|---|
| Labelled harmful | 40 | 15 |
| Labelled good | 41 | 358 |
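As a sanity check, per-class precision, recall and F1 can be recomputed from the four cells of the confusion matrix above. This is a minimal sketch in plain Python. It assumes the `-1` / `1` class labels in the classification report stand for harmful / good, which matches the support counts of 55 and 399. The helper `prf` is only for illustration.

```python
# Cells of the confusion matrix above (rows: labelled, columns: predicted).
harmful_as_harmful, harmful_as_good = 40, 15
good_as_harmful, good_as_good = 41, 358


def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class -1 (harmful): true positives are harmful edits predicted harmful.
harmful = prf(tp=harmful_as_harmful, fp=good_as_harmful, fn=harmful_as_good)
# Class 1 (good): true positives are good edits predicted good.
good = prf(tp=good_as_good, fp=harmful_as_good, fn=good_as_harmful)

total = harmful_as_harmful + harmful_as_good + good_as_harmful + good_as_good
accuracy = (harmful_as_harmful + good_as_good) / total

for name, scores in (("-1", harmful), ("1", good)):
    print(name, " ".join(f"{s:.2f}" for s in scores))
print(f"accuracy {accuracy:.2f}")
```

The harmful class comes out at roughly 0.49 precision and 0.73 recall: about half of the flagged highways are labelled good, while 40 of the 55 harmful ones are caught.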

Classification report

             precision    recall  f1-score   support

         -1       0.49      0.73      0.59        55
          1       0.96      0.90      0.93       399

avg / total       0.90      0.88      0.89       454

bkowshik commented 7 years ago

Initial results

Anomaly detection algorithms won't tell you whether a feature or a feature modification is good or harmful. Instead, the models identify outliers: data points that are different in comparison to the rest of the sample set.

A highway now open after construction! 🎆

[Screenshot, 2017-07-08]

Residential highways don't tend to connect towns

[Screenshot, 2017-07-08]

A highway=path eventually becomes waterway=river

[Screenshot, 2017-07-08]