Open bkowshik opened 7 years ago
We have initial results from the anomaly detection model.
The following are results on the small validation dataset which includes:
- 399 highways labelled good (potential inliers)
- 55 highways labelled harmful (potential outliers)

| | Predicted harmful | Predicted good |
|---|---|---|
| Labelled harmful | 40 | 15 |
| Labelled good | 41 | 358 |

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| -1 | 0.49 | 0.73 | 0.59 | 55 |
| 1 | 0.96 | 0.90 | 0.93 | 399 |
| avg / total | 0.90 | 0.88 | 0.89 | 454 |
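
As a sanity check, the per-class scores in the report can be re-derived from the confusion matrix counts. The sketch below assumes that `-1` is the harmful (outlier) class and `1` is the good (inlier) class, which matches the support counts of 55 and 399. The helper name `prf` is made up for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the confusion matrix above.
# Harmful class (-1): 40 caught, 41 good highways flagged by mistake, 15 missed.
harmful = prf(tp=40, fp=41, fn=15)
# Good class (1): 358 correct, 15 harmful highways passed as good, 41 missed.
good = prf(tp=358, fp=15, fn=41)

print("-1", " ".join(f"{v:.2f}" for v in harmful))  # -1 0.49 0.73 0.59
print(" 1", " ".join(f"{v:.2f}" for v in good))     #  1 0.96 0.90 0.93
```

The low precision on `-1` comes from the 41 good highways flagged as harmful, which is roughly as many as the 40 harmful highways correctly caught.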
Anomaly detection algorithms won't tell you whether a feature or a feature modification is good or harmful. Instead, the models identify outliers, data points that differ from the rest of the sample set.
`highway=path` eventually becomes `waterway=river`
Ref: https://github.com/mapbox/gabbar/issues/80 and https://github.com/mapbox/gabbar/issues/69
We all know labelled data is gold in machine learning land. But, in the context of OpenStreetMap and osmcha, there are two things:
1. Labelled harmful highways
On osmcha, labelling happens at the changeset level: a changeset is either good or harmful. But there are scenarios where not all features of a harmful changeset are harmful, so we should not assume that every feature of a harmful changeset is harmful. In Gabbar, we worked with changesets where only one feature was touched. Thus, if the changeset was good, its only feature was good, and if the changeset was harmful, its only feature was harmful.
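
The labelling rule above can be sketched roughly as follows. This is only an illustration: the changeset structure (a dict with `harmful` and `features` keys) and the function name `label_features` are assumptions, not the osmcha or Gabbar data model. The optional `trust_good_changesets` flag applies the looser assumption, discussed in the next section, that every feature of a good changeset is good.

```python
def label_features(changesets, trust_good_changesets=False):
    """Derive feature-level labels from changeset-level labels.

    changesets: list of dicts like {"harmful": bool, "features": [ids]}
    (a hypothetical structure used only for this sketch).
    """
    labelled = []
    for changeset in changesets:
        features = changeset["features"]
        if len(features) == 1:
            # Single-feature changeset: the feature inherits the label.
            label = "harmful" if changeset["harmful"] else "good"
            labelled.append((features[0], label))
        elif trust_good_changesets and not changeset["harmful"]:
            # Good changeset: assume every feature in it is good.
            labelled.extend((feature, "good") for feature in features)
        # Multi-feature harmful changesets are skipped: we cannot tell
        # which of their features is the harmful one.
    return labelled


changesets = [
    {"harmful": True, "features": ["way/1"]},
    {"harmful": True, "features": ["way/2", "way/3"]},
    {"harmful": False, "features": ["way/4", "way/5"]},
    {"harmful": False, "features": ["way/6"]},
]
strict = label_features(changesets)
relaxed = label_features(changesets, trust_good_changesets=True)
print(strict)   # [('way/1', 'harmful'), ('way/6', 'good')]
print(relaxed)  # [('way/1', 'harmful'), ('way/4', 'good'), ('way/5', 'good'), ('way/6', 'good')]
```

Note that the relaxed rule only grows the set of good features. The number of harmful features stays the same.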
This worked OK for a generic classifier, but for the highway classifier the dataset is too small. For example, the latest highway classifier was trained on 2217 good highways and a mere 55 harmful highways. Yes, the number of harmful highways is low. This means supervised learning algorithms might not be fed enough to be strong and healthy.

2. Labelled good highways
But we have a (comparative) abundance of labelled highways that are good. The 2217 changesets from above are there, but there are even more. When a changeset is labelled good, it is safe to assume all features in the changeset are good, including the highway features. Yay!

There are 50,000+ changesets labelled on osmcha. Assuming every changeset has at least one highway, as highways are among the most frequently edited features on OpenStreetMap, we could potentially have around 50,000+ labelled good highways. This might be an interesting scenario to try anomaly detection models.

From https://en.wikipedia.org/wiki/Anomaly_detection
Another potentially big advantage of anomaly detection models is that they flag when things are different than expected. This means we are not limited to the types of harmful edits we have seen or given the model for training, but are in a way ready for new and unknown types of anomalies. One important thing about anomaly detection is that these models don't tell you whether a changeset is good or bad; they tell you whether it is something expected or something different.
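
To make the idea concrete, here is a minimal sketch of a detector that is fitted on good samples only and then flags anything that looks different. It is a deliberately simple per-attribute z-score check, not the model that produced the results in this issue. The class name `ZScoreDetector`, the threshold of 3, and the two attributes (tag count, length in km) are all assumptions for illustration. The `-1` / `1` outputs follow the same outlier / inlier convention as the report above.

```python
from statistics import mean, pstdev


class ZScoreDetector:
    """Flags a sample as an outlier if any attribute is far from the
    mean of the (good-only) training samples."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, samples):
        columns = list(zip(*samples))
        self.means = [mean(column) for column in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stdevs = [pstdev(column) or 1.0 for column in columns]
        return self

    def predict(self, samples):
        labels = []
        for sample in samples:
            z_scores = [
                abs(value - m) / s
                for value, m, s in zip(sample, self.means, self.stdevs)
            ]
            # -1 means outlier (different), 1 means inlier (expected).
            labels.append(-1 if max(z_scores) > self.threshold else 1)
        return labels


# Hypothetical good highways: (number of tags, length in km).
good_highways = [(3, 1.0), (4, 1.2), (3, 0.9), (5, 1.1), (4, 1.0)]
detector = ZScoreDetector().fit(good_highways)

predictions = detector.predict([(4, 1.05), (4, 25.0)])
print(predictions)  # [1, -1]
```

The second sample is flagged only because its length is far from anything seen during fitting. The detector has no notion of whether that is harmful, which is exactly the caveat described above.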
cc: @anandthakker @geohacker @batpad