So, how does the performance fare when compared to manual reviews on osmcha?
- **471** changesets were predicted harmful, compared to **339** changesets reviewed harmful.
- **3,575** changesets were predicted not harmful, compared to **3,707** changesets reviewed not harmful.

| | Predicted harmful | Predicted not harmful |
| --- | --- | --- |
| Reviewed harmful | 258 | 81 |
| Reviewed not harmful | 213 | 3,494 |
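For reference, scikit-learn can reproduce this comparison directly. A minimal sketch, with toy stand-ins for the actual label arrays; precision and recall on the harmful class summarize the table (258 / 471 ≈ 0.55 and 258 / 339 ≈ 0.76 for the numbers above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy stand-ins: 1 = harmful, 0 = not harmful. In practice these come from
# the osmcha reviews (y_reviewed) and the model's predictions (y_predicted).
y_reviewed = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_predicted = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Rows: reviewed harmful / not harmful; columns: predicted harmful / not harmful.
print(confusion_matrix(y_reviewed, y_predicted, labels=[1, 0]))

# Precision: of the changesets predicted harmful, how many were reviewed harmful.
print(precision_score(y_reviewed, y_predicted))
# Recall: of the changesets reviewed harmful, how many the model caught.
print(recall_score(y_reviewed, y_predicted))
```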
This was super helpful to get the workflow into a notebook. Closing in favor of https://github.com/mapbox/gabbar/pull/24
Real changesets are amazing! :boom: They have both the new and old versions of all features in the changeset as JSON. I guess we could not have asked for anything more! Spectacular work @geohacker and @batpad. Thank you. :smiley:
In this PR, I explore how we could build a machine learning model using changeset data from real changesets and manual labels from osmcha on whether a changeset is harmful or not.
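To make that setup concrete, here is a minimal sketch of pairing changesets with their osmcha labels. The file names and JSON field names (`metadata`, `id`, the review columns) are assumptions for illustration, not the actual gabbar layout:

```python
import json

import pandas as pd

# Hypothetical inputs: real changesets as a JSON array, and an export of
# osmcha reviews with `changeset_id` and `harmful` columns.
with open('real-changesets.json') as f:
    changesets = json.load(f)
reviews = pd.read_csv('osmcha-reviews.csv')

# Map changeset id -> harmful label from the manual reviews.
labels = dict(zip(reviews['changeset_id'], reviews['harmful']))

# Keep only changesets that were manually reviewed on osmcha.
labelled = [
    (changeset, labels[int(changeset['metadata']['id'])])
    for changeset in changesets
    if int(changeset['metadata']['id']) in labels
]
```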
### Approach
- `sklearn.svm.SVC` as the standard classification algorithm.
- Started with **1,000** changesets, scaling up to **5,000** changesets.

### Features
I could extract **46 features** using a variety of data sources. It was a great learning experience to engineer these features, train a new model, and visualize how the performance parameters changed. The features fall into three categories (a sketch of the extraction step follows the list):

- User based
- Feature based
- Changeset based
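Since the real changesets carry both old and new versions of every feature, extraction can stay a pure function of the changeset JSON. A minimal sketch with one illustrative feature per category; the feature choices and field names here are assumptions, not the actual 46 features:

```python
def extract_features(changeset):
    """Return one illustrative feature per category (user, feature, changeset based)."""
    elements = changeset.get('elements', [])
    creates = sum(1 for e in elements if e.get('action') == 'create')
    modifies = sum(1 for e in elements if e.get('action') == 'modify')
    deletes = sum(1 for e in elements if e.get('action') == 'delete')
    return [
        # User based: e.g. number of changesets the user has made before.
        int(changeset['metadata'].get('changesets_count', 0)),
        # Feature based: e.g. how many existing features the changeset deletes.
        deletes,
        # Changeset based: e.g. the overall size of the edit.
        creates + modifies + deletes,
    ]
```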
### Data and ML Model
- **4,046** changesets: **339** harmful and **3,707** not harmful.
- Trained on **2,832** samples and tested on **1,214** samples.
- `GridSearchCV` for parameter tuning (a sketch of this setup follows the list).
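A minimal sketch of that training setup, assuming `X` holds the 46 features per changeset and `y` the harmful labels; the parameter grid is an assumption, and `test_size=0.3` reproduces the 2,832 / 1,214 split of 4,046 samples:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Placeholder data so the sketch runs standalone; in practice X and y come
# from the extracted features and the osmcha labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4046, 46))
y = rng.integers(0, 2, size=4046)

# 70 / 30 split: 2,832 training samples and 1,214 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Tune the SVC over a small, assumed parameter grid.
grid = GridSearchCV(
    SVC(),
    param_grid={'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.001]},
    cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # mean accuracy on the held-out samples
```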
### Results
Score: **0.85**.
cc: @anandthakker