mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning

Feature level classifier in Gabbar #43

Closed bkowshik closed 7 years ago

bkowshik commented 7 years ago

Gabbar has traditionally been a changeset level classifier: given a changeset ID, Gabbar extracts features at the changeset level to predict whether the changeset is harmful or not. Let's try a feature level classifier as part of Gabbar.

Why a feature level classifier?

Feature level dataset

Thanks to osmcha's filters, we can select reviewed changesets where at most one feature was created, modified, or deleted. The counts are below; a sketch of this filtering follows the table.

| One feature | Number of changesets reviewed | Harmful changesets |
| --- | --- | --- |
| Created | 3,333 | 413 |
| Modified | 9,727 | 2,264 |
| Deleted | 321 | 20 |
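
For illustration, a minimal sketch of this filtering with pandas, assuming the reviewed changesets are exported from osmcha as a CSV; the file name and columns (`features_created`, `features_modified`, `features_deleted`, `harmful`) are assumptions, not osmcha's actual export format:

```python
# Sketch: build the three one-feature subsets from an assumed osmcha export.
# Column names here are hypothetical placeholders.
import pandas as pd

changesets = pd.read_csv("reviewed-changesets.csv")

def one_feature_only(df, column, others):
    """Changesets with exactly one feature of the given kind and none of the others."""
    mask = df[column] == 1
    for other in others:
        mask &= df[other] == 0
    return df[mask]

created = one_feature_only(changesets, "features_created",
                           ["features_modified", "features_deleted"])
modified = one_feature_only(changesets, "features_modified",
                            ["features_created", "features_deleted"])
deleted = one_feature_only(changesets, "features_deleted",
                           ["features_created", "features_modified"])

for name, subset in [("Created", created), ("Modified", modified), ("Deleted", deleted)]:
    print(name, len(subset), int(subset["harmful"].sum()))
```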

cc: @batpad

bkowshik commented 7 years ago

One feature modification classifier

From osmcha, we can see that on average:

From the changesets manually labelled on osmcha:

Thus, in the first iteration of the feature level classifier, I will focus on changesets that have only one feature modified. A classifier with a good detection rate should help us identify 30% of the total harmful changesets. :crossed_fingers:

bkowshik commented 7 years ago

Model parameter tuning
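
As an illustration, a minimal sketch of what cross-validated parameter tuning could look like with scikit-learn; the model family, parameter grid, and synthetic data below are assumptions, not necessarily Gabbar's actual setup:

```python
# Sketch: hyperparameter tuning with cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the extracted changeset features and harmful labels (1 = harmful).
X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                        weights=[0.9], random_state=42)

param_grid = {
    "n_estimators": [100, 250],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.01, 0.1],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)
```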

bkowshik commented 7 years ago

Metrics on validation dataset

| | Predicted good | Predicted harmful |
| --- | --- | --- |
| Labelled good | 223 | 232 |
| Labelled harmful | 40 | 54 |

```
             precision    recall  f1-score   support

          0       0.85      0.49      0.62       455
          1       0.19      0.57      0.28        94

avg / total       0.74      0.50      0.56       549
```
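
These numbers can be produced with scikit-learn's metrics helpers; a minimal sketch, assuming arrays `y_validation` (reviewer labels) and `y_predicted` (model output) from the steps above:

```python
# Sketch: confusion matrix and per-class report for the validation split.
# y_validation and y_predicted are assumed to exist already (0 = good, 1 = harmful).
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_validation, y_predicted))
# Rows are labelled classes, columns are predicted classes, e.g.:
# [[223 232]
#  [ 40  54]]

print(classification_report(y_validation, y_predicted))
```
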
bkowshik commented 7 years ago

Jupyter notebook with analysis is at the link below:

bkowshik commented 7 years ago

Manually reviewed 20 changesets that were labelled harmful but predicted to be not harmful.

Notes

bkowshik commented 7 years ago

Things did improve after adding all the features mentioned above.

| | Predicted good | Predicted harmful |
| --- | --- | --- |
| Labelled good | 382 | 67 |
| Labelled harmful | 62 | 26 |

```
             precision    recall  f1-score   support

          0       0.86      0.85      0.86       449
          1       0.28      0.30      0.29        88

avg / total       0.77      0.76      0.76       537
```

Variation

| | Predicted good | Predicted harmful |
| --- | --- | --- |
| Labelled good | +71% | -71% |
| Labelled harmful | +55% | -52% |

NOTE: A positive percentage denotes an increase compared to the previous run, while a negative percentage denotes a decrease. For example, the labelled good and predicted good count went from 223 to 382, a change of (382 - 223) / 223 ≈ +71%.

bkowshik commented 7 years ago

Progress metrics on the validation dataset

After adding features

Before adding features


cc: @batpad

bkowshik commented 7 years ago

@manoharuss @krishnanammala Following is a csv file with 50 changesets predicted by Gabbar to be problematic and 50 changesets predicted by Gabbar to be good, a total of 100 changesets.
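
A minimal sketch of how such a 50/50 review sample could be drawn from the model's predictions; the `predictions` dataframe and its columns are assumptions, not Gabbar's actual code:

```python
# Sketch: draw 50 changesets predicted problematic and 50 predicted good
# for manual review on osmcha. `predictions` is an assumed dataframe with
# hypothetical columns changeset_id and prediction (1 = problematic, 0 = good).
import pandas as pd

problematic = predictions[predictions["prediction"] == 1].sample(50, random_state=0)
good = predictions[predictions["prediction"] == 0].sample(50, random_state=0)

review_sample = pd.concat([problematic, good]).sample(frac=1, random_state=0)  # shuffle
review_sample[["changeset_id", "prediction"]].to_csv("review-sample.csv", index=False)
```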

Can you please review these changesets on osmcha and label them as usual with a 👍 or 👎?


My expectation based on model metrics :crossed_fingers:


cc: @planemad

krishnanammala commented 7 years ago

@manoharuss and I split the 100 changesets in half; I took the first 50 changesets and reviewed them. Here 👇 are my observations:

cc @bkowshik

bkowshik commented 7 years ago

NOTE: The following are numbers for the 100 changesets dump.


Thank you @manoharuss and @krishnanammala. We do have quite a long way to go. 😞

Confusion matrix

| | Predicted good | Predicted harmful |
| --- | --- | --- |
| Labelled good | 46 | 49 |
| Labelled harmful | 3 | 1 |

Learnings

bkowshik commented 7 years ago

@anandthakker and I had a great discussion on the latest version of Gabbar and its predictions.

On training dataset

| | Predicted good | Predicted harmful |
| --- | --- | --- |
| Labelled good | 4,850 | 5 |
| Labelled harmful | 0 | 437 |

On validation dataset

| | Predicted good | Predicted harmful |
| --- | --- | --- |
| Labelled good | 2,159 | 71 |
| Labelled harmful | 43 | 166 |

@manoharuss and @krishnanammala, we are good for the second round of 👀 from you. In the following csv, the sheet 2017-06-12 (Mon) has 50 changesets predicted problematic and 50 changesets predicted good by the latest model in Gabbar.

Can you please review these changesets on osmcha and label them as usual with a 👍 or 👎?

bkowshik commented 7 years ago

@manoharuss @krishnanammala I had missed posting back the learnings from the review you did last time. Posting them here with additional details.

Adding old_name to features

In the dataset, there are a total of 27 changesets by user Порфирий where a feature got an old_name. The old model had flagged a majority of them as problematic. With the new model, we are getting:

I don't see anything that stands out with the 2 changesets flagged harmful. I guess this is what we get with the current setup.
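
As an illustration, a minimal sketch of how an old_name signal could be derived from a feature's tags before and after the edit; the helper and tag dictionaries are assumptions, not Gabbar's actual extraction code:

```python
# Sketch: flag whether a feature gained an old_name tag in this version.
# old_tags / new_tags are the feature's tag dictionaries before and after
# the edit; the helper name is a hypothetical placeholder.
def gained_old_name(old_tags, new_tags):
    return int("old_name" not in old_tags and "old_name" in new_tags)

print(gained_old_name({"name": "Park Lane"},
                      {"name": "Park Avenue", "old_name": "Park Lane"}))  # 1
print(gained_old_name({"name": "Park Lane"}, {"name": "Park Avenue"}))    # 0
```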

Duplicate looking tags

The idea was to flag changesets when features had both the tags building and building_1. Ex: https://osmcha.mapbox.com/48444157/. But the way I calculated the duplicate count resulted in some side-effects, because of which the following were flagged :thumbsdown: as well:

Will think of a workaround for this.
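
For reference, a naive sketch of how such a duplicate-tag count could be computed by stripping trailing numeric suffixes from keys; an approach like this also matches legitimately numbered keys, which may be the kind of side-effect described above (the function is an assumption, not Gabbar's actual code):

```python
# Sketch: count "duplicate looking" tag keys, e.g. building and building_1
# on the same feature, by stripping a trailing _<number> from every key
# and counting collisions.
import re
from collections import Counter

def duplicate_tag_count(tags):
    base_keys = [re.sub(r"_\d+$", "", key) for key in tags]
    counts = Counter(base_keys)
    return sum(count - 1 for count in counts.values() if count > 1)

print(duplicate_tag_count({"building": "yes", "building_1": "house"}))            # 1
print(duplicate_tag_count({"name": "Main Street", "name_1": "State Route 1"}))    # 1 (side-effect)
```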

bkowshik commented 7 years ago

Thank you @manoharuss and @krishnanammala. Action now at: https://github.com/mapbox/gabbar/issues/69