mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning
MIT License

Bag of Tags #74

Closed bkowshik closed 7 years ago

bkowshik commented 7 years ago

Ref: https://github.com/mapbox/gabbar/issues/69

In Natural Language Processing (NLP), Bag of Words is a popular technique: text is represented as a bag of its words, disregarding grammar and even word order but keeping multiplicity.
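As a rough illustration of the Bag of Words idea, here is a minimal sketch using only the Python standard library. The function name `bag_of_words` and the sample sentences are made up for this example, and splitting on whitespace is a simplifying assumption; a real pipeline would tokenize more carefully.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for `text`, ignoring grammar and word order."""
    # Lowercase and split on whitespace; punctuation handling is
    # deliberately left out of this sketch.
    return Counter(text.lower().split())


# Two sentences with the same words in a different order
# produce the same bag, but repeated words are still counted.
first = bag_of_words("the road joins the highway")
second = bag_of_words("the highway joins the road")
```

Both calls give the same counts, and "the" is counted twice in each, which is the "keeping multiplicity" part.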

The concept of Bag of Tags follows the same lines. All property tags from all samples in the training dataset form the Bag of Tags. Ex:

NOTE: harmful=0 represents a good changeset and harmful=1 a problematic changeset.

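| Changeset | harmful | highway | name | oneway | surface | maxspeed | vehicle | ... |
|-----------|---------|---------|------|--------|---------|----------|---------|-----|
| 47514474 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | ... |
| 46429851 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... |
| 47349936 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ... |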

We collect all tags from changesets labelled with a :thumbsdown: and one-hot encode them. We then use these as attributes to train a classifier that predicts whether a changeset is good or problematic based on the occurrence of tags.
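The encoding step could look roughly like the sketch below. It is plain Python rather than the project's actual code: the helper `bag_of_tags` and the `changeset_tags` mapping are hypothetical, and the tag keys are read off the visible columns of the example table above (the columns hidden behind "..." are not represented).

```python
def bag_of_tags(changeset_tags):
    """Build a binary tag-occurrence matrix from {changeset_id: [tag keys]}."""
    # Vocabulary: every tag key seen in any changeset, in order of first appearance.
    vocabulary = list(dict.fromkeys(
        key for keys in changeset_tags.values() for key in keys
    ))
    # One row per changeset: 1 if the tag key occurs, 0 otherwise.
    rows = {
        changeset_id: [1 if key in keys else 0 for key in vocabulary]
        for changeset_id, keys in changeset_tags.items()
    }
    return vocabulary, rows


# Hypothetical input: tag keys per changeset, taken from the visible
# columns of the example table (elided columns are left out).
changeset_tags = {
    47514474: ["highway", "name", "oneway", "surface"],
    46429851: ["highway", "name"],
    47349936: ["highway", "name", "oneway", "surface", "maxspeed", "vehicle"],
}

vocabulary, rows = bag_of_tags(changeset_tags)
```

The resulting rows, paired with the `harmful` labels, are the kind of input a classifier such as scikit-learn's `DecisionTreeClassifier` would take through `fit(X, y)`.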


cc: @anandthakker @geohacker @batpad

bkowshik commented 7 years ago

Dataset

NOTE: Changesets that satisfy the following rules will be used in the first iteration.

In the dataset, the following satisfy the above conditions:

bkowshik commented 7 years ago

So far, I have been using 100% of the dataset. This has meant using all the good changesets, which outnumber harmful changesets by 10:1. Because detecting harmful changesets is the priority, we could throttle the number of good changesets in the training sample. I could see the following trend with a simple DecisionTreeClassifier.

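| Training good | Training harmful | True positive | False positive | True negative | False negative |
|---------------|------------------|---------------|----------------|---------------|----------------|
| 32 | 32 | 6 | 186 | 491 | 6 |
| 64 | 32 | 6 | 157 | 520 | 6 |
| 160 | 32 | 3 | 56 | 621 | 9 |
| 320 | 32 | 3 | 31 | 646 | 9 |
| 640 | 32 | 3 | 17 | 660 | 9 |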

NOTE: True positives are changesets both labelled and predicted harmful
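To make the trend easier to read, the sketch below shows one way to throttle the good changesets and to turn each row of the table into precision and recall for the harmful class. The helpers `throttle_good` and `precision_recall` are hypothetical names for this example, and random sampling with a fixed seed is an assumption, since the thread does not say how the good changesets were picked.

```python
import random


def throttle_good(good, harmful, n_good, seed=0):
    """Keep a random sample of `n_good` good changesets plus every harmful one."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    # rng.sample raises ValueError if n_good exceeds len(good).
    return rng.sample(good, n_good) + list(harmful)


def precision_recall(tp, fp, fn):
    """Precision and recall for the harmful (positive) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Rows from the table above: (training good, TP, FP, TN, FN).
results = [
    (32, 6, 186, 491, 6),
    (64, 6, 157, 520, 6),
    (160, 3, 56, 621, 9),
    (320, 3, 31, 646, 9),
    (640, 3, 17, 660, 9),
]

# Map each "training good" size to its (precision, recall) pair.
metrics = {
    n_good: precision_recall(tp, fp, fn)
    for n_good, tp, fp, tn, fn in results
}
```

Read this way, going from 32 to 640 good training changesets halves recall on harmful changesets (6 of 12 caught, down to 3 of 12), while precision rises from 6/192 to 3/20 because far fewer good changesets are flagged.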

bkowshik commented 7 years ago

Workflow

Priorities

[screenshot: 2017-06-22 at 1:14:43 pm] [screenshot: 2017-06-22 at 1:38:57 pm]

Training dataset

Changesets labelled :thumbsdown: but predicted :thumbsup:

[screenshot: 2017-06-22 at 12:51:18 pm]

Changesets labelled :thumbsdown: and predicted :thumbsdown:

[screenshot: 2017-06-22 at 12:58:08 pm] [screenshot: 2017-06-22 at 1:08:06 pm] [screenshot: 2017-06-22 at 1:22:02 pm] [screenshot: 2017-06-22 at 1:25:37 pm]

Validation dataset

Changesets labelled :thumbsdown: but predicted :thumbsup:

[screenshot: 2017-06-22 at 2:57:43 pm]