bkowshik commented 7 years ago

Ref: https://github.com/mapbox/gabbar/issues/69

In the field of Natural Language Processing (NLP), the Bag of Words technique is a popular one. Basically, text is represented as a bag of words, disregarding grammar and even word order but keeping multiplicity.

https://en.wikipedia.org/wiki/Bag-of-words_model

Along similar lines is the concept of a Bag of Tags: all property tags from all samples in the training dataset form the Bag of Tags, and each changeset is represented by which of those tags are present. Ex:

NOTE: harmful=0 represents a good changeset and harmful=1 a problematic changeset.

Changeset	highway	name	oneway	surface	maxspeed	vehicle	...
47514474	1	1	1	1	0	0	...
46429851	1	1	0	0	0	0	...
47349936	1	1	1	1	1	1	...

We collect all tags from changesets labelled with a :thumbsdown: and OneHotEncode them. We then use these as attributes to train a classifier that learns to predict whether a changeset is good or problematic based on the occurrence of tags.
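A minimal sketch of the encoding step, in plain Python so that it runs without extra dependencies. The tag sets below are read off the visible columns of the table above (the elided `...` columns are left out), and the helper names `build_vocabulary` and `encode` are hypothetical, not part of gabbar.

```python
# Tag keys present on the feature of each changeset, taken from the
# visible columns of the example table (elided columns are not included).
samples = {
    "47514474": {"highway", "name", "oneway", "surface"},
    "46429851": {"highway", "name"},
    "47349936": {"highway", "name", "oneway", "surface", "maxspeed", "vehicle"},
}


def build_vocabulary(tag_sets):
    """Collect every tag key seen in any sample, in a stable (sorted) order."""
    vocabulary = set()
    for tags in tag_sets:
        vocabulary.update(tags)
    return sorted(vocabulary)


def encode(tags, vocabulary):
    """Return a 0/1 vector: 1 if the tag key is present on the feature, else 0."""
    return [1 if key in tags else 0 for key in vocabulary]


vocabulary = build_vocabulary(samples.values())
matrix = {changeset_id: encode(tags, vocabulary) for changeset_id, tags in samples.items()}

print(vocabulary)
for changeset_id, row in matrix.items():
    print(changeset_id, row)
```

Each row of `matrix` can then be passed to a classifier as the attribute vector for that changeset. The vocabulary should be built from the training set only and reused unchanged when encoding new changesets, so that column positions stay consistent.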

cc: @anandthakker @geohacker @batpad

bkowshik commented 7 years ago

Dataset

NOTE: Changesets that satisfy the following rules will be used in the first iteration.

Changeset where either one feature is created or one feature is modified
The feature has a highway tag among other tags
If the feature is modified, it is a property modification and not a geometry modification

In the dataset, the following satisfy the above conditions:

2,294 labelled changesets to use for training and validation
6,352 unlabelled samples to use for testing
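The three selection rules above could be expressed as a simple filter. This is only a sketch: the `Changeset` class and all of its field names are hypothetical stand-ins, not gabbar's actual data model, and the first rule is assumed to mean that exactly one feature is touched in the changeset.

```python
from dataclasses import dataclass, field


@dataclass
class Changeset:
    # Hypothetical fields, chosen only to illustrate the rules.
    features_created: int
    features_modified: int
    tags: dict = field(default_factory=dict)  # tags of the single feature
    geometry_changed: bool = False


def is_candidate(changeset):
    """Return True if the changeset satisfies the first-iteration rules."""
    # Rule 1: either one feature is created or one feature is modified
    # (assumed here to mean exactly one feature touched overall).
    created_only = changeset.features_created == 1 and changeset.features_modified == 0
    modified_only = changeset.features_created == 0 and changeset.features_modified == 1
    if not (created_only or modified_only):
        return False
    # Rule 2: the feature has a highway tag among its tags.
    if "highway" not in changeset.tags:
        return False
    # Rule 3: a modification must change properties only, not geometry.
    if modified_only and changeset.geometry_changed:
        return False
    return True


# Usage: a property-only edit to a highway passes; a geometry edit does not.
print(is_candidate(Changeset(0, 1, {"highway": "residential", "name": "A"})))
print(is_candidate(Changeset(0, 1, {"highway": "residential"}, geometry_changed=True)))
```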

bkowshik commented 7 years ago

Traditionally, I have been using 💯 of the datasets. This has meant using all the good changesets, which outnumber harmful changesets by a ratio of 10:1. Because detecting harmful changesets is the priority, we could throttle the number of good changesets in the training sample. I could see the following trend with a simple DecisionTreeClassifier.

Training good	Training harmful	True positive	False positive	True negative	False negative
32	32	6	186	491	6
64	32	6	157	520	6
160	32	3	56	621	9
320	32	3	31	646	9
640	32	3	17	660	9

NOTE: True positives are changesets both labelled and predicted harmful
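To make the trend easier to read, the counts in the table can be turned into precision and recall for the harmful class. The sketch below only does arithmetic on the numbers already listed; the function names are illustrative.

```python
# (training good, training harmful, TP, FP, TN, FN), copied from the table above.
rows = [
    (32, 32, 6, 186, 491, 6),
    (64, 32, 6, 157, 520, 6),
    (160, 32, 3, 56, 621, 9),
    (320, 32, 3, 31, 646, 9),
    (640, 32, 3, 17, 660, 9),
]


def precision(tp, fp):
    """Share of changesets predicted harmful that really are harmful."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of truly harmful changesets that the classifier caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


for good, harmful, tp, fp, tn, fn in rows:
    print(
        f"good={good:>3} harmful={harmful} "
        f"precision={precision(tp, fp):.3f} recall={recall(tp, fn):.3f}"
    )
```

With 32 good training samples, recall is 50% (6 of 12 harmful changesets caught) at roughly 3% precision. With 640 good samples, precision rises to 15% but recall drops to 25%. Adding more good changesets therefore reduces false positives at the cost of missing more harmful changesets.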

bkowshik commented 7 years ago

Workflow

It is OK for the classifier to get a changeset wrong when the problem is something it is not trained on. Ex: Feature name modification
For the others, ask the question: what attribute, if added, would give this context to the classifier?