Closed by bkowshik 7 years ago
NOTE: Changesets that satisfy the following rules will be used in the first iteration:

- `highway` tag among other tags

In the dataset, the following satisfy the above conditions:

- 2,294 labelled changesets to use for training and validation
- 6,352 unlabelled samples to use for testing

Traditionally, I have been using 💯 of the datasets. This has meant using all the good changesets, which gives a ratio of 10:1 good to harmful changesets. Because detecting harmful changesets is the priority, we could throttle the number of good changesets in the training sample. I could see the following trend with a simple `DecisionTreeClassifier`:
| Training good | Training harmful | True positive | False positive | True negative | False negative |
|---|---|---|---|---|---|
| 32 | 32 | 6 | 186 | 491 | 6 |
| 64 | 32 | 6 | 157 | 520 | 6 |
| 160 | 32 | 3 | 56 | 621 | 9 |
| 320 | 32 | 3 | 31 | 646 | 9 |
| 640 | 32 | 3 | 17 | 660 | 9 |
NOTE: True positives are changesets both labelled and predicted harmful
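The throttling experiment above can be sketched in plain Python. This is a minimal sketch under assumptions: changesets are represented as dicts with a `harmful` key (0 = good, 1 = harmful), and the helper names are illustrative, not gabbar's actual API. In practice the throttled sample would then be featurized and fed to scikit-learn's `DecisionTreeClassifier`.

```python
import random


def throttle_good(changesets, n_good, seed=42):
    """Keep all harmful changesets but only a random sample of good ones.

    Assumes each changeset is a dict with a `harmful` key (0 or 1);
    this representation is an illustration, not gabbar's schema.
    """
    good = [c for c in changesets if c["harmful"] == 0]
    harmful = [c for c in changesets if c["harmful"] == 1]
    random.Random(seed).shuffle(good)
    return good[:n_good] + harmful


def confusion_counts(labels, predictions):
    """Counts matching the table above: positive = harmful (1)."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    tn = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 0)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    return tp, fp, tn, fn
```

With scikit-learn, the throttled sample would then go through `DecisionTreeClassifier().fit(X, y)`, and `confusion_counts` applied to the validation predictions would reproduce the four columns of the table.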
Examples of problematic edits:

- Changes to `highway=footway` are problematic. So, we need good geometry attributes.
- `name` of a highway is modified (`old_name` tag)
- chinakz: `highway=tertiary` to `highway=footway`
- `tourism=attraction` to a `highway=traffic_signals`
- `landuse=forest` to a highway feature is problematic
- `oneway=yes` -> `oneway=no` is problematic
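The tag-change patterns above can be sketched as simple rules over the old and new tags of an edited feature. This is a hedged sketch: the `(old_tags, new_tags)` dict representation and the function name are assumptions for illustration, not gabbar's actual interface.

```python
# Hypothetical rule checks for the problematic tag changes listed above.
# An edit is represented as two dicts, old_tags and new_tags; this
# representation is an assumption for illustration.

def is_suspicious(old_tags, new_tags):
    # name of a highway is modified
    if "highway" in old_tags and old_tags.get("name") != new_tags.get("name"):
        return True
    # highway=tertiary downgraded to highway=footway
    if old_tags.get("highway") == "tertiary" and new_tags.get("highway") == "footway":
        return True
    # tourism=attraction changed to highway=traffic_signals
    if old_tags.get("tourism") == "attraction" and new_tags.get("highway") == "traffic_signals":
        return True
    # landuse=forest changed to a highway feature
    if old_tags.get("landuse") == "forest" and "highway" in new_tags:
        return True
    # oneway=yes flipped to oneway=no
    if old_tags.get("oneway") == "yes" and new_tags.get("oneway") == "no":
        return True
    return False
```

Rules like these could label obviously problematic edits, complementing the learned classifier described below.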
Ref: https://github.com/mapbox/gabbar/issues/69
In the field of Natural Language Processing (NLP), the Bag of Words technique is a popular one: text is represented as a bag of its words, disregarding grammar and even word order but keeping multiplicity. Along the same lines is the concept of a **Bag of Tags**: all property tags from all samples in the training dataset form the Bag of Tags.

NOTE: `harmful=0` represents a good changeset and `harmful=1` a problematic changeset.

We collect all tags from changesets labelled with a :thumbsdown: and `OneHotEncode` them. Then, we use these as attributes to train a classifier to learn and predict whether changesets are good or problematic based on the occurrence of tags.

cc: @anandthakker @geohacker @batpad
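The Bag of Tags encoding described above can be sketched in plain Python (in practice scikit-learn's `OneHotEncoder` or `DictVectorizer` would do this step). The helper names and the `{"tags": {...}}` changeset representation are assumptions for illustration.

```python
def build_tag_vocabulary(changesets):
    """Collect every key=value tag pair seen across the training changesets.

    Each changeset is assumed to carry a dict of tags under "tags";
    this schema is illustrative, not gabbar's actual one.
    """
    return sorted({f"{k}={v}" for c in changesets for k, v in c["tags"].items()})


def one_hot(changeset, vocabulary):
    """Encode a changeset as a 0/1 vector over the tag vocabulary."""
    present = {f"{k}={v}" for k, v in changeset["tags"].items()}
    return [1 if tag in present else 0 for tag in vocabulary]
```

The resulting 0/1 vectors would serve as the attribute matrix for the classifier, with the `harmful` labels as the target.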