mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning
MIT License

Prototyping Gabbar for highway features #69

Open bkowshik opened 7 years ago

bkowshik commented 7 years ago

One of the popular problems in machine learning is dogs vs cats: given a picture, predict whether it shows a dog or a cat. Coming from this early experience with machine learning, I kept thinking that classifying changesets as good or problematic is a similar problem. But today I did an exercise where I tried to identify a single attribute of a changeset that makes it good or problematic. I started with:

[Image: Screen Shot 2017-06-16 at 9.15.25 AM]

The following questions came to mind

From https://wiki.openstreetmap.org/wiki/Key:highway

highway=unclassified: The least important through roads in a country's system – i.e. minor roads of a lower classification than tertiary, but which serve a purpose other than access to properties. Often link villages and hamlets.

highway=residential: Roads which serve as an access to housing, without function of connecting settlements.

From https://osmlab.github.io/osm-deep-history/#/way/103217436

[Image: Screen Shot 2017-06-16 at 9.19.59 AM]

Looking deeper into other changesets where highway=residential gets modified to highway=unclassified, I found a user, Порфирий, who has lots of changesets with the same behavior. Interestingly, the user who originally added highway=residential is Порфирий too.

[Image: Screen Shot 2017-06-16 at 9.30.27 AM]

Eureka!

When a highway modification has so many questions to answer and attributes to look at, what will the scale be when we look at all 26 primary tags together? What about features that don't have any primary tags? Too many questions! Too many attributes! Right?


cc: @anandthakker @geohacker @batpad

bkowshik commented 7 years ago

In the dataset I had locally, I found 36 changesets where highway=residential got modified to highway=unclassified. I 👀 a couple of these changesets.
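
A minimal sketch of how such a search could look, assuming each modification is available as a dict holding the user plus the old and new tags of the feature. The record layout and the names `is_tag_transition` and `transitions_by_user` are hypothetical, not part of Gabbar.

```python
from collections import Counter


def is_tag_transition(old_tags, new_tags, key, old_value, new_value):
    """True when `key` changed from `old_value` to `new_value` between versions."""
    return old_tags.get(key) == old_value and new_tags.get(key) == new_value


def transitions_by_user(modifications, key="highway",
                        old_value="residential", new_value="unclassified"):
    """Count matching modifications per user.

    `modifications` is an iterable of dicts with the (assumed) keys
    "user", "old_tags" and "new_tags".
    """
    return Counter(
        m["user"]
        for m in modifications
        if is_tag_transition(m["old_tags"], m["new_tags"], key, old_value, new_value)
    )
```

Sorting the resulting counter with `most_common()` would surface users who repeat the same retagging pattern.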

Notes

[Image: Screen Shot 2017-06-16 at 12.45.29 PM] [Image: Screen Shot 2017-06-16 at 12.52.20 PM] [Image: Screen Shot 2017-06-16 at 12.56.32 PM]

bkowshik commented 7 years ago

Attributes by action

There are 3 action types for a highway feature

  1. A new highway is created
  2. An existing highway is modified. Property and/or geometry modification
  3. An existing highway is deleted

Some attributes depend on the action type. For example, the difference in highway length only exists for a modification; a newly created highway has no previous version to compare against. Next, which attributes are relevant when a highway is deleted? I am 🤔: won't a length_difference column be redundant for a newly created highway?

I am not sure how to solve this problem and would love to hear ideas. For a start, I am planning to add just the attributes in the latest version of the model, along with the action (create, modify or delete). Let's see how this goes. If these attributes are not sufficient, we could add other diff attributes like the difference in highway length, the distance between the centroids, etc.
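
One possible way to handle action-dependent attributes is sketched below. It assumes each feature version is a dict with `tags` and planar `coords`, takes the most recent available version as the source of the base attributes, and leaves `length_difference` as `None` when there is only one version. The function names and the record layout are hypothetical.

```python
from math import hypot


def line_length(coords):
    """Planar length of a polyline given as [(x, y), ...]."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def highway_attributes(old, new):
    """Build one row of attributes from the old and new versions of a highway.

    `old` is None for a create and `new` is None for a delete (assumed layout).
    """
    if old is None and new is None:
        raise ValueError("at least one version of the feature is required")
    if old is None:
        action = "create"
    elif new is None:
        action = "delete"
    else:
        action = "modify"

    # Assumption: base attributes come from the most recent version we have.
    latest = new if new is not None else old
    row = {
        "action": action,
        "highway": latest["tags"].get("highway"),
        "length": line_length(latest["coords"]),
        # Only defined when two versions exist; None marks "not applicable".
        "length_difference": None,
    }
    if action == "modify":
        row["length_difference"] = line_length(new["coords"]) - line_length(old["coords"])
    return row
```

Keeping the column but marking it as missing for creates and deletes is only one option; training a separate model per action type would avoid the missing values altogether.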

bkowshik commented 7 years ago

Very early results, 2 out of the 6 predicted in the sample are interesting.

[Image: Screen Shot 2017-06-18 at 12.19.01 AM] [Image: Screen Shot 2017-06-18 at 12.19.21 AM]

bkowshik commented 7 years ago

Highway classifier v1

Dataset

Model

What did the model learn?

[Image: Screen Shot 2017-06-23 at 6.27.13 PM]

How are the model metrics?

In previous runs, I trained the model on the training dataset and measured metrics on the validation dataset. But because of the narrow scope of the problem, we have relatively few samples, so a single validation split gives noisy metrics. Thus, I went the route of cross-validation.
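
The idea behind cross-validation can be sketched in plain Python as below. This is an illustration only: the helper names are hypothetical, and the "model" is a stand-in that always predicts the majority class, not the classifier used in this thread. In practice, `sklearn.model_selection.cross_val_score` does the same job.

```python
from collections import Counter
from statistics import mean, pstdev


def k_fold_splits(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds; each sample is tested once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_scores(X, y, fit, score, k=5):
    """Fit on k-1 folds, score on the held-out fold, and return the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return scores


def fit_majority(X_train, y_train):
    """Stand-in model: remember the most common training label."""
    return Counter(y_train).most_common(1)[0][0]


def accuracy(model, X_test, y_test):
    """Share of held-out labels equal to the stand-in model's constant prediction."""
    return sum(1 for label in y_test if label == model) / len(y_test)


if __name__ == "__main__":
    X = list(range(10))
    y = [0] * 8 + [1] * 2
    scores = cross_val_scores(X, y, fit_majority, accuracy, k=5)
    print(mean(scores), pstdev(scores))
```

With few harmful samples, contiguous folds can end up with no harmful samples at all, so a stratified split (for example `StratifiedKFold` in scikit-learn) is usually the safer choice.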

Results

From the unlabelled testing dataset, 6 out of 344 samples were predicted to be problematic. The results are interesting indeed.

[Image: Screen Shot 2017-06-23 at 5.46.04 PM] [Image: Screen Shot 2017-06-23 at 5.39.42 PM]

bkowshik commented 7 years ago

I experimented with scaling features using sklearn.preprocessing.StandardScaler

Without feature scaling

After feature scaling

Feature scaling does seem to have a small impact. Even though the mean scores come down, the standard deviations are down as well.
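
For reference, the transformation that `StandardScaler` applies to each column is sketched below in plain Python: subtract the column mean and divide by the population standard deviation. The function name `standardize` is hypothetical, and this is an illustration of the formula rather than the library's implementation.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale one numeric column to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


if __name__ == "__main__":
    lengths = [120.0, 80.0, 400.0, 40.0]  # made-up highway lengths
    print(standardize(lengths))
```

When combined with cross-validation, the mean and standard deviation should be computed on the training folds only and then reused on the held-out fold, otherwise the scores are slightly optimistic.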

bkowshik commented 7 years ago

460 of the 2732 samples (17%) had a modification to the name, which includes name additions, modifications and deletions. 22 of the 77 harmful changesets (28.57%) were name modifications. I added an attribute called feature_name_modified to see if that helps. The model ranked feature_name_modified 5th in the importance list.

[Image: Screen Shot 2017-06-28 at 3.57.37 PM]

The model metrics did not show a significant variation.
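
A minimal sketch of how an attribute like feature_name_modified could be derived, assuming the old and new tags of the feature are available as dicts. The exact definition used in the classifier is not shown in this thread, so treat this as one plausible reading.

```python
def feature_name_modified(old_tags, new_tags):
    """Return 1 if the name tag was added, changed or removed, else 0."""
    return int(old_tags.get("name") != new_tags.get("name"))


if __name__ == "__main__":
    print(feature_name_modified({"highway": "residential"},
                                {"highway": "residential", "name": "Main Street"}))
```

Comparing with `dict.get` treats a missing name as `None`, so additions and deletions are counted alongside edits.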

bkowshik commented 7 years ago

Error analysis

False negatives (14)

[Image: Screen Shot 2017-06-30 at 9.40.21 AM]

Feature is not good because of personal information in the name tag

True positives (43)

[Image: Screen Shot 2017-06-30 at 10.05.52 AM]

Harmful change when a highway feature becomes something else
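
As a rough worked example, the two counts above are enough to compute recall on the harmful class. The sketch assumes the 14 false negatives and 43 true positives come from the same evaluation run; the false positive count is not given here, so precision cannot be derived.

```python
def recall(true_positives, false_negatives):
    """Share of actually harmful samples that the classifier caught."""
    return true_positives / (true_positives + false_negatives)


if __name__ == "__main__":
    # Counts taken from the error analysis above.
    print(round(recall(43, 14), 3))
```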

bkowshik commented 7 years ago

The following gist has a random sample of 25 predictions from the first version of the highway classifier. The csv has both the changeset_id and feature_id.

@krishnanammala can you 👀 these changesets on osmcha and give me some feedback?


cc: @planemad @batpad

krishnanammala commented 7 years ago

As per the comment https://github.com/mapbox/gabbar/issues/69#issuecomment-312801138 above, I have gone through the changesets flagged by Gabbar (highway classifier). Here are my observations:

Both harmful changesets are deletions of turn:lanes & lanes tags, and both are from the same user.

I have outlined the detections more clearly by separating them into good detections and detections with less priority, so that @bkowshik gets more context for improvements.

| Good detections | Detections with less priority |
| --- | --- |
| Deletion of area tags to highways | Geometry of highways changing |
| Junction=roundabout tag deleted | Highways with rest_areas & traffic signals which are less priority |
| Classification of highways (higher -> lower) i.e., residential to unclassified | Addition of layer tags to minor highways i.e., service roads |
| Addition of turn:lanes | Addition and modification of low classification highways i.e., tracks, paths, service roads |

Hope the above observations will help you @bkowshik 👍