bkowshik commented 7 years ago

One of the popular problems in machine learning is dogs vs cats: given a picture, predict whether it is of a dog or a cat. Coming from that initial experience with machine learning, I kept thinking that classifying changesets as good or problematic was a similar problem. But today I did an exercise where I tried to identify one attribute of a changeset that makes it good or problematic. I started with:

https://osmcha.mapbox.com/49563062/
highway=residential is modified to highway=unclassified

The following questions came to mind

What could be the source of knowledge to modify?
Isn't residential better than unclassified; I mean something is better than nothing right?
At version 15, this is quite a mature feature. So, is that alright?
What is the length of the highway; smaller should be residential and longer unclassified?
Why is source=google maps? Really?

From https://wiki.openstreetmap.org/wiki/Key:highway

highway=unclassified

The least important through roads in a country's system – i.e. minor roads of a lower classification than tertiary, but which serve a purpose other than access to properties. Often link villages and hamlets.

highway=residential

Roads which serve as an access to housing, without function of connecting settlements.

From https://osmlab.github.io/osm-deep-history/#/way/103217436

The feature has mostly been highway=unclassified since creation in 2011.

Looking deeper into other changesets where a highway=residential gets modified into highway=unclassified, I found a user, Порфирий, who has lots of changesets with the same behavior. Interestingly, the user who added highway=residential is Порфирий too.

https://www.openstreetmap.org/user/Порфирий/history

Eureka!

When a highway modification has so many questions to answer and attributes to look at, what will the scale be when we look at all 26 primary tags together? What about features that don't have any primary tags? Too many questions! Too many attributes! Right?

This does not look like a traditional cats vs dogs problem. It is a little something else.
How about we try something different? How about we build one machine learning model for each object type?
How would it look if there were a model trained on highways to classify whether a new/modified highway is a :thumbsup: or a :thumbsdown:?
Another trained on buildings, another on water bodies, etc., each knowing what a good and a problematic feature of its type looks like?
Is this it?
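A minimal sketch of the one-model-per-object-type idea, assuming each model is just a callable that takes a feature's tags. The `PRIMARY_KEYS` list, the `ModelRegistry` class and the toy highway rule are hypothetical names for illustration, not part of Gabbar.

```python
# Hypothetical subset of primary tag keys to route on, in priority order.
PRIMARY_KEYS = ["highway", "building", "natural", "waterway"]


def primary_key(tags):
    """Return the first primary tag key found on the feature, or None."""
    for key in PRIMARY_KEYS:
        if key in tags:
            return key
    return None


class ModelRegistry:
    """Maps a primary tag key to a classifier callable for that object type."""

    def __init__(self):
        self._models = {}

    def register(self, key, model):
        self._models[key] = model

    def predict(self, tags):
        model = self._models.get(primary_key(tags))
        if model is None:
            # No primary tag, or no model trained for this object type yet.
            return "unknown"
        return model(tags)


def toy_highway_model(tags):
    """Stand-in rule for illustration only; a real model would be trained."""
    bad = tags.get("highway") == "footway" and tags.get("area") == "yes"
    return "problematic" if bad else "good"


registry = ModelRegistry()
registry.register("highway", toy_highway_model)
```

Features without a primary tag fall through to `"unknown"`, which keeps that open question visible instead of forcing them into one of the per-type models.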

cc: @anandthakker @geohacker @batpad

bkowshik commented 7 years ago

In the dataset I had locally, I found 36 changesets where highway=residential got modified to highway=unclassified. I 👀 a couple of these changesets.

https://gist.github.com/bkowshik/90e703ffd087c787636ad87eaa04c231
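A search like this could be sketched as below, assuming each record carries the old and new tags of the modified feature. The record layout (`changeset_id`, `old_tags`, `new_tags`) and the sample ids are made up for illustration; the local dataset's real schema is not shown in this thread.

```python
def tag_transitions(changes, key, old_value, new_value):
    """Return changeset ids where `key` moved from old_value to new_value."""
    matches = []
    for change in changes:
        # Newly created features have no old tags; treat that as an empty dict.
        old_tags = change.get("old_tags") or {}
        new_tags = change.get("new_tags") or {}
        if old_tags.get(key) == old_value and new_tags.get(key) == new_value:
            matches.append(change["changeset_id"])
    return matches


# Made-up records, not real changesets.
sample = [
    {"changeset_id": 1,
     "old_tags": {"highway": "residential"},
     "new_tags": {"highway": "unclassified"}},
    {"changeset_id": 2,
     "old_tags": {"highway": "residential"},
     "new_tags": {"highway": "residential", "name": "A"}},
    {"changeset_id": 3,
     "old_tags": None,
     "new_tags": {"highway": "unclassified"}},
]

found = tag_transitions(sample, "highway", "residential", "unclassified")
```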

Notes

https://osmcha.mapbox.com/47392777/

https://osmcha.mapbox.com/48346176/
Unsure if this is unclassified or residential

https://osmcha.mapbox.com/48388526/
This should be a residential highway right?
Especially with the changeset comment "Add city roads"?

bkowshik commented 7 years ago

Attributes by action

There are 3 action types for a highway feature

A new highway is created
An existing highway is modified. Property and/or geometry modification
An existing highway is deleted

Some attributes depend on the action type. For example, the difference in highway length applies only to modifications; a newly created highway does not have two versions to compare. Next, which attributes are relevant when a highway is deleted? I am 🤔 won't a length_difference column be redundant for a newly created highway?

I am not sure how to solve this problem and would love to hear ideas. But for a start, I am planning to add just the attributes in the latest version of the model along with the action (create, modify or delete). Let's see how this goes. If these attributes are not sufficient, we could add other diff attributes like difference in highway length, distance between the centroids, etc.
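One possible way to lay out such a row, as a sketch under assumptions: the action is one-hot encoded, and diff attributes such as `length_difference` are left as `None` when there are no two versions to compare. The `build_row` helper, the `length_m` field and the choice to fall back to the old version on delete are all hypothetical.

```python
ACTIONS = ("create", "modify", "delete")


def build_row(action, old_version, new_version):
    """Build one training row; each version is a dict with 'tags' and 'length_m'."""
    if action not in ACTIONS:
        raise ValueError("unknown action: " + action)
    # One-hot encode the action so the model can condition on it.
    row = {"action_" + name: int(name == action) for name in ACTIONS}
    # A deleted highway has no new version, so fall back to the last one that existed.
    latest = old_version if action == "delete" else new_version
    row["highway"] = latest["tags"].get("highway")
    row["length_m"] = latest["length_m"]
    # Diff attributes only make sense when two versions exist.
    if action == "modify":
        row["length_difference"] = new_version["length_m"] - old_version["length_m"]
    else:
        row["length_difference"] = None
    return row


old = {"tags": {"highway": "residential"}, "length_m": 100.0}
new = {"tags": {"highway": "unclassified"}, "length_m": 130.0}
modified = build_row("modify", old, new)
created = build_row("create", None, new)
deleted = build_row("delete", old, None)
```

Whether a missing diff should be `None`, `0`, or a separate indicator column depends on the model; tree-based models often tolerate a sentinel value, but that is a choice to validate, not a given.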

bkowshik commented 7 years ago

Very early results, 2 out of the 6 predicted in the sample are interesting.

https://osmcha.mapbox.com/48452572/
highway=residential goes inside a park

https://osmcha.mapbox.com/48299333/
Unusual rectangular shape of the highway

bkowshik commented 7 years ago

Highway classifier v1

Jupyter notebook with model training and testing

Dataset

Labelled samples: 2,732
Changesets labelled good: 2,655
Changesets labelled harmful: 77

Model

What did the model learn?

Table lists 10 attributes that the model thinks are the most important.

How are the model metrics?

With previous runs, I trained the model on the training dataset and measured metrics on the validation dataset. But, because of the narrow scope of the problem, we have samples on the lower side. Thus, I went the route of Cross Validation.
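With only 77 harmful samples out of 2,732, how the folds are cut matters. The sketch below splits indices into folds while spreading each class evenly. The thread does not say which cross-validation scheme or number of folds was used, so the stratification, `k=5` and the `stratified_folds` helper are assumptions.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Same class counts as the labelled dataset: 1 = harmful, 0 = good.
labels = [1] * 77 + [0] * 2655
folds = stratified_folds(labels, k=5)
harmful_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold ends up with 15 or 16 harmful samples, so every validation split has positives to measure precision and recall on.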

Precision: 10% (Fraction of changesets predicted harmful that are actually harmful)
Recall: 20% (Fraction of harmful changesets predicted harmful)
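For reference, both metrics can be computed from confusion counts as sketched below. The counts are illustrative only, chosen to give the same ratios as above; they are not the model's actual confusion matrix.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for the harmful class."""
    predicted_harmful = true_positives + false_positives
    actually_harmful = true_positives + false_negatives
    precision = true_positives / predicted_harmful if predicted_harmful else 0.0
    recall = true_positives / actually_harmful if actually_harmful else 0.0
    return precision, recall


# Illustrative counts, not taken from the notebook.
precision, recall = precision_recall(
    true_positives=2, false_positives=18, false_negatives=8
)
```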

Results

From the unlabelled testing dataset, 6 out of 344 were predicted to be problematic. The results are interesting indeed.

Model is learning that a highway=footway and area=yes don't exist together! :tada:

A demolished highway. Did not know something like that existed.

bkowshik commented 7 years ago

I experimented with scaling features using sklearn.preprocessing.StandardScaler

Without feature scaling

Precision on all samples: 0.037 (0.068)
Recall on all samples: 0.07 (0.131)

After feature scaling

Precision on all samples: 0.034 (0.048)
Recall on all samples: 0.052 (0.064)

Feature scaling does seem to have a small impact. Even though the mean scores come down, the standard deviations (in parentheses) are down as well.
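As a reminder of what standard scaling does to a single column, here is a plain-Python sketch: subtract the mean and divide by the population standard deviation, which is what `StandardScaler` computes per feature. The `standardize` helper and the sample lengths are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Made-up highway lengths in metres.
lengths = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(lengths)
```

Tree-based models are generally insensitive to this kind of monotonic rescaling, so a small effect would not be surprising if the classifier is of that family; the thread does not say which model was used.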

bkowshik commented 7 years ago

460 out of the total 2732 (17%) samples had a modification in name, which includes name additions, modifications and deletions. 22 of the 77 (28.57%) harmful changesets were name modifications. I added an attribute called feature_name_modified to see if that helps. The model put the feature_name_modified at the 5th position in the importance list.
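A sketch of how such a flag could be derived, assuming the old and new tags of the feature are available as dicts. The real implementation of feature_name_modified is not shown in this thread; in particular, treating a new feature created with a name as a name addition is an assumption.

```python
def feature_name_modified(old_tags, new_tags):
    """Return 1 if the name tag was added, changed or removed, else 0."""
    old_name = (old_tags or {}).get("name")
    new_name = (new_tags or {}).get("name")
    return int(old_name != new_name)


# Shares quoted above, recomputed from the counts.
share_all = 460 / 2732    # name modifications among all samples
share_harmful = 22 / 77   # name modifications among harmful samples
```

Name modifications are more common among harmful samples than in the dataset overall, which is the motivation for exposing the flag to the model.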

The model metrics did not show a significant variation.

Precision on all samples: 0.058 (0.113)
Recall on all samples: 0.054 (0.092)

bkowshik commented 7 years ago

Error analysis

False negatives (14)

Harmful due to geometry: 3
Harmful due to feature name: 9
Fixme was removed: 1
highway=footway: 1

Feature is not good because of personal information in the name tag

True positives (43)

Highway classification modified: 18
Harmful due to feature name: 1
Some other feature made into a highway: 1
Highway made into some other feature, e.g. river: 13
Harmful due to geometry: 1
Some property of the highway is modified, e.g. oneway: 4

Harmful change when a highway feature becomes something else

bkowshik commented 7 years ago

The following gist has a random sample of 25 predictions from the first version of the highway classifier. The csv has both the changeset_id and feature_id.

@krishnanammala can you 👀 these changesets on osmcha and give me some feedback?

https://gist.github.com/bkowshik/16f1dc675d9a01e92cef6cee2569a2b9

cc: @planemad @batpad

krishnanammala commented 7 years ago

As per the comment https://github.com/mapbox/gabbar/issues/69#issuecomment-312801138 above, I have gone through the changesets flagged by Gabbar (highway classifier). Here are my observations:

Total number of changesets reviewed: 26
No. of changesets found harmful: 2

Both harmful changesets are deletions of turn:lanes & lanes tags, and both are from the same user.

I have outlined the detections more clearly, separating them into good detections and detections with less priority, so that @bkowshik gets more context for improvements.

| Good detections | Detections with less priority |
| --- | --- |
| Deletion of area tags to highways | Geometry of highways changing |
| `Junction=roundabout` tag deleted | Highways with `rest_areas` & `traffic signals`, which are less priority |
| Classification of highways (higher -> lower), i.e., residential to unclassified | Addition of `layer` tags to minor highways, i.e., service roads |
| Addition of `turn:lanes` | Addition and modification of low classification highways, i.e., tracks, paths, service roads |

Hope the above observations help you, @bkowshik 👍

mapbox / gabbar

Prototyping Gabbar for highway features #69
