mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning
MIT License

Using reverted changesets for model training #66

Open bkowshik opened 7 years ago

bkowshik commented 7 years ago

Per text with @batpad,

Changeset comment has revert

There are a total of 13,125 changesets on osmcha with revert in the changeset comment. Interestingly, 2,505 of them (about 19%) are one-feature modification changesets, which is what we use in the latest version of Gabbar.

Assuming mappers revert a problematic or wrong feature in these one-feature modification changesets, this could be an additional dataset for the current iteration of Gabbar's feature-level classifier. I manually :eyes: a couple of these changesets and they are definitely what we want to catch with Gabbar.

[Screenshots: 2017-06-15, 7:20 pm and 7:23 pm]
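The filter described above could be sketched roughly as follows. This is only an illustration: the record fields (`comment`, `created`, `modified`, `deleted`) are hypothetical stand-ins and not the actual osmcha schema, and the sample records are made up.

```python
def is_single_feature_revert(changeset):
    """True if the comment mentions 'revert' and the changeset
    modifies exactly one feature (no creations, no deletions).

    The field names are assumptions made for this sketch.
    """
    comment = (changeset.get("comment") or "").lower()
    return (
        "revert" in comment
        and changeset.get("modified", 0) == 1
        and changeset.get("created", 0) == 0
        and changeset.get("deleted", 0) == 0
    )


# Made-up records, only to show the shape of the input.
changesets = [
    {"id": 1, "comment": "Revert vandalism", "created": 0, "modified": 1, "deleted": 0},
    {"id": 2, "comment": "Added shops", "created": 0, "modified": 1, "deleted": 0},
    {"id": 3, "comment": "reverting bad import", "created": 0, "modified": 5, "deleted": 2},
    {"id": 4, "comment": None, "created": 0, "modified": 1, "deleted": 0},
]

candidates = [c["id"] for c in changesets if is_single_feature_revert(c)]
```

A plain substring match on the comment will also pick up comments such as "not a revert", so the manual review still matters.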

Changesets from revert user accounts

Mappers and DWG sometimes maintain a separate account for reverts. Changesets from these accounts will be interesting to look at as well. Ex:

[Screenshot: 2017-06-15, 7:27 pm]

cc: @anandthakker @geohacker

bkowshik commented 7 years ago

Found 2,604 changesets whose GeoJSON version is available in real changesets. Assuming every feature in a changeset with revert in its comment corrects a harmful change, I collect the changeset ID of the previous version of each feature in these changesets. Ex:

[Screenshot: 2017-06-21, 11:37 am]

Following this workflow, I get a list of 14,062 unique changesets. Ideally, each of these had a problematic feature that was later reverted. The next step is to see what percentage of them are recent (say, from 2017) and have a real changesets version, so that we can use them as part of the training/validation dataset in Gabbar.
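A minimal sketch of the lookup described above, assuming each reverting changeset carries a list of features with an `id` and a `version`, and that a `history` mapping from `(feature_id, version)` to changeset ID is available. Both structures are hypothetical and only illustrate the idea; they are not the real changesets format.

```python
def previous_changeset_ids(reverting_changesets, history):
    """Collect the unique changeset IDs that produced the version
    immediately before each feature touched by a reverting changeset.

    `history` maps (feature_id, version) -> changeset_id and is an
    assumed structure for this sketch.
    """
    previous = set()
    for changeset in reverting_changesets:
        for feature in changeset["features"]:
            key = (feature["id"], feature["version"] - 1)
            # Features at version 1 have no earlier version to look up.
            if key in history:
                previous.add(history[key])
    return previous


# Made-up data: two reverts touching three feature versions.
history = {
    ("node/10", 1): 500,
    ("node/10", 2): 700,
    ("way/20", 4): 700,
}
reverts = [
    {"id": 900, "features": [{"id": "node/10", "version": 3}]},
    {"id": 901, "features": [{"id": "way/20", "version": 5},
                             {"id": "node/30", "version": 1}]},
]

suspects = previous_changeset_ids(reverts, history)
```

Using a set keeps the result unique even when several reverted features point back to the same earlier changeset.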

@manoharuss @krishnanammala, need your help here. Can you randomly :eyes: about 100 changesets from this list to see what percentage of them are problematic? This will help us understand what to expect and whether this can be used as a training dataset in Gabbar.
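One way to draw the review sample and summarise the result, as a rough sketch. The helper names, the fixed seed, and the boolean review labels are assumptions made for illustration.

```python
import random


def sample_for_review(changeset_ids, k=100, seed=42):
    """Draw up to k changeset IDs without replacement.

    Sorting first and using a seeded generator makes the sample
    reproducible, so reviewers can regenerate the same list.
    """
    ids = sorted(changeset_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def problematic_share(labels):
    """Fraction of reviewed changesets marked problematic.

    `labels` maps changeset_id -> True (problematic) or False.
    """
    if not labels:
        return 0.0
    return sum(1 for flagged in labels.values() if flagged) / len(labels)


# Made-up IDs standing in for the real list.
to_review = sample_for_review(range(1, 14063), k=100)
```

With only 100 reviewed changesets, the resulting percentage is a rough estimate and not a precise rate.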

Next actions


cc: @planemad

planemad commented 7 years ago

@bkowshik any changeset reverted by an experienced editor (>100 edits) we can safely say was definitely a bad one. Let's use our time more wisely and review only those that were reverted by an inexperienced user (<20 edits); this is where we might find some false negatives.
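The triage rule suggested here might look like the sketch below. The thresholds come from the comment above; the function name, the labels, and the handling of the 20 to 100 range (which the comment does not address) are assumptions.

```python
def triage_revert(reverter_edit_count):
    """Decide how to treat a reverted changeset based on how many
    edits the reverting user has made."""
    if reverter_edit_count > 100:
        # Experienced reverter: treat the reverted changeset as bad.
        return "trusted"
    if reverter_edit_count < 20:
        # Inexperienced reverter: send for manual review.
        return "review"
    # Not covered by the comment; left undecided in this sketch.
    return "unspecified"


# Made-up edit counts for three reverting users.
decisions = [triage_revert(n) for n in (5, 50, 5000)]
```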

Other highly valuable questions to answer here:

cc @maning @batpad

bkowshik commented 7 years ago

Thank you @planemad, that was super helpful!

Reverting changesets

The CSV with 21 reverting changesets by new users is at the link below:

Yes, there is a correlation between the experience of the user and number of reverting changesets. Reverting changesets are way more likely from experienced users than new users.

What is more interesting is that user_mapping_days has a stronger correlation with the number of reverting changesets, at 0.6, than user_changesets, at 0.3. So the user's mapping days are the stronger indicator.
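For reference, a correlation like the ones quoted above can be computed as a Pearson coefficient. The thread does not say which method or tool was used, so this is only a sketch under that assumption, and the per-user numbers below are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up per-user values, only to show the call shape.
user_mapping_days = [3, 40, 120, 400, 900]
user_changesets = [5, 900, 300, 2500, 1200]
reverting_changesets = [0, 1, 2, 6, 9]

r_days = pearson(user_mapping_days, reverting_changesets)
r_changesets = pearson(user_changesets, reverting_changesets)
```

Comparing `r_days` and `r_changesets` on the real data is what would reproduce the 0.6 versus 0.3 comparison.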

bkowshik commented 7 years ago

Reverted changesets

I couldn't resist finding out whose changesets were getting reverted - the other side of the story.

The number of a user's changesets getting reverted comes down as the user makes more changesets and gains more mapping experience.

As expected, user mapping days is negatively correlated, at -0.3. Thus, the higher a user's mapping days, the less likely their changesets are to be reverted.

bkowshik commented 7 years ago

Per https://github.com/mapbox/gabbar/issues/66#issuecomment-310029426

There are 21 reverting changesets by users with fewer than 20 changesets. @manoharuss @krishnanammala can you please 👀 these and post notes about what percentage of these 21 are actually problematic?


cc: @planemad