Ignore hierarchy of `neighbourhood` layer

orangejulius commented 6 years ago

After reading @NickStallman's analysis of admin areas in Australia, I wanted to do some investigation of my own.

Background

In https://github.com/pelias/wof-admin-lookup/pull/143 we pushed a big performance update to our admin lookup process. Basically, we perform point-in-polygon (PIP) for the "lowest" layer (neighbourhood), and if there is a match, and that WOF record has a valid hierarchy, we trust that hierarchy and avoid further PIP calculations.

Previously, we ignored the hierarchy and tested all other layers (locality, county, state, etc). This old approach had poor performance. It did a lot more work; in particular, every lat/lon required point-in-polygon calculations against the often very large country polygons, which became a bottleneck.
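To make the difference concrete, here is a rough sketch of the two strategies. It is not the actual wof-admin-lookup code: the `pip` callable, the record shape, and the layer list are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical record shape: {"name": str, "hierarchy": {layer: name}}
Record = Dict[str, Any]
# Hypothetical PIP function: returns the record containing the point, or None
PipFn = Callable[[str, float, float], Optional[Record]]

# Illustrative layer list, ordered from "lowest" to "highest"
LAYERS = ["neighbourhood", "locality", "county", "region", "country"]


def lookup_all_layers(pip: PipFn, lat: float, lon: float) -> Dict[str, str]:
    """Old approach: one point-in-polygon test per layer."""
    result = {}
    for layer in LAYERS:
        record = pip(layer, lat, lon)
        if record is not None:
            result[layer] = record["name"]
    return result


def lookup_trust_hierarchy(pip: PipFn, lat: float, lon: float) -> Dict[str, str]:
    """Newer approach: a single PIP test when the neighbourhood has a hierarchy."""
    hood = pip("neighbourhood", lat, lon)
    if hood is not None and hood.get("hierarchy"):
        # Trust the parents recorded on the neighbourhood; skip all other layers
        return {"neighbourhood": hood["name"], **hood["hierarchy"]}
    # No usable neighbourhood: fall back to testing every layer
    return lookup_all_layers(pip, lat, lon)
```

The second function never touches the large country polygons when a neighbourhood with a hierarchy is found, which is where the speedup comes from. It is also why any error in that hierarchy goes straight into the result.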

We knew WOF hierarchies could be incorrect, but accepted that in exchange for the performance improvement, as things generally seemed OK.

Testing

The question I wanted to test: are the hierarchies of WOF neighbourhood records as correct for a given lat/lon as the hierarchies of locality records?

I loaded a Pelias point-in-polygon service with data for Australia and queried every single lat/lon from the G-NAF address dataset (through the OpenAddresses au/countrywide.csv file). Twice, actually.

Here's the code

Thanks to @burritojustice I was able to load 1 million of those points into a tile layer with HERE XYZ.

Here's what it looks like for Melbourne and Sydney:

Map link

The comparison is more or less a simple string comparison between the G-NAF city name and the locality name from WOF.

The color codes are:

green: Exact match for both the locality attached to the WOF neighbourhood at that point and the locality at that point

blue: The neighbourhood at that point has an incorrect locality (or no locality), but querying directly against the locality geometry returns the correct answer

red: Both neighbourhood and locality are wrong (or at least not a simple text match)

yellow: The locality geometry is inherently wrong, but somehow the neighbourhood hierarchy has it correct
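The four colors amount to a two-by-two classification, sketched below. The function and parameter names are made up for illustration, and the case-insensitive, whitespace-trimmed comparison is an assumption about what "more or less simple" means; the linked code may normalise names differently.

```python
from typing import Optional


def normalise(name: Optional[str]) -> str:
    """Assumed normalisation: trim whitespace and ignore case."""
    return name.strip().lower() if name else ""


def classify(gnaf_city: str,
             hierarchy_locality: Optional[str],
             direct_locality: Optional[str]) -> str:
    """Return the map color for one G-NAF point.

    hierarchy_locality: locality taken from the neighbourhood's WOF hierarchy
    direct_locality: locality found by querying the locality layer directly
    """
    expected = normalise(gnaf_city)
    hierarchy_ok = bool(expected) and normalise(hierarchy_locality) == expected
    direct_ok = bool(expected) and normalise(direct_locality) == expected

    if hierarchy_ok and direct_ok:
        return "green"
    if direct_ok:
        return "blue"    # hierarchy wrong or missing, locality geometry right
    if hierarchy_ok:
        return "yellow"  # locality geometry wrong, hierarchy right
    return "red"         # neither is a simple text match
```

With this kind of comparison, near misses such as a differently suffixed name are counted as red even though they are arguably the same place.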

Analysis

It would be interesting to overlay WOF geometries on this map, because it's clear there are patterns. It's also clear that "trusting" the neighbourhood's WOF hierarchy leads to lots of incorrect results.

We have known this for a while, in issues such as #156.

Going forward

In total, 9,663,424 (67%) of points tested have a matching locality in the WOF neighbourhood hierarchy. 12,394,471 (86%) of points tested have a matching locality when the locality layer is queried directly.

This suggests that we can mostly, but not completely, fix admin area issues in Australia by checking the locality layer, even if we found a neighbourhood result. This would keep most of the performance gains, but give much better accuracy.
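As a sketch of what that could look like (not necessarily what was eventually implemented), the lookup keeps the neighbourhood hierarchy for the higher layers but always runs one extra PIP test against the locality layer. The `pip` callable and record shape are hypothetical, and the fallback for points with no neighbourhood at all is left out.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical record shape: {"name": str, "hierarchy": {layer: name}}
Record = Dict[str, Any]
# Hypothetical PIP function: returns the record containing the point, or None
PipFn = Callable[[str, float, float], Optional[Record]]


def lookup_with_locality_check(pip: PipFn, lat: float, lon: float) -> Dict[str, str]:
    """Trust the neighbourhood hierarchy, except for the locality layer."""
    result: Dict[str, str] = {}

    hood = pip("neighbourhood", lat, lon)
    if hood is not None:
        # Start from the parents recorded on the neighbourhood
        result.update(hood.get("hierarchy") or {})
        result["neighbourhood"] = hood["name"]

    # Always test the locality geometry directly; it overrides the hierarchy
    locality = pip("locality", lat, lon)
    if locality is not None:
        result["locality"] = locality["name"]

    return result
```

This costs two PIP tests per point instead of one, but still avoids the large county, state, and country polygons. Whether the higher layers should then come from the locality's own hierarchy rather than the neighbourhood's is a separate question this sketch does not address.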

NickStallman commented 5 years ago

Great work! I'm actually a little surprised there are yellow/red non-matches so I've taken a bit more of a look at them.

Neighbourhood match - 9,696,791 (67%)

Locality match - 12,433,816 (86%)

No PIP results at all for neighbourhood/locality - 1,674,278 (11.5%)

It appears that WOF has some large holes in the data where neighbourhoods and localities are missing entirely. After doing some spot checks they appear to be mostly rural areas. If localadmin is also considered when no neighbourhood or locality is found then it will match a chunk of these. I tried geocoding some of these and it worked fine so it might be possible to ignore a lot of this category as it doesn't affect geocoding.

Completely different string - 331,495 (2.2%)

Some of the non-matching strings are "Avalon" vs "Avalon Beach" so they are relatively minor differences. Others are plain wrong.

It looks like we can potentially raise the geocoding accuracy to ~95% by fixing the neighbourhood issue for Australia.

orangejulius commented 5 years ago

This is done as of #246 and Australian results are looking quite good since.
