pelias / wof-admin-lookup

Who's on First Admin Lookup for the Pelias Geocoder
https://pelias.io
MIT License

Enhance postal cities resilience against erroneous data #288

Closed orangejulius closed 4 years ago

orangejulius commented 4 years ago

As it stands now, the postal cities dataset can cause records to have invalid admin hierarchy if even a single record in OSM has an incorrect mapping from postal code to locality.

A good example of this is postal code 11215 in Brooklyn, NY, which currently shows up as part of Geneseo, NY, several hundred miles away.

/v1/search?text=111+8th+Avenue%2C+Brooklyn%2C+Geneseo%2C+NY%2C+USA

[screenshot: search result placing 111 8th Avenue, Brooklyn in Geneseo, NY]

It turns out there is a single incorrect record in OSM with postal code 11215:

[screenshot: the single OSM record in Geneseo tagged with postal code 11215]

This is enough to introduce an incorrect mapping.

Possible solutions

I'm sure there are many things we can do here, and we might end up implementing several of them.

missinglink commented 4 years ago

IIRC there is a total number of occurrences in OSM preserved in our data for this purpose.

missinglink commented 4 years ago

Do we know how many people mapped this as 11215 = Geneseo?

orangejulius commented 4 years ago

Right, there is only one occurrence. Here are the relevant lines from the USA.tsv data file:

```
11215   421205765   Brooklyn        borough 130
11215   85978297    Geneseo     locality    1
11215   85977539    New York    NYC locality    1
```
missinglink commented 4 years ago

Yeah, a lone wolf, we should probably only load data for occurrences > x.

Where x is 10? Or 5?
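Something like this, maybe (a rough sketch only; the column layout is guessed from the USA.tsv excerpt above, where the occurrence count is the last field, and `MIN_OCCURRENCES` stands in for the x in question):

```js
// Sketch: only load postal city mappings confirmed at least x times.
// Column layout assumed from the USA.tsv excerpt above; the
// occurrence count is the last tab-separated field.
const fs = require('fs');

const MIN_OCCURRENCES = 5; // the 'x' under discussion

function loadPostalCities(path) {
  return fs.readFileSync(path, 'utf8')
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => line.split('\t'))
    .filter(cols => parseInt(cols[cols.length - 1], 10) >= MIN_OCCURRENCES);
}
```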

orangejulius commented 4 years ago

Yeah, that seems like a good approach. Out of the 39585-line USA.tsv file, here's the breakdown of occurrence frequency:

```
awk -F '\t' '{ print $6 }' USA.tsv | sort -n | uniq -c | head -n 20
  11895 1
   4413 2
   2740 3
   2178 4
   1773 5
   1519 6
   1189 7
   1144 8
    782 9
    741 10
    596 11
    603 12
    459 13
    430 14
    363 15
    340 16
    273 17
    275 18
    212 19
    212 20
```

So for example, there are 11895 mappings with only one occurrence.
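To make the trade-off concrete, here's a quick back-of-the-envelope script (totals taken from the breakdown above) showing how many rows each candidate threshold would discard, assuming we keep rows with count >= x:

```js
// Rough arithmetic on the frequency breakdown above: how many of the
// 39585 rows would a minimum-occurrence threshold of x throw away?
const breakdown = [[1, 11895], [2, 4413], [3, 2740], [4, 2178]];
const TOTAL = 39585;

for (const x of [2, 5]) {
  const dropped = breakdown
    .filter(([count]) => count < x)
    .reduce((sum, [, rows]) => sum + rows, 0);
  console.log(`x=${x}: drops ${dropped} rows (${(100 * dropped / TOTAL).toFixed(1)}%)`);
}
// x=2: drops 11895 rows (30.0%)
// x=5: drops 21226 rows (53.6%)
```

So even x=2 already loses almost a third of the file, and x=5 loses over half.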

missinglink commented 4 years ago

Maybe we should make the lastline data an npm module so we don't need to copy the files here on every rebuild?

Just mentioning that because right now we're only using a small fraction of the lastline dataset.

orangejulius commented 4 years ago

Just wanted to write down some thoughts on different cases we might want to handle when dealing with errors in OSM data we use to derive postal cities data.

Multiple frequently seen values

When two different mappings of a postal code to a city each have a reasonably high number of confirmations, we want to keep them both. The more popular one, which will hopefully be the correct one, should be used for display, but either way, allowing searches on both to succeed is key.

A real-world example of this is seen for Louisville, KY:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 40047 | 85947523 | Louisville | locality | 14 |
| 40047 | 85946765 | Mount Washington | locality | 13 |
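In code, the shape of that rule is roughly this (hypothetical names, not our actual implementation):

```js
// Sketch: index every confirmed mapping for matching, but use the
// most frequently confirmed one as the display name.
function pickNames(mappings) {
  // mappings: [{ name, layer, count }, ...] for a single postal code
  const sorted = [...mappings].sort((a, b) => b.count - a.count);
  return {
    display: sorted[0].name,             // 'Louisville' (count 14)
    searchable: sorted.map(m => m.name)  // ['Louisville', 'Mount Washington']
  };
}
```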

Single unambiguous interpretation

If there's just one mapping from a postal code to a city, we probably want to keep it. As mentioned above there are quite a few of these, so we'd be throwing away 11895 of the 39585 mappings, roughly 30%, if we ignored this data.

Here's a real-world example of a correct zip code mapping that only has a single occurrence in OSM:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 48099 | 85951983 | Troy | locality | 1 |

Multiple interpretations with outlier(s)

In the case where there are multiple interpretations and one or more of them are common, but there are outliers that are uncommon, we probably want to ignore the outliers. Another example from above:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 11215 | 421205765 | Brooklyn | borough | 130 |
| 11215 | 85978297 | Geneseo | locality | 1 |
| 11215 | 85977539 | New York | locality | 1 |

In this case, Brooklyn is the correct value. New York is technically incorrect, and Geneseo is completely wrong.
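Putting the three cases together, one possible shape for the heuristic (the ratio is entirely made up, just to anchor the discussion) would be:

```js
// Sketch of a combined heuristic for the three cases above.
// OUTLIER_RATIO is a placeholder, not a tuned value.
const OUTLIER_RATIO = 0.05; // drop rows seen <5% as often as the leader

function filterMappings(mappings) {
  // Single unambiguous interpretation: keep it as-is.
  if (mappings.length === 1) return mappings;

  // Multiple interpretations: keep anything reasonably close to the
  // most frequent value. 'Mount Washington' (13 vs 14) survives,
  // 'Geneseo' and 'New York' (1 vs 130) do not.
  const max = Math.max(...mappings.map(m => m.count));
  return mappings.filter(m => m.count >= max * OUTLIER_RATIO);
}
```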

Summary

A strategy for handling all these would be a little more complicated than something as simple as "ignore rows with fewer than X occurrences", but it would be very valuable. Anyone have thoughts on the parameters and strategy we would want to use?

orangejulius commented 4 years ago

I looked into this more, and I believe our logic for determining the best postal cities match is correct. We do prefer the most frequent value as the display name, and while erroneous data will occasionally make its way in, overall I think our current logic does a good job.

There's one exception, which is when looking at boroughs! I don't think we should replace a locality value after replacing a borough value. Looking at the table of values for US zip code 11215:

```
11215   421205765   Brooklyn        borough 130
11215   85978297    Geneseo     locality    1
11215   85977539    New York    NYC locality    1
```

Both the second and third rows are actually incorrect. The official postal city value for the zip code is the borough of Brooklyn, not the city of New York. That explains why there is only one instance of each value. Our current code is not very resilient against this, because it always looks to replace both the borough and the locality on a record if it can.
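In sketch form, what we need is a one-way guard, roughly like this (made-up function and field names, not the actual diff):

```js
// Sketch of the guard: once postal cities data has replaced the
// borough, leave the locality alone.
function applyPostalCities(record, boroughMatch, localityMatch) {
  let boroughReplaced = false;

  if (boroughMatch) {
    record.parent.borough = boroughMatch.name; // e.g. 'Brooklyn' (count 130)
    boroughReplaced = true;
  }

  // Don't also overwrite the locality when the borough was already
  // replaced; that's what let 'Geneseo' sneak in for 11215.
  if (localityMatch && !boroughReplaced) {
    record.parent.locality = localityMatch.name;
  }
}
```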

#297 implements logic to avoid changing the locality value with postal cities data if the borough value was already changed, and ensures this invalid data is no longer an issue. It definitely resolves the particular problem in Brooklyn with zip code 11215 that caused us to open this issue, and I think once it's merged, it will mean our existing logic is resilient enough that we don't need to change anything right now. :)