pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License

Deduplicate Geonames 'City of' prefixes #1609

Closed orangejulius closed 2 years ago

orangejulius commented 2 years ago

A common cause of missed deduplication is Geonames locality/localadmin records that start with 'City of'.

Our name comparison logic is fairly conservative: it only ignores differences in things like punctuation and diacritics. Otherwise, we have to assume that different names mean the underlying records represent genuinely different places.

Getting too far away from this general stance could be dangerous, but we can handle specific exceptions just fine.

Geonames records that start with 'City of' are one of these cases. Often there is a Geonames locality record with just the name (like 'New York') and a Geonames localadmin record with the 'City of' prefix. Usually only one of those records has a WOF concordance, so this is still helpful even combined with https://github.com/pelias/api/pull/1606
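To make the idea concrete, here is a minimal sketch of a conservative name comparison with a 'City of' exception. The function names (`normalizeName`, `stripCityOfPrefix`, `namesMatch`) are hypothetical and this is not the code from this PR or from pelias/api; it only illustrates the approach described above.

```javascript
// Illustrative sketch only: all helper names here are hypothetical.

// Conservative normalization: ignore case, diacritics and punctuation.
function normalizeName(name) {
  return name
    .normalize('NFD')                     // split accented letters into base + combining mark
    .replace(/[\u0300-\u036f]/g, '')      // drop the combining marks (diacritics)
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ')    // treat punctuation as whitespace
    .replace(/\s+/g, ' ')                 // collapse runs of whitespace
    .trim();
}

// The specific exception: a leading 'City of' is not significant.
function stripCityOfPrefix(normalized) {
  return normalized.replace(/^city of\s+/, '');
}

// Two names match if they are equal after normalization and prefix removal.
function namesMatch(a, b) {
  return stripCityOfPrefix(normalizeName(a)) === stripCityOfPrefix(normalizeName(b));
}
```

With this sketch, `namesMatch('City of New York', 'New York')` is true, while `namesMatch('New York', 'New Jersey')` stays false, so the general stance of treating different names as different places is preserved.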

missinglink commented 2 years ago

FYI there is some similar logic and IIRC tests too here https://github.com/pelias/placeholder/blob/master/lib/analysis.js#L87

orangejulius commented 2 years ago

Ah very nice. That logic is quite a bit simpler so I'll bring it into this PR.

I think it's OK to deduplicate across all of those differences in name: things like county and locality will (generally) not be deduped anyway, because they are in different layers, unless they hit one of the exceptions, such as one being a parent of the other.
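As a rough sketch of why the layer check keeps looser name matching safe: the record shape (`id`, `name`, `layer`, `parentIds`) and the function names below are assumptions made for illustration, and the real deduplication rules in pelias/api have more exceptions than this.

```javascript
// Illustrative sketch only: the record shape { id, name, layer, parentIds }
// and all function names are hypothetical simplifications.

// True if record a is listed among record b's parents.
function isParentOf(a, b) {
  return b.parentIds.includes(a.id);
}

// Records in different layers are only dedup candidates when one is the
// parent of the other (one example of an exception).
function layersAllowDedup(a, b) {
  if (a.layer === b.layer) {
    return true;
  }
  return isParentOf(a, b) || isParentOf(b, a);
}

// namesEqual is whatever name comparison is in use, passed in as a function.
function isDuplicate(a, b, namesEqual) {
  return layersAllowDedup(a, b) && namesEqual(a.name, b.name);
}
```

Under these assumptions, a county and an unrelated locality with the same name are kept apart by the layer check, no matter how permissive the name comparison is.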

orangejulius commented 2 years ago

I just realized this PR basically re-implements https://github.com/pelias/api/pull/1371. The two solve the same problem, in almost exactly the same way.

#1371 is a bit more sophisticated, so I'm actually tempted to merge that one.

orangejulius commented 2 years ago

Closing in favor of #1371