pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License

Dedupe Geonames records with WOF concordances #1606

Closed orangejulius closed 2 years ago

orangejulius commented 2 years ago

This PR implements deduplication between WOF and Geonames when there's a Geonames concordance ID on the WOF record.

These concordances mean we can be fairly certain that the two records are the same, and that we don't even have to look at the name or any other properties.

This should be able to replace https://github.com/pelias/api/pull/1580, since the concordance method will work even for localities. (There is special logic for locality/localadmin parent IDs for Geonames records from https://github.com/pelias/geonames/pull/93 that made https://github.com/pelias/api/pull/1580 less effective than it should be.)
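
A minimal sketch of the concordance check described above, not the code from this PR. The record shape is an assumption made for illustration: each record has a `source` and `source_id`, and the WOF record carries its Geonames concordance under `addendum.concordances['gn:id']`. The helper names `geonamesConcordanceId` and `isConcordanceMatch` are hypothetical.

```javascript
// Assumed record shape (hypothetical, for illustration only):
// { source: 'whosonfirst', source_id: '85633793',
//   addendum: { concordances: { 'gn:id': 6252001 } } }

// Return the Geonames ID listed in a record's concordances as a string,
// or null when the record has no such concordance.
function geonamesConcordanceId(record) {
  const concordances = record && record.addendum && record.addendum.concordances;
  if (!concordances) { return null; }

  const id = concordances['gn:id'];
  if (id === undefined || id === null) { return null; }

  return String(id);
}

// True when one record is from WOF, the other is from Geonames, and the
// WOF record's Geonames concordance equals the Geonames record's source_id.
// Names and other properties are never compared.
function isConcordanceMatch(a, b) {
  const wof = [a, b].find((r) => r.source === 'whosonfirst');
  const gn = [a, b].find((r) => r.source === 'geonames');
  if (!wof || !gn) { return false; }

  const gnId = geonamesConcordanceId(wof);
  if (gnId === null) { return false; }

  // Compare as strings so a numeric concordance matches a string source_id.
  return gnId === String(gn.source_id);
}

const wofRecord = {
  source: 'whosonfirst',
  source_id: '85633793',
  addendum: { concordances: { 'gn:id': 6252001 } }
};
const gnRecord = { source: 'geonames', source_id: '6252001' };

console.log(isConcordanceMatch(wofRecord, gnRecord)); // true
```

Each check returns as soon as it fails, so a pair without a usable concordance falls through to whatever name-based or property-based comparison would otherwise run.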

missinglink commented 2 years ago

Had a shower thought about this, it might be easier & more generic to do it like this:

At a later stage we could add some logic to virally merge concordances to make them work across 3+ degree connections but that's probably out-of-scope for now.
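
One way to picture the "viral" merge is as a union-find over concordance identifiers: if A is concordant with B and B with C, all three land in one group even though A never names C directly. The sketch below is only an illustration of that idea under assumptions. The identifier strings (`wof:1`, `gn:2`, `wd:Q3`) and the function name `groupByConcordance` are hypothetical, and nothing like this is part of the PR.

```javascript
// Group identifiers that are connected through any chain of concordances.
// `pairs` is a list of [idA, idB] tuples, each meaning "idA is concordant with idB".
function groupByConcordance(pairs) {
  const parent = new Map();

  // Find the representative of x's group, compressing the path on the way back.
  const find = (x) => {
    if (!parent.has(x)) { parent.set(x, x); }
    const p = parent.get(x);
    if (p === x) { return x; }
    const root = find(p);
    parent.set(x, root);
    return root;
  };

  // Merge the groups containing a and b.
  const union = (a, b) => {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA !== rootB) { parent.set(rootA, rootB); }
  };

  pairs.forEach(([a, b]) => union(a, b));

  // Collect members under their final representative.
  const groups = new Map();
  Array.from(parent.keys()).forEach((id) => {
    const root = find(id);
    if (!groups.has(root)) { groups.set(root, []); }
    groups.get(root).push(id);
  });

  return Array.from(groups.values());
}

// 'wof:1' and 'wd:Q3' end up in the same group via 'gn:2'.
const concordanceGroups = groupByConcordance([
  ['wof:1', 'gn:2'],
  ['gn:2', 'wd:Q3'],
  ['osm:9', 'wd:Q8']
]);
console.log(concordanceGroups.length); // 2
```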

orangejulius commented 2 years ago

I've finished this off with a test, use of the pelias-model codec, and a different code style. It now avoids any use of let or typeof. The structure is also simpler and less deeply nested, since all the checks now bail out early.

I like the strategy you've described for how we might generalize concordance checks in the future. It's something to keep in mind for sure. I don't think it really solves a problem we see today: the only possible deduplication between GN/OSM would be venues, and so few of them have matching Wikidata concordances that it might not be worth it.

It also looks like Geonames records do not themselves have Wikidata concordances. So we would have to look up the data from Wikidata. For example, see La Sagrada Familia in OSM, Geonames, and Wikidata.