pelias / wof-admin-lookup

Who's on First Admin Lookup for the Pelias Geocoder
https://pelias.io
MIT License

ensure parent endonyms exist for all countries and megacities #314

Closed missinglink closed 1 year ago

missinglink commented 1 year ago

This PR attempts to resolve a long-standing issue in Pelias where parent properties can only be specified in English (or in the 'default language').

For example, querying for a country directly works fine: you can query for Germany, Deutschland or Allemagne to find Germany. The search logic usually targets the 'default language' and the target language of the User-Agent.

The issue arises when the country name is used in support of another query: for example, 10 Torstraße Germany works as expected, but 10 Torstraße Deutschland fails.

This is really not ideal since it's very English-centric. In this German example it's particularly odd that the official language of the country isn't supported but English is.

The reason for this dates back to the original schema design in ~2014, where the parent properties weren't modelled with multiple languages in mind the way the name.* fields were, so it's been tricky to fix.

Coupled with that was the design of the PIP service and this repo, wof-admin-lookup. The service is designed in such a way that it only ever loads and serves a single name for a place; changing this interface would be a breaking change that I don't have the bandwidth to tackle at the moment.

This PR provides some relief to the situation by providing dictionaries of Endonyms for countries and mega cities which will optionally be added as aliases to every record (under a pelias/config flag).
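As a rough illustration of the idea (not the code from this PR), adding endonym aliases behind a flag could look something like the sketch below. The dictionary shape, the `enableEndonyms` flag name, the record layout and the `addEndonymAliases` helper are all assumptions made up for this example.

```javascript
// Hypothetical endonym dictionary keyed by a place id.
// The id and its contents are illustrative, not taken from the PR.
const ENDONYMS = {
  'example-id-germany': ['Deutschland'],
};

// Hypothetical stand-in for the pelias/config flag mentioned above.
const config = { enableEndonyms: true };

// Return a copy of the parent entry with endonym aliases added.
// The primary name is left untouched and is never duplicated as an alias.
function addEndonymAliases(parent, dictionary, enabled) {
  if (!enabled) { return parent; }
  const names = new Set(parent.aliases || []);
  (dictionary[parent.id] || []).forEach((alias) => names.add(alias));
  names.delete(parent.name);
  return { ...parent, aliases: [...names] };
}

const country = { id: 'example-id-germany', name: 'Germany' };
const withAliases = addEndonymAliases(country, ENDONYMS, config.enableEndonyms);
const unchanged = addEndonymAliases(country, ENDONYMS, false);
```

With the flag on, the parent keeps Germany as its single primary name and gains Deutschland as an alias; with the flag off, the record is returned as-is.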

It's not clear at this stage what effect adding multiple aliases to half a billion records will have on the size of the index, performance and query quality, so for now I've pared it down to just countries and megacities.

In the future, depending on the success of this PR we can expand to cover Exonyms (likely only a subset of languages), however it may be preferable to reconsider the schema design at that point rather than clump all languages in the same field.


how it works:

missinglink commented 1 year ago

Couple of open questions:

missinglink commented 1 year ago

Enabling this feature for openstreetmap and openaddresses (the vast majority of records in the index) resulted in a modest ~1% increase in the elasticsearch snapshot size:

Screenshot 2022-09-14 at 14 59 44

missinglink commented 1 year ago

This PR seems to be effective in resolving the issue and comes with negligible additional disk requirements (~1%):

Screenshot 2022-09-15 at 14 20 56

I'm happy to merge this; ideally we can pair it with a PR to the acceptance-tests repo to cover this feature.

missinglink commented 1 year ago

I spent some more time testing this today. It works great, but there's another class of problem I hadn't considered, which can be resolved with the same method.

What I didn't realize is that the inverse of this issue is also a problem: WOF sometimes uses the endonym as the primary label rather than the English name, which I had assumed was always used as a matter of policy.

So, for example, I expected to find Köln with a wof:name of Cologne (i.e. in English), as is the case with Germany, but this isn't universally true.

The issue in Pelias (autocomplete) is that you can find a record with "Domkloster 4 Köln" but not "Domkloster 4 Cologne", the inverse of the issue mentioned above.

The fix is very simple; I had actually already written the code but left it commented out: if (k === 'name:eng_x_preferred') { return true; }. This line means that the English name is always added as an alias.

The new commit https://github.com/pelias/wof-admin-lookup/pull/314/commits/1387a7552e05153dadcee598d5388bc520c0bbd5 shows the changes this line makes to the dictionaries.
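To make the effect of that line concrete, here is a minimal sketch of a key filter built around it. Only the `name:eng_x_preferred` check comes from the comment above; the surrounding `selectAliasKeys` function, the use of `wof:lang_x_official` to decide what counts as an endonym, and the sample record are assumptions for illustration and may differ from what the PR actually does.

```javascript
// Sketch only: choose which WOF name keys contribute aliases.
// The selection logic around the English check is assumed, not taken from the PR.
function selectAliasKeys(properties) {
  const official = properties['wof:lang_x_official'] || [];
  return Object.keys(properties).filter((k) => {
    if (!k.startsWith('name:') || !k.endsWith('_x_preferred')) { return false; }
    // Always keep the English name, so 'Cologne' matches when wof:name is 'Köln'.
    if (k === 'name:eng_x_preferred') { return true; }
    // Otherwise keep only names in an official language of the place (endonyms).
    const lang = k.slice('name:'.length, -'_x_preferred'.length);
    return official.includes(lang);
  });
}

// Illustrative record, not real WOF data.
const koeln = {
  'wof:name': 'Köln',
  'wof:lang_x_official': ['deu'],
  'name:deu_x_preferred': ['Köln'],
  'name:eng_x_preferred': ['Cologne'],
  'name:fra_x_preferred': ['Cologne'],
};

const keys = selectAliasKeys(koeln);

// Collect the alias values, skipping anything equal to the primary label.
const aliases = [...new Set(keys.flatMap((k) => koeln[k]))]
  .filter((name) => name !== koeln['wof:name']);
```

In this sketch the German and English keys are kept and the French one is dropped, so a record labelled Köln picks up Cologne as an alias.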

I'll re-run the build and test again to ensure it's ready to merge

Joxit commented 1 year ago

Hi there,

This PR reminds me of another one I did a few years ago, https://github.com/pelias/whosonfirst/pull/492, where I added all exonyms to WOF documents. The result was a bit disappointing for a world build:

| | original | PR |
| --- | --- | --- |
| Size | 3,2G | 48G |
| Time | 41m6,438s | 5h36m20,755s |

Endonyms seem to be a good first step anyway :+1:

related: https://github.com/pelias/api/issues/1296

missinglink commented 1 year ago

This looks good: it adds ~1% to the disk requirements and possibly some additional build time.

Since this is behind a feature flag and the test cases pass, I'm happy to squash-and-merge this.

There's still some opportunity to extend this PR in the future, since I know there are things other developers might want to add.