pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License
27 stars 43 forks source link

[i18n] Add all hierarchy names #492

Closed Joxit closed 3 years ago

Joxit commented 4 years ago

Background

With pelias/api#1300 I added some cool feature for i18n autocomplete. But there are some limitations, such as the search of Parijs, Frankrijk, due to hierarchy index limitation. Since we have pelias/whosonfirst#446 and the support for multi-lang names property in OSM, we can search for both OSM venues/addresses and WOF by their non-default/English name.

This means that if we do an autocomplete of Parijs (with lang=nl and pelias/api#1300) we will get Paris :+1: but if you do an autocomplete of Parijs, Frankrijk we will get no results :-1:.

Why ?

Because the API will search a document with a name property Parijs AND in its hierarchy (country, locality...) Frankrijk. But in our ES index, we only have the default name in document hierarchy. So Frankrijk is not in the Parijs hierarchy... Only France is present.

How to get rid of this ?

We do not have many solutions... I found only 2.

  1. We add alternative names in our hierarchy to takes advantage of ES fuzzy search when we autocomplete things.
  2. We do 2 requests, my example is for the search of Pont Neuf, Parijs, Frankrijk:
    1. The first one is searching ES admin information, something like source=wof AND (name.lang='Parijs' or name.lang='Frankrijk') and take their ids.
    2. The second one is searching in ES Pont Neuf which is in the hierarchy found lately, something like name.lang='Pont Neuf' AND (country.id IN ('whosonfirst:country:85633147', 'whosonfirst:locality:101751119') OR locality IN ('whosonfirst:country:85633147', 'whosonfirst:locality:101751119'))

In this PR I tried the first solution and here are my results for France:

All names in parent.placetype

First try with all names in parent.placetype without duplicates.

master PR
Index Size 185M 1.1G
Index Time 86s 228s

This first test is not the right one in my opinion. I will try some alternatives :smile:

related: pelias/api#1296 related: pelias/api#1300

Joxit commented 3 years ago

Ok, I did some test yesterday with a full WOF only planet import, this is not convincing

master PR
Size 3,2G 48G
Time 41m6,438s 336m20,755s

I did another test today, save multi-lang names in a field ${placetype}_names but this causes an error

[400] type=illegal_argument_exception, reason=Limit of total fields [1000] has been exceeded

Defining too many fields in an index can lead to a mapping explosion, which can cause out of memory errors and difficult situations to recover from.

Consider a situation where every new document inserted introduces new fields, such as with dynamic mapping. Each new field is added to the index mapping, which can become a problem as the mapping grows.

From elastic documentation

Basically, this is not a good idea, so I will close this PR and find another solution

orangejulius commented 3 years ago

Ah yeah, adding this many new names probably won't work. There are also issues with field length that start happening with aliases (https://github.com/pelias/pelias/issues/862).

HOWEVER! It might be interesting for you to give it a shot on Elasticsearch 7.10: it has both faster indexing and more space efficient indices. It probably won't make much of a difference but I'm sure happy to be surprised. :)

Joxit commented 3 years ago

Yes you're right :disappointed:

I did my tests with Elasticsearch 7.10 :sweat_smile: