pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License
27 stars 43 forks source link

Ensure the placetype order when an import is done #471

Closed Joxit closed 4 years ago

Joxit commented 4 years ago

:wave: I did some awesome work for the Pelias project and would love for everyone to have a look at it and provide feedback.


Here's the reason for this change :rocket:

Some WOF data hierarchies were broken due to the import order and how resolveHierarchy work. If you have a record B which has in its hierarchy A and C. Then C will be missing because not yet imported.

This issue is linked to autocomplete only, because search use the hierarchy from placeholder and not ES.

For example, Tavel in France, when I import all France database, the hierarchy is :

   {
    "id": "101772677",
    "gid": "whosonfirst:locality:101772677",
    "layer": "locality",
    "source": "whosonfirst",
    "source_id": "101772677",
    "name": "Tavel",
    "accuracy": "centroid",
    "country": "France",
    "country_gid": "whosonfirst:country:85633147",
    "country_a": "FRA",
    "region": "Gard",
    "region_gid": "whosonfirst:region:85683431",
    "region_a": "GA",
    "locality": "Tavel",
    "locality_gid": "whosonfirst:locality:101772677",
    "label": "Tavel, France"
  }

Here's what actually got changed :clap:

I added an order by clause in the SQL statement (optional). It's use only when we do an import in the readStream, that meas wof admin lookup will not be impacted.

Now we have the correct result.

  {
    "id": "101772677",
    "gid": "whosonfirst:locality:101772677",
    "layer": "locality",
    "source": "whosonfirst",
    "source_id": "101772677",
    "name": "Tavel",
    "accuracy": "centroid",
    "country": "France",
    "country_gid": "whosonfirst:country:85633147",
    "country_a": "FRA",
    "macroregion": "Occitania",
    "macroregion_gid": "whosonfirst:macroregion:1108826387",
    "region": "Gard",
    "region_gid": "whosonfirst:region:85683431",
    "region_a": "GA",
    "macrocounty": "Nimes",
    "macrocounty_gid": "whosonfirst:macrocounty:404227861",
    "county": "Roquemaure",
    "county_gid": "whosonfirst:county:102068587",
    "localadmin": "Tavel",
    "localadmin_gid": "whosonfirst:localadmin:404414453",
    "locality": "Tavel",
    "locality_gid": "whosonfirst:locality:101772677",
    "label": "Tavel, France"
  }

Here's how others can test the changes :eyes:

There are some tests, If you remove the option in readStream you will see the issue. This will also slow down import time... :/ The import time for France has increased from 1m 25s to 2m 28s with this PR... :/

Joxit commented 4 years ago

I remove the ORDER BY query, this does not fit on world database. Instead I use the combined stream and filter on placetypes.

Joxit commented 4 years ago

Oh yes, and with the new change, it does not decrease import time on France imports and works better on world imports ! :tada:

orangejulius commented 4 years ago

Oops, we forgot about this PR which looks very nice! We will be doing a round of work with SQLite WOF import soon, and will take a look at this! Thanks for making this PR :)

orangejulius commented 4 years ago

Looks great, and this does indeed fix a bunch of issues when importing WOF data.

I made one tiny change to move the logger output from info level to debug, but otherwise this is great! thanks!