pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License
27 stars 43 forks source link

Do not import both `whosonfirst-data-latest.db` and `whosonfirst-data-admin-$$-latest.db` #469

Closed slvlirnoff closed 4 years ago

slvlirnoff commented 4 years ago

From this page it seems that https://dist.whosonfirst.org/sqlite/ whosonfirst-data-latest.db hasn't been updated since May while many whosonfirst-data-admin-$$-latest.db have been updated several times.

Is there some information within whosonfirst-data-latest.db that is not within all of whosonfirst-data-admin-$$-latest.db?

From https://github.com/whosonfirst-data/whosonfirst-data repository README:

Disclaimer As of May 2019, the whosonfirst-data repository has split into per-country repositories. You can read more about that change here. While we still track all issues in this repository, the data itself will live in the per-country repositories for the foreseeable future.

Per-country repositories have the following repository naming convention:

whosonfirst-data-admin-{2-char country code}

With pelias/docker and for the whole planet using sqlite: true the scripts imports both data-latest and the various data-admin per country (using inventory.json).

This seems to cause an issue for Switzerland for instance: many layers have been fixed recently (localities/localadmin https://github.com/whosonfirst-data/whosonfirst-data-admin-ch/pull/4), since the importation order of the file is:

The old records are replacing the updated ones since they mostly share the same whosonfirst ids.

Related: https://github.com/pelias/whosonfirst/issues/437 Also probably related: https://github.com/pelias/pelias/issues/811

Best, Cyprien

orangejulius commented 4 years ago

Hi @slvlirnoff, Thanks for diving in and investigating. There's definitely a bug in the Pelias code here. The Pelias whosonfirst importer code predates the existence of the country-specific WOF sqlite databases, and the intent of the code was that only the global database would be downloaded.

So one fix would be to fix the bug and ensure only the global sqlite db is ever downloaded. I know the WOF team is working on a process to produce up to date global sqlite db, so it will be up to date again soon. This approach would probably be the easiest and require changing the least amount of Pelias code.

The other approach would be to start using the country sqlite dbs, since they are more up to date and many people have requested a way to import data for only specific countries without having to download the whole planet.

We should probably make the easy fix i mentioned first, to get things working as expected, before embarking on the longer term job of incorporating the country specific dbs.

NickStallman commented 4 years ago

My 2c would be for option 2 since it is cleaner and more versatile and doesn't rely on WOF to complete their planet db files again.

Looks like bundleList.js:getDBList simply needs to filter out the "latest" file, and an additional filter could be added there to specify specific country codes from the config file. Then the filters can simply be added to the download script as well.

missinglink commented 4 years ago

:+1: I would also love it if we could support the new country-specific sqlite database (and also the planet database) indifferently.

Julian and I have a full calendar of work up till the end of the year but I'd be happy to support the work (through review and discussion) if someone else were to author it.

orangejulius commented 4 years ago

Actually yes, I take it back. Moving to support the country DBs might be better. I didn't realize data from them is actually imported.

We would have to do some work for the PIP service, since it's currently hardcoded to use the global sqlite DB. On the other hand, the spatial service is on the horizon.

orangejulius commented 4 years ago

Okay, I've been taking a look at SQLite based builds so that we can ensure Pelias works out of the box with up to date WOF data, avoiding issues such as those in https://github.com/pelias/docker/issues/141.

At this point I'm pretty convinced we need to do both of the following things:

If anyone has any thoughts on this, let me know. Looking especially to hear from @missinglink and @Joxit

missinglink commented 4 years ago

It's no secret that I'm a big SQLite fan 😄 so I would like to:

This will clean up a lot of the code which has been in place since the inception of the WOF project and during its early development. In fact, this is nothing new, we've been actively working towards this goal for over a year now!

Great let's do it!

warning: this is one of my ranty comments, so maybe fetch a ☕️ if you'd like to read on?

What's the backstory?

The original intention was to continue using the full planet file (the only one available at the time) and switch over to imports.whosonfirst.sqlite=true as a fairly painless transition for everyone.

During this time the WOF team introduced country-specific SQLite distributions in order to address emails from Github which forced them to reduce the size of the main whosonfirst-data repo.

At this time there were large changes required by the WOF team to split the data into multiple repos, a bunch of changes were made to the bundling code which unfortunately resulted in errors and omissions still being investigated and resolved today.

What's the current state of play?

We've been chatting with the WOF team throughout and I've highlighted some issues with the per-country bundles.

The planet-wide bundle also hasn't been updated since 2019-05-06 and so doesn't reflect the current state of the GitHub sources.

There are two different full-planet SQLite databases kicking around for testing, one produced by me by merging all the per-country repos together and one by Stephen Epps which is produced directly from the filesystem without any merging.

My one isn't usable due to the 'per-country' repo bug linked above and @orangejulius has noticed some issues with the one Stephen produced (Julian could you please add more info so I can confirm these?).

Either way, there is currently no planet-wide SQLite database available from https://dist.whosonfirst.org which contains data since May 2019. That one also contains some errors (particularly AU state abbreviation issues) which have since been resolved.

So what can we do about it?

Honestly, not a lot right now.

For Geocode Earth we are running older, trusted data and applying patches where necessary.

I imagine others are doing this too and it's really not anyone's first choice as we'd like to be pulling in all the changes from the last 6 months rather than having to manually curate our own 'fork' of the canonical WOF data.

If the SQLite distributions were all working again what could we do?

If those issues are resolved we could switch over to using imports.whosonfirst.sqlite=true and add support for per-country distribution files.

As Julian noted we have not done any work to support country-specific distributions AFAIK, so this would need to happen.

Part of that work will require some understanding of how the country-specific distro files work, specifically the whosonfirst-data-admin-xx and whosonfirst-data-admin-xy repos will be required for installations requiring data above the specified country-code (such as continent, ocean, empire etc.)

One potential issue to consider is that the default compile-time options for SQLite limits the total amount of attached databases (SQLITE_MAX_ATTACHED) to 10, this can be raised to 125 but I'm not sure how easy that would be to achieve in practice as it may require a lengthly compile step on each npm install (including packages depending on the pelias/whosonfirst module).

This may or may not be an issue, it depends on how we implement it, if we're doing a streaming import then it can be avoided by only having one database open at a time, but if we wanted to do a complex join operation or resolve an inter-country hierarchy traversal query then this would be a constraint.

I think that's the glorious future of WOF support in Pelias, but requires a few data fixes and a few code updates to get us there, if we can achieve that then it will come with the benefit that building smaller 'region-specific' Pelias builds (such as the pelias/docker projects) will not require downloading the massive planet file, making it much quicker and easier to get started.

missinglink commented 4 years ago

I'm going to do some work to move this forward:

Joxit commented 4 years ago

I imagine others are doing this too and it's really not anyone's first choice as we'd like to be pulling in all the changes from the last 6 months rather than having to manually curate our own 'fork' of the canonical WOF data.

For Jawg Maps, I'm building the SQLite from GitHub repositories since December 2018 (because of whosonfirst/go-whosonfirst-dist#9). I thought using their bundles was too risky, so I chose to make mine from their sources. Now I'm updating my SQLite every 2-3 months.

This may or may not be an issue, it depends on how we implement it, if we're doing a streaming import then it can be avoided by only having one database open at a time, but if we wanted to do a complex join operation or resolve an inter-country hierarchy traversal query then this would be a constraint.

Definitely agree, doing a full planet import with the per country SQLite can be complex. Hum... Maybe not as complex as I thought, #471 highlights the fact that the placetype order is important. Only whosonfirst-data-admin-xx and whosonfirst-data-admin-xy databases are at the top of the other countries. So we could import these two databases and then all other countries. Doing this should avoid hierarchy issues.

The per country database will not be an issue for pelias/wof-admin-lookup.

orangejulius commented 4 years ago

I just opened #477 which will filter out the country-specific SQLite DBs (super easy to do thanks to @joxit's excellent code from #417).

While it's probably not what we want long term, it does help resolve the issue of downloading essentially 2 full copies of WOF data.

missinglink commented 4 years ago

Geocode Earth are now providing distribution files at https://geocode.earth/data/whosonfirst