pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License
27 stars 43 forks source link

Download sqlite default endpoint to geocode.earth and remove bundle support #487

Closed Joxit closed 4 years ago

Joxit commented 4 years ago

BREAKING CHANGE: drop support for wof bundles BREAKING CHANGE: imports.whosonfirst.sqlite default to true

Background

WOF database is hosted on geocode.earth, both per country and global admin are available, I'm updating the default endpoint in favor of geocode.earth.

What's new ?

A new config is available config.imports.whosonfirst.countries. This configuration take care advantage of the per country databases, it will allow you to download only the wanted countries. If the config is not present, it will download the global databases.

Minor fix introduced by #483 bunzip2 and not bunzip

Discussion

  1. In this PR, I did only the SQLite part, the bundle download is not compatible with geocode.earth. The old way was per placetype, now it's admin/postalcode/constituency. That's means old users with a custom download URL will have some issues. Breaking change
  2. importPlaces are very useful for small imports, now the per country is available, should we deprecate it ? Or use both countries and importPlaces ?
  3. importPlaces extract data in bundle format, should we deprecate this part to ? Deprecated
    • Pros:
      • Code less complicated
      • Easier to maintain (SQLite everywhere)
    • Cons:
      • Bundle format can be faster than SQLite at runtime...
  4. Should we remove the support of venue ?

related: pelias/whosonfirst#460 pelias/whosonfirst#469 pelias/whosonfirst#477 pelias/wof-admin-lookup#289

Fixes #469 Closes #460 Fixes #437

missinglink commented 4 years ago

FYI we don't currently build venue data, I could add that but I assumed it was old, unmaintained and noone was using it?

Joxit commented 4 years ago

I agree with you. I also think that nobody uses venue :man_shrugging:. I added venue in the white-list just in case.

Edit discussion: Should we remove venue ?

missinglink commented 4 years ago

Actually Julian and I discussed it on a call today.

I think WOF venues are 'soft deprecated', our experience with them is that the data is old and hasn't had new contributions for a long time.

As such we (Geocode Earth) will not be publishing dist files for WOF venues, the official dist.whosonfirst.org downloads are probably as up-to-date as they will ever be.

Regarding Pelias, it seems a pity to delete code which supports WOF venues since other developers may benefit from it, but if it becomes a maintenance burden then it may be ok to do so.

missinglink commented 4 years ago

So... Maybe we just have a generic warning message that the file you requested isn't available on the host you specified and leave it at that for now?

missinglink commented 4 years ago

If we want to be very thorough we could grab inventory files from both hosts and compare the last_modified time to find the best download.

Maybe this is overkill? But it would allow 'fallback' to dist.whosonfirst.org for files not published by GE

Joxit commented 4 years ago

Regarding Pelias, it seems a pity to delete code which supports WOF venues since other developers may benefit from it, but if it becomes a maintenance burden then it may be ok to do so.

You are right... So, if and only if venue are activated, we can add download it from WOF ? Maybe I have an elegant way to do this :thinking: I will try it tomorrow :)

Maybe this is overkill? But it would allow 'fallback' to dist.whosonfirst.org for files not published by GE

Yes this is overkill :sweat_smile:

Since we want to promote SQLite over Bundles, we can migrate to data GE only SQLite download part ? With some deprecation warnings ?

Joxit commented 4 years ago

Updated, I added the backward compatibility for venues via WOF

missinglink commented 4 years ago

Nice, I just checked on the build and it's finished, just syncing the files to the bucket, so it should be available shortly.

orangejulius commented 4 years ago

Cool, I'm currently testing this out (actually, on top of the minor fix in #484). It seems to work well so far but unfortunately we don't have a lot of automated testing around the download functionality :( :(

I think if we are going to merge this and default to Geocode Earth's WOF data (which I fully support), then we need to do one of two things:

The first of these would mean that we lose compatibility with the data on dist.whosonfirst.org, but it means we wouldn't have to add yet another backwards-compatibility complexity to the WOF build scripts. I understand they've already gotten quite complicated and it's actually really very hard to reliably and automatically build all the WOF data we need on a regular basis.

orangejulius commented 4 years ago

Also, I extracted the bugfix for the bunzip2 command into #488, so eventually we can rebase this against master and keep the changes in this PR limited to just what's needed :)

Joxit commented 4 years ago

With this PR, by default it will download whosonfirst-data-admin-latest.db from Geocode Earth. But this will fail if the source is something like WOF... They do not provide whosonfirst-data-admin-latest.db... Only per country extracts so this PR will break old builds :slightly_frowning_face:

What can we do ? Download all whosonfirst-data-(admin|postalcode|venue)-{country}-latest or whosonfirst-data-latest.db when a custom source is used ?

I have already sync the PR :wink:

Joxit commented 4 years ago

New commit to be compatible with WOF for full planet download

orangejulius commented 4 years ago

Hi! Big news today!

Who's on First has turned off their downloads at dist.whosonfirst.org. Screenshot_2020-04-17_10-55-54

I haven't been following recent changes to the WOF importer super closely, but that really simplifies things for this PR, doesn't it? There's now no need at all to worry about backwards compatibility with dist.whosonfirst.org.

Joxit commented 4 years ago

Oh! That's interesting !

Yes, this simplify a little bit the PR, but there are still users using custom URLs :disappointed:. One question remains, what is the future of venues ?

missinglink commented 4 years ago

We have no plans to publish venues distributions at this time.

I call a vote to remove support for WOF venues in order to simplify the code.

:+1: for yes :-1: for needs further discussion

pixeldublu commented 4 years ago

This should be pushed forward. Wof just disabled their downloads. But the changes from the pr are not enough. The structure from https://geocode.earth/ is different and lots of things are not working anymore...

Joxit commented 4 years ago

Hi @pixeldublu I started this PR in March and the WOF change was made last Friday.

This PR was originally made for SQLite download only. Are you using SQLite or bundle download ? Why this PR is not working for you ?

pixeldublu commented 4 years ago

For example venues are not available anymore... and i the bundles also not available anymore...

Joxit commented 4 years ago

Yes, check out our discussion here https://github.com/pelias/whosonfirst/pull/487#issuecomment-616404444 We are planning to remove the support of venues... Venues are not really updated by WOF team and seems to be outdated... That's why we want to remove this support.

If you need further discussion, fell free :slightly_smiling_face:

The bundle was out of the scope of this PR. I will add it later.

pixeldublu commented 4 years ago

I was trying to make individual db downloads last week and then this wof download disable happened :)

So i made some changes to my PR to use geocode earth and works but its not mergeable yet. Looking for improvement ideas.

Joxit commented 4 years ago

Big commit.... BREAKING CHANGE is coming !

I dropped the support for bundle, the importPlaces download case is no longer useful thanks to per country download, so I removed this code too. I removed unused dependencies. schema update: sqlite default to true !

orangejulius commented 4 years ago

Oh, importPlace is still quite useful, if you can keep it. It allows areas smaller than a country to be imported, reducing the number of records that aren't relevant for a specific use case.

Joxit commented 4 years ago

Only the download part has be deleted (download of the 40Go SQLites + generating bundle like architecture).

The importPlace feature is always be present for import and pip-service. Now, instead of generating bundle like architecture, we will load only interesting documents with this piece of code from #417:

https://github.com/pelias/whosonfirst/blob/d7f5b51d8804dd0d4ece601fa922ced544eda8cd/src/components/sqliteStream.js#L73-L91

That means it will work like a charm even for small areas :slightly_smiling_face:

For example, I want only Ile-de-France macroregion, the config will be:

{
  "datapath": "/opt/whosonfirst/",
  "importPostalcodes": true,
  "countries": "fr",
  "importPlace": 404227465,
  "sqlite": true
}

Inport time is 2 secondes and the last log is:

2020-04-21T22:42:31.990Z - info: [dbclient-whosonfirst]  paused=false, transient=0, current_length=0, indexed=5009, batch_ok=11, batch_retries=0, failed_records=0, localadmin=1277, country=1, macroregion=1, region=8, macrocounty=24, county=249, locality=1982, borough=20, macrohood=4, neighbourhood=914, postalcode=529, persec=500.9

Full France import time is 1min16s and the last log is:

2020-04-21T20:24:16.887Z - info: [dbclient-whosonfirst]  paused=false, transient=0, current_length=0, indexed=109398, batch_ok=219, batch_retries=0, failed_records=0, country=1, dependency=4, disputed=2, macroregion=13, region=101, macrocounty=328, county=3232, localadmin=35291, persec=889.8, locality=58987, borough=45, macrohood=4, neighbourhood=5336, postalcode=6054
orangejulius commented 4 years ago

Oh, duh, right! Import place is no longer needed at download time :)

I just tested it out and I think it works great. Thanks for taking care of this super fast! :speedboat:

Anything left or should we hit merge?

Joxit commented 4 years ago

importVenues, missingFilesAreFatal and importIntersections are still present in the schema. Should I remove them ?

https://github.com/pelias/whosonfirst/blob/d7f5b51d8804dd0d4ece601fa922ced544eda8cd/schema.js#L17-L34

sqlite is set to true, at import time, if this is not true I throw an exception, it's ok ?

https://github.com/pelias/whosonfirst/blob/d7f5b51d8804dd0d4ece601fa922ced544eda8cd/src/readStream.js#L52-L54

missinglink commented 4 years ago

Yes, I think if the code is not going to be able to support a configuration value then it should fail loudly and with a clear message so that users know how to fix their config. If a config migration is required we could have a page in pelias/config to help users do that without opening an issue.

Not sure if I brought this up before but the new variable in whosonfirst is called countries whereas the one in geonames is called countryCode. It's really not a big deal (and I don't really care that they are different) but I thought I'd mention it 🤷

missinglink commented 4 years ago

Sorry for the nitpick but can we please change the error message to 'Bundles are no longer supported!'

missinglink commented 4 years ago

I've never been so pleased to see so many deletions ;)

There are still some venue specific code (and regex patterns) in utils/download_sqlite_all.js, can we remove these?

missinglink commented 4 years ago

Regarding the admin-xy repo that we talked about in the past, on reflection, I think this isn't strictly required for most installations as it includes planet continent timezone and empire records which the majority of users probably don't want in their index anyway?

There are definitely other placetypes in there but it's probably not worth forcing users to download admin-xy?

sqlite3 whosonfirst-data-admin-xy-latest.db

sqlite> SELECT placetype, COUNT(*) AS count FROM spr GROUP BY placetype ORDER BY placetype;
placetype|count
campus|13826
continent|8
county|34
empire|12
localadmin|1
locality|2033
marinearea|305
neighbourhood|1945
ocean|7
planet|1
region|23
timezone|376
missinglink commented 4 years ago

Can we also drop support for WOF intersections?

I don't recall anyone ever using that feature and the data is static, we have ways of generating intersection datasets from OSM for those who require something similar and I think it'll be more suitable.

Joxit commented 4 years ago

Hey ! New commit !

I feel that the intersections were already filtered :thinking:

missinglink commented 4 years ago

:tada:

Joxit commented 4 years ago

Well done everyone :smile: :rocket:

orangejulius commented 4 years ago

Well done mostly to you @Joxit for doing most of the work to start this PR and the one in Placeholder :)

I just had some fun adding countryCode to all our Docker projects and the fast, small downloads are so nice! :tada:

If there are any bugs they probably came in from all my rebasing, let us know :)