Supports for WOF SQLite database

Joxit commented 5 years ago

This PR add supports to SQLite databases. pelias/whosonfirst will still supports bundle, for sqlite import, a new options will be added : imports.whosonfirst.sqlite (false by default for backward compatibility).

TODO list:

[x] SQLite read stream
[x] List available databases
[x] Add imports.whosonfirst.sqlite configuration
[x] Add SQLite stream in import pipeline
[x] Download SQLite database for world utils/download_data_all.js
[x] Check performances between SQLite and bundle
[x] Expose SQLite Read Stream (for pelias/wof-admin-lookup)
[x] Cleanup SQLite (this can take some times)
[x] Supports importPlace

fixes #416

Joxit commented 5 years ago

I have a question, should a provide a script for database optimisation/size reduce ? Because when we download/build the database from WOF, we have tons of useless data and we can reduce the database size by removing useless entries in geojson/spr/ancestors and droping names and concordances. Maybe this is interesting only for those who make their own WOF database and can differ between use cases :confused:.

Joxit commented 5 years ago

I did some performance tests for SQLite versus meta/bundle on a world instance.... The fact is SQLite is faster than meta/bundle because bundle is limited by IO of tiny files.

When I use filtered queries (by placetypes), meta/bundle is faster than SQLites for placetypes from oceans (7 records in 7s vs 52s) to country (210 records 23s vs 67s). But SQLite is faster from region (4k records 96s vs 115) to locality (163k records 1973s vs 6109s)

Here is the full result

place type	size	time (meta/bundle)	time (SQLite)
ocean	7	3s	52s
continent	8	3s	54s
empire	11	5s	52s
dependency	40	8s	52s
macroregion	94	11s	57s
country	210	23s	67s
borough	241	13s	53s
marinearea	305	9s	52s
macrocounty	371	12s	54s
region	4k	115s	96s
county	41k	451s	147s
localadmin	112k	922s	150s
locality	163k	6109s	1973s

The benchmark was performed on a public cloud.

Joxit commented 5 years ago

Hi @missinglink @orangejulius,

I think this part is now OK

orangejulius commented 5 years ago

Hey @Joxit, I haven't reviewed all the code yet, but this looks really good overall. In trying it out I've had some download issues, always with the main sqlite database. Here's one example of an error:

done downloading whosonfirst-data-latest.db.bz2                                      
error downloading whosonfirst-data-latest.db.bz2Error: stderr maxBuffer exceeded     
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current      
                                 Dload  Upload   Total   Spent    Left  Speed        
 18 6454M   18 1199M    0     0   474k      0  3:52:16  0:43:10  3:09:06  481k

To my knowledge it wasn't caused by disk or RAM issues. The sqlite DB is bigger than any of the bundles, do you think there's a possibility of the existing download code having problems as filesizes grow?

I'll let you know how it looks after I've fully tested an import.

Joxit commented 5 years ago

Hey,

It's not the first time I see this error, it's related to node and child process. On a low connection or big file, curl do too much output for node and fulfill the buffer. We shoul invoke curl in silent mode and this will fix the issue (even for bundles).

orangejulius commented 5 years ago

Hey @Joxit, I was able to test out this code and it seems to work well!

I had to change the exec line in ./bin/start to exec node --max_old_space_size=4096 import.js in order to avoid memory limit issues, but it imports just fine.

I did notice it's quite a bit slower to import than the file based import (38 minutes vs 11 minutes on my system), but that's not a huge concern, especially considering the pip-service looks like it will actually load faster.

Joxit commented 5 years ago

Hey @orangejulius,

Yeah ! That's cool :smile: Change done !

Hum... That's a bit sad if SQLite import is slower than file base import.... :disappointed:

Joxit commented 5 years ago

@orangejulius did you see any differences in the number of imported elements ?

missinglink commented 5 years ago

heya, sorry it's taken so long for me to get to this, I'd like to push this forward and get it merged.

@Joxit do you have any other commits you'd like to add or is it good-to-go? Do you have any experience running it in a production setting?

The whosonfirst bundles are falling quite far out of date and the time is now right to drop support for them.

missinglink commented 5 years ago

I just tested this out on with pelias/docker for the portland-metro project and it ran for a really long time...

It seems as though the importPlace section is not being honoured by the whosonfirst importer in sqlite: true mode, I was importing the whole planet 😝

I'm guessing your intention is that the user should download the full planet sqlite database and then use the bin/sqlite-clean script to remove any superfluous rows from the database, reducing the filesize in the process?

If that's the case then we will need to modify bin/sqlite-clean to honour importPlace.

Although I would suggest a different approach...

We could instead simply filter out the superfluous records within the query clause itself.

The advantage would be that the sqlite database remains unmodified on disk, so developers can continue using it for debugging and for later expanding their geographic target area without re-downloading the massive db again each time.

This would also have the advantage of playing well with tooling that hashes the database when comparing against a copy stored on the whosonfirst.org servers.

I think in some production use-cases it might be beneficial to 'prune' the database using bin/sqlite-clean, so it's great to have around for that use-case if required, but for the general populous I think having more bytes on disk is a good tradeoff when they aren't yet sure of what data they may or may not need from the planet database.

Let me know what you think @Joxit, I know you've spent a bunch of time on this already, so I could do the dev work and you could do the review if that sounds preferrable to you?

Joxit commented 5 years ago

It's ok :wink:, thank you for your review !

Yes, we are using this in production since last month, especially for pip-service and it works like a charm.

Yes, I was targeting only full planet import with this PR, which need too many files. That why importPlaces is ignored :sweat_smile:

You are right, bin/sqlite-clean should honour importPlaces, I completely forgot this part. This can be also useful for small configurations (laptop).

I think filter the records in the query will increase the import time and will be slower than the file-based version. We should try :smile:

I will try to make the changes before I go to Japan at the end of the month.

Joxit commented 5 years ago

Hi @missinglink !

Now this PR supports importPlaces via sqlite import :grin:

missinglink commented 5 years ago

Thanks @Joxit I had some issues running it today but solved them all in https://github.com/pelias/whosonfirst/pull/424, could you please review and merge those commits in here?

I've made some changes to how the iterator works based off the docs:

... _read() should continue reading from the resource and pushing data until readable.push() returns false

Hopefully this results in a speed improvement, combined with increasing the highwaterMark will mean that 32 records will now be read from the database per tick of the event loop instead of reading 1 per tick.

missinglink commented 5 years ago

Something is broken with the Travis-CI integration, the tests are showing as passing here https://travis-ci.org/pelias/whosonfirst/builds/506783601

missinglink commented 5 years ago

I think this is good-to-merge now.

I like that this is disabled by default, but I'd also like to make this the new default setting once it's been tested thoroughly. For the time being, It's important that we continue to support backwards compatibility with the bundles and continue fixing any errors that arise relating to bundles.

After merging this I'm going to test it again and then I will gradually update the projects in pelias/docker to use config sqlite: true, the docker projects are fairly active and so we'll get feedback fairly quickly if something is not working correctly.

PR https://github.com/whosonfirst/go-whosonfirst-sqlite-features/pull/4 got merged using the same index name I chose, so the good news is that soon enough the canonical distributions will include this index by default, in the interim, there will be a first-access penalty incurred when generating the index in databases which don't have it.

Overall it's a really important addition to the codebase, so thanks @Joxit for all your work 🎉

orangejulius commented 5 years ago

Yeah, this is really awesome, and represents an important step forward in managing WOF data.

Two questions:

Is it still possible to override the location where downloads come from via the dataHost config option (last one in the list of configuration options)? This is important for those of us running services based on WOF data as it allows us to provide our own custom data patches without having to wait for the WOF team to merge
With all the performance increases from indicies, better stream handling, etc, what sort of loading performance do we see now? The most important metric is time to load a full set of WOF data into the PIP-service.

missinglink commented 5 years ago

@orangejulius good questions

A. This code doesn't include any download functionality, it just scans the $DATA_DIR/whosonfirst/sqlite directory for .db files which have previously been downloaded.

B. I can't give a definitive number for this yet, but we will certainly have to do some testing there before enabling sqlite: true in production environments.

I'm going to go ahead and merge this under the feature flag which disables it by default, we can then work towards our goal of switching from bundles to sqlite databases over the next few weeks as our confidence in it emboldens.

pelias / whosonfirst

Supports for WOF SQLite database #417