pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License
28 stars 42 forks source link

Supports for WOF SQLite database #417

Closed Joxit closed 5 years ago

Joxit commented 5 years ago

This PR add supports to SQLite databases. pelias/whosonfirst will still supports bundle, for sqlite import, a new options will be added : imports.whosonfirst.sqlite (false by default for backward compatibility).

TODO list:

fixes #416

Joxit commented 5 years ago

I have a question, should a provide a script for database optimisation/size reduce ? Because when we download/build the database from WOF, we have tons of useless data and we can reduce the database size by removing useless entries in geojson/spr/ancestors and droping names and concordances. Maybe this is interesting only for those who make their own WOF database and can differ between use cases :confused:.

Joxit commented 5 years ago

I did some performance tests for SQLite versus meta/bundle on a world instance.... The fact is SQLite is faster than meta/bundle because bundle is limited by IO of tiny files.

When I use filtered queries (by placetypes), meta/bundle is faster than SQLites for placetypes from oceans (7 records in 7s vs 52s) to country (210 records 23s vs 67s). But SQLite is faster from region (4k records 96s vs 115) to locality (163k records 1973s vs 6109s)

Here is the full result

place type size time (meta/bundle) time (SQLite)
ocean 7 3s 52s
continent 8 3s 54s
empire 11 5s 52s
dependency 40 8s 52s
macroregion 94 11s 57s
country 210 23s 67s
borough 241 13s 53s
marinearea 305 9s 52s
macrocounty 371 12s 54s
region 4k 115s 96s
county 41k 451s 147s
localadmin 112k 922s 150s
locality 163k 6109s 1973s

The benchmark was performed on a public cloud.

Joxit commented 5 years ago

Hi @missinglink @orangejulius,

I think this part is now OK

orangejulius commented 5 years ago

Hey @Joxit, I haven't reviewed all the code yet, but this looks really good overall. In trying it out I've had some download issues, always with the main sqlite database. Here's one example of an error:

done downloading whosonfirst-data-latest.db.bz2                                      
error downloading whosonfirst-data-latest.db.bz2Error: stderr maxBuffer exceeded     
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current      
                                 Dload  Upload   Total   Spent    Left  Speed        
 18 6454M   18 1199M    0     0   474k      0  3:52:16  0:43:10  3:09:06  481k 

To my knowledge it wasn't caused by disk or RAM issues. The sqlite DB is bigger than any of the bundles, do you think there's a possibility of the existing download code having problems as filesizes grow?

I'll let you know how it looks after I've fully tested an import.

Joxit commented 5 years ago

Hey,

It's not the first time I see this error, it's related to node and child process. On a low connection or big file, curl do too much output for node and fulfill the buffer. We shoul invoke curl in silent mode and this will fix the issue (even for bundles).

orangejulius commented 5 years ago

Hey @Joxit, I was able to test out this code and it seems to work well!

I had to change the exec line in ./bin/start to exec node --max_old_space_size=4096 import.js in order to avoid memory limit issues, but it imports just fine.

I did notice it's quite a bit slower to import than the file based import (38 minutes vs 11 minutes on my system), but that's not a huge concern, especially considering the pip-service looks like it will actually load faster.

Joxit commented 5 years ago

Hey @orangejulius,

Yeah ! That's cool :smile: Change done !

Hum... That's a bit sad if SQLite import is slower than file base import.... :disappointed:

Joxit commented 5 years ago

@orangejulius did you see any differences in the number of imported elements ?

missinglink commented 5 years ago

heya, sorry it's taken so long for me to get to this, I'd like to push this forward and get it merged.

@Joxit do you have any other commits you'd like to add or is it good-to-go? Do you have any experience running it in a production setting?

The whosonfirst bundles are falling quite far out of date and the time is now right to drop support for them.

missinglink commented 5 years ago

I just tested this out on with pelias/docker for the portland-metro project and it ran for a really long time...

It seems as though the importPlace section is not being honoured by the whosonfirst importer in sqlite: true mode, I was importing the whole planet 😝

I'm guessing your intention is that the user should download the full planet sqlite database and then use the bin/sqlite-clean script to remove any superfluous rows from the database, reducing the filesize in the process?

If that's the case then we will need to modify bin/sqlite-clean to honour importPlace.

Although I would suggest a different approach...

We could instead simply filter out the superfluous records within the query clause itself.

The advantage would be that the sqlite database remains unmodified on disk, so developers can continue using it for debugging and for later expanding their geographic target area without re-downloading the massive db again each time.

This would also have the advantage of playing well with tooling that hashes the database when comparing against a copy stored on the whosonfirst.org servers.

I think in some production use-cases it might be beneficial to 'prune' the database using bin/sqlite-clean, so it's great to have around for that use-case if required, but for the general populous I think having more bytes on disk is a good tradeoff when they aren't yet sure of what data they may or may not need from the planet database.

Let me know what you think @Joxit, I know you've spent a bunch of time on this already, so I could do the dev work and you could do the review if that sounds preferrable to you?

Joxit commented 5 years ago

It's ok :wink:, thank you for your review !

Yes, we are using this in production since last month, especially for pip-service and it works like a charm.

Yes, I was targeting only full planet import with this PR, which need too many files. That why importPlaces is ignored :sweat_smile:

You are right, bin/sqlite-clean should honour importPlaces, I completely forgot this part. This can be also useful for small configurations (laptop).

I think filter the records in the query will increase the import time and will be slower than the file-based version. We should try :smile:

I will try to make the changes before I go to Japan at the end of the month.

Joxit commented 5 years ago

Hi @missinglink !

Now this PR supports importPlaces via sqlite import :grin:

missinglink commented 5 years ago

Thanks @Joxit I had some issues running it today but solved them all in https://github.com/pelias/whosonfirst/pull/424, could you please review and merge those commits in here?

I've made some changes to how the iterator works based off the docs:

... _read() should continue reading from the resource and pushing data until readable.push() returns false

Hopefully this results in a speed improvement, combined with increasing the highwaterMark will mean that 32 records will now be read from the database per tick of the event loop instead of reading 1 per tick.

missinglink commented 5 years ago

Something is broken with the Travis-CI integration, the tests are showing as passing here https://travis-ci.org/pelias/whosonfirst/builds/506783601

missinglink commented 5 years ago

I think this is good-to-merge now.

I like that this is disabled by default, but I'd also like to make this the new default setting once it's been tested thoroughly. For the time being, It's important that we continue to support backwards compatibility with the bundles and continue fixing any errors that arise relating to bundles.

After merging this I'm going to test it again and then I will gradually update the projects in pelias/docker to use config sqlite: true, the docker projects are fairly active and so we'll get feedback fairly quickly if something is not working correctly.

PR https://github.com/whosonfirst/go-whosonfirst-sqlite-features/pull/4 got merged using the same index name I chose, so the good news is that soon enough the canonical distributions will include this index by default, in the interim, there will be a first-access penalty incurred when generating the index in databases which don't have it.

Overall it's a really important addition to the codebase, so thanks @Joxit for all your work 🎉

orangejulius commented 5 years ago

Yeah, this is really awesome, and represents an important step forward in managing WOF data.

Two questions:

missinglink commented 5 years ago

@orangejulius good questions

A. This code doesn't include any download functionality, it just scans the $DATA_DIR/whosonfirst/sqlite directory for .db files which have previously been downloaded.

B. I can't give a definitive number for this yet, but we will certainly have to do some testing there before enabling sqlite: true in production environments.

I'm going to go ahead and merge this under the feature flag which disables it by default, we can then work towards our goal of switching from bundles to sqlite databases over the next few weeks as our confidence in it emboldens.