I have a question: should I provide a script for database optimisation/size reduction?
Because when we download/build the database from WOF, we have tons of useless data, and we can reduce the database size by removing useless entries in `geojson`, `spr` and `ancestors`, and dropping `names` and `concordances`.
Maybe this is interesting only for those who build their own WOF database, and it may differ between use cases :confused:.
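A rough sketch of what such a cleanup script could look like (purely illustrative, assuming the `better-sqlite3` module and the standard WOF table/column names, not the actual script from this PR):

```js
// Illustrative cleanup sketch, not the actual bin/sqlite-clean script.
// Assumes the better-sqlite3 module and the standard WOF sqlite schema
// (geojson, spr, ancestors, names, concordances tables).
const Database = require('better-sqlite3');

const db = new Database('whosonfirst-data-latest.db');

// tables the importer never reads can simply be dropped
db.exec('DROP TABLE IF EXISTS names');
db.exec('DROP TABLE IF EXISTS concordances');

// remove records we will never import from the remaining tables,
// e.g. deprecated records (column name assumed from the WOF spr schema)
db.exec('DELETE FROM geojson WHERE id IN (SELECT id FROM spr WHERE is_deprecated = 1)');
db.exec('DELETE FROM ancestors WHERE id IN (SELECT id FROM spr WHERE is_deprecated = 1)');
db.exec('DELETE FROM spr WHERE is_deprecated = 1');

// reclaim the freed pages so the file actually shrinks on disk
db.exec('VACUUM');
db.close();
```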
I did some performance tests comparing SQLite with meta/bundle on a world instance... The fact is SQLite is faster than meta/bundle overall, because the bundle is limited by the IO of tiny files.
When I use filtered queries (by placetype), meta/bundle is faster than SQLite for placetypes from ocean (7 records, 7s vs 52s) to country (210 records, 23s vs 67s), but SQLite is faster from region (4k records, 96s vs 115s) to locality (163k records, 1973s vs 6109s).
Here are the full results:
place type | records | time (meta/bundle) | time (SQLite) |
---|---|---|---|
ocean | 7 | 3s | 52s |
continent | 8 | 3s | 54s |
empire | 11 | 5s | 52s |
dependency | 40 | 8s | 52s |
macroregion | 94 | 11s | 57s |
country | 210 | 23s | 67s |
borough | 241 | 13s | 53s |
marinearea | 305 | 9s | 52s |
macrocounty | 371 | 12s | 54s |
region | 4k | 115s | 96s |
county | 41k | 451s | 147s |
localadmin | 112k | 922s | 150s |
locality | 163k | 6109s | 1973s |
The benchmark was performed on a public cloud.
Hi @missinglink @orangejulius,
I think this part is now OK
Hey @Joxit, I haven't reviewed all the code yet, but this looks really good overall. In trying it out I've had some download issues, always with the main sqlite database. Here's one example of an error:
done downloading whosonfirst-data-latest.db.bz2
error downloading whosonfirst-data-latest.db.bz2
Error: stderr maxBuffer exceeded
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
18 6454M 18 1199M 0 0 474k 0 3:52:16 0:43:10 3:09:06 481k
To my knowledge it wasn't caused by disk or RAM issues. The sqlite DB is bigger than any of the bundles; do you think there's a possibility of the existing download code having problems as filesizes grow?
I'll let you know how it looks after I've fully tested an import.
Hey,
It's not the first time I've seen this error; it's related to node and child_process. On a slow connection or with a big file, curl produces too much output for node and fills up the buffer. We should invoke curl in silent mode and this will fix the issue (even for bundles).
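A minimal sketch of that fix, assuming the download step shells out to curl via `child_process.exec` (the URL and filename below are placeholders):

```js
// Sketch of the proposed fix, assuming the download code shells out to curl
// via child_process.exec (the URL and output filename are placeholders).
const { exec } = require('child_process');

// --silent suppresses the progress meter that was filling node's stderr
// buffer on slow connections / big files; --show-error keeps real errors.
const cmd = 'curl --silent --show-error -L -o whosonfirst-data-latest.db.bz2 ' +
  'https://example.org/whosonfirst-data-latest.db.bz2';

exec(cmd, (err, stdout, stderr) => {
  if (err) {
    console.error('error downloading', stderr);
    return;
  }
  console.log('done downloading');
});
```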
Hey @Joxit, I was able to test out this code and it seems to work well!
I had to change the exec line in `./bin/start` to `exec node --max_old_space_size=4096 import.js` in order to avoid memory limit issues, but it imports just fine.
I did notice it's quite a bit slower to import than the file based import (38 minutes vs 11 minutes on my system), but that's not a huge concern, especially considering the pip-service looks like it will actually load faster.
Hey @orangejulius,
Yeah! That's cool :smile: Change done!
Hmm... That's a bit sad if the SQLite import is slower than the file-based import... :disappointed:
@orangejulius did you see any differences in the number of imported elements?
Heya, sorry it's taken so long for me to get to this; I'd like to push this forward and get it merged.
@Joxit do you have any other commits you'd like to add or is it good-to-go? Do you have any experience running it in a production setting?
The whosonfirst bundles are falling quite far out of date and the time is now right to drop support for them.
I just tested this out with `pelias/docker` for the `portland-metro` project and it ran for a really long time...
It seems as though the `importPlace` section is not being honoured by the `whosonfirst` importer in `sqlite: true` mode; I was importing the whole planet 😝
I'm guessing your intention is that the user should download the full planet sqlite database and then use the `bin/sqlite-clean` script to remove any superfluous rows from the database, reducing the filesize in the process?
If that's the case then we will need to modify `bin/sqlite-clean` to honour `importPlace`.
Although I would suggest a different approach...
We could instead simply filter out the superfluous records within the query clause itself.
The advantage would be that the sqlite database remains unmodified on disk, so developers can continue using it for debugging and for later expanding their geographic target area without re-downloading the massive db each time.
This would also have the advantage of playing well with tooling that hashes the database when comparing against a copy stored on the whosonfirst.org servers.
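A rough sketch of what that query-time filtering could look like, assuming `better-sqlite3` and the standard WOF `geojson`/`ancestors` tables (the wof id is a hypothetical `importPlace` value, not taken from this PR):

```js
// Rough sketch of query-time filtering, not the actual importer code.
// Assumes better-sqlite3 and the standard WOF geojson/ancestors tables.
const Database = require('better-sqlite3');

const db = new Database('whosonfirst-data-latest.db', { readonly: true });
const importPlace = 85688513; // hypothetical id from imports.whosonfirst.importPlace

// only select records that are descendants of the configured importPlace,
// leaving the database file itself untouched on disk
const stmt = db.prepare(`
  SELECT geojson.body
  FROM geojson
  JOIN ancestors ON ancestors.id = geojson.id
  WHERE ancestors.ancestor_id = ?
`);

for (const row of stmt.iterate(importPlace)) {
  // hand row.body to the import pipeline here
}
```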
I think in some production use-cases it might be beneficial to 'prune' the database using `bin/sqlite-clean`, so it's great to have around for that use-case if required, but for the general populace I think having more bytes on disk is a good tradeoff when they aren't yet sure of what data they may or may not need from the planet database.
Let me know what you think @Joxit, I know you've spent a bunch of time on this already, so I could do the dev work and you could do the review, if that sounds preferable to you?
It's ok :wink:, thank you for your review!
Yes, we have been using this in production since last month, especially for the pip-service, and it works like a charm.
Yes, I was only targeting the full planet import with this PR, which needs too many files. That's why `importPlace` is ignored :sweat_smile:
You are right, `bin/sqlite-clean` should honour `importPlace`, I completely forgot this part. This can also be useful for small configurations (laptop).
I think filtering the records in the query will increase the import time and will be slower than the file-based version. We should try :smile:
I will try to make the changes before I go to Japan at the end of the month.
Hi @missinglink!
Now this PR supports `importPlace` via sqlite import :grin:
Thanks @Joxit I had some issues running it today but solved them all in https://github.com/pelias/whosonfirst/pull/424, could you please review and merge those commits in here?
I've made some changes to how the iterator works based off the docs:
... _read() should continue reading from the resource and pushing data until readable.push() returns false
Hopefully this results in a speed improvement; combined with increasing the `highWaterMark`, it means that 32 records will now be read from the database per tick of the event loop instead of 1 per tick.
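A minimal sketch of that pattern (illustrative only, not the actual PR code; the `rows` iterator stands in for whatever yields records from sqlite):

```js
// Minimal sketch of the _read() pattern described in the Node stream docs.
const { Readable } = require('stream');

class RecordStream extends Readable {
  constructor(rows) {
    // objectMode with a highWaterMark of 32 lets the stream buffer
    // 32 records per tick of the event loop instead of 1
    super({ objectMode: true, highWaterMark: 32 });
    this.rows = rows; // an iterator of wof records
  }

  _read() {
    // keep pushing until push() returns false (internal buffer full)
    // or the source is exhausted (push(null) ends the stream)
    let next = this.rows.next();
    while (!next.done && this.push(next.value)) {
      next = this.rows.next();
    }
    if (next.done) {
      this.push(null);
    }
  }
}
```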
Something is broken with the Travis-CI integration; the tests are showing as passing here: https://travis-ci.org/pelias/whosonfirst/builds/506783601
I think this is good-to-merge now.
I like that this is disabled by default, but I'd also like to make this the new default setting once it's been tested thoroughly. For the time being, it's important that we continue to support backwards compatibility with the bundles and continue fixing any errors that arise relating to bundles.
After merging this I'm going to test it again, and then I will gradually update the projects in pelias/docker to use the config `sqlite: true`; the docker projects are fairly active, so we'll get feedback fairly quickly if something is not working correctly.
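For reference, a minimal pelias.json fragment enabling the flag could look something like this (the `datapath` value is just an example, not a project default):

```json
{
  "imports": {
    "whosonfirst": {
      "datapath": "/data/whosonfirst",
      "sqlite": true
    }
  }
}
```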
PR https://github.com/whosonfirst/go-whosonfirst-sqlite-features/pull/4 got merged using the same index name I chose, so the good news is that soon enough the canonical distributions will include this index by default. In the interim, there will be a first-access penalty incurred when generating the index in databases which don't have it.
Overall it's a really important addition to the codebase, so thanks @Joxit for all your work 🎉
Yeah, this is really awesome, and represents an important step forward in managing WOF data.
Two questions:
- `dataHost` config option (last one in the list of configuration options)? This is important for those of us running services based on WOF data as it allows us to provide our own custom data patches without having to wait for the WOF team to merge

@orangejulius good questions
A. This code doesn't include any download functionality; it just scans the `$DATA_DIR/whosonfirst/sqlite` directory for `.db` files which have previously been downloaded.
B. I can't give a definitive number for this yet, but we will certainly have to do some testing there before enabling `sqlite: true` in production environments.
I'm going to go ahead and merge this under the feature flag which disables it by default; we can then work towards our goal of switching from bundles to sqlite databases over the next few weeks as our confidence in it grows.
This PR adds support for SQLite databases. pelias/whosonfirst will still support bundles; for sqlite imports, a new option is added: `imports.whosonfirst.sqlite` (false by default for backward compatibility).
TODO list:
- `imports.whosonfirst.sqlite` configuration
- `utils/download_data_all.js`

fixes #416