pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License

bzip2 to lbzip2 migration to use all CPU cores #481

Closed loadit1 closed 4 years ago

loadit1 commented 4 years ago

Summary of changes:

  1. If lbzip2 is installed on the system, it is used; otherwise we fall back to legacy bzip2.
  2. Updated Dockerfile to install lbzip2
  3. Additional dependency: command-exists
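
The fallback in item 1 could look roughly like the sketch below. The PR itself relies on the command-exists package for detection; to stay dependency-free, this sketch probes with Node's built-in child_process instead, which is an assumption and not the PR's code. The names commandAvailable and pickBzip2Command are hypothetical.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical probe: try to run `<cmd> --help` and report whether the
// binary could be started at all (spawnSync sets `error` on ENOENT).
function commandAvailable(cmd) {
  const result = spawnSync(cmd, ['--help'], { stdio: 'ignore' });
  return !result.error;
}

// Prefer the multi-threaded lbzip2, fall back to legacy bzip2.
// `isAvailable` is injectable so the choice can be tested without
// either binary being installed.
function pickBzip2Command(isAvailable = commandAvailable) {
  return isAvailable('lbzip2') ? 'lbzip2' : 'bzip2';
}

console.log(`decompressing with: ${pickBzip2Command()}`);
```

Making the check once at startup and reusing the result avoids spawning a probe process for every archive.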

I tested the lbzip2 version on my machine with the command npm run download. The results are below:

Using lbzip2

542.68user 101.13system 1:49.16elapsed 589%CPU (0avgtext+0avgdata 210392maxresident)k 201176inputs+15375728outputs (1001major+1221746minor)pagefaults 0swaps

Using bzip2

526.88user 168.91system 3:01.95elapsed 382%CPU (0avgtext+avgdata 48292maxresident)k 91608inputs+15375736outputs (275major+70098minor)pagefaults 0swaps

Frankly speaking, the difference in speed is modest (1:49 vs 3:02 elapsed), because the download step already runs several downloads and several bzip2 instances in parallel (see const simultaneousDownloads in download_data_all.js, for example). Peak memory consumption is about 4 times higher for lbzip2 (roughly 200MB vs 50MB).
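
To make the comparison concrete, the ratios can be worked out from the two time reports above. The sketch below only redoes that arithmetic; the variable and function names are illustrative, and the elapsed parser assumes the m:ss.ss format seen in these reports.

```javascript
// Values copied from the two `time` reports above.
const runs = {
  lbzip2: { elapsed: '1:49.16', maxResidentKB: 210392 },
  bzip2: { elapsed: '3:01.95', maxResidentKB: 48292 },
};

// Convert an elapsed time in m:ss.ss form to seconds.
function elapsedToSeconds(elapsed) {
  const [minutes, seconds] = elapsed.split(':').map(Number);
  return minutes * 60 + seconds;
}

const speedup =
  elapsedToSeconds(runs.bzip2.elapsed) / elapsedToSeconds(runs.lbzip2.elapsed);
const memoryRatio = runs.lbzip2.maxResidentKB / runs.bzip2.maxResidentKB;

console.log(`wall-clock speedup: ${speedup.toFixed(2)}x`); // 1.67x
console.log(`peak memory ratio: ${memoryRatio.toFixed(2)}x`); // 4.36x
```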

The reason I started investigating this is that PiP hardcodes a single sqlite file, whosonfirst-data-latest.db, which is extracted extremely slowly on a single CPU core when lbzip2 is not used.
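
For a single large archive like this one, the extraction step might be sketched as follows. This is a minimal illustration and not the importer's actual code: the helper names decompressArgs and extractBz2 are hypothetical, and the sketch relies only on the -d (decompress) and -c (write to stdout) flags, which both bzip2 and lbzip2 accept.

```javascript
const { spawn } = require('child_process');
const fs = require('fs');

// Arguments shared by bzip2 and lbzip2: decompress to stdout.
function decompressArgs(archivePath) {
  return ['-d', '-c', archivePath];
}

// Hypothetical helper: decompress `archivePath` into `outputPath` using
// `decompressor` ('lbzip2' or 'bzip2'). Resolves once the process has
// exited successfully and the output file has been fully written.
function extractBz2(decompressor, archivePath, outputPath) {
  return new Promise((resolve, reject) => {
    const child = spawn(decompressor, decompressArgs(archivePath), {
      stdio: ['ignore', 'pipe', 'inherit'],
    });
    const out = fs.createWriteStream(outputPath);
    let exitCode = null;
    let written = false;

    // Settle only when both the process and the file stream are done.
    const settle = () => {
      if (exitCode === null || !written) return;
      if (exitCode === 0) resolve(outputPath);
      else reject(new Error(`${decompressor} exited with code ${exitCode}`));
    };

    child.on('error', reject); // e.g. the binary is not installed
    out.on('error', reject);
    child.on('close', (code) => { exitCode = code === null ? 1 : code; settle(); });
    out.on('finish', () => { written = true; settle(); });
    child.stdout.pipe(out);
  });
}
```

A call such as extractBz2('lbzip2', 'whosonfirst-data-latest.db.bz2', 'whosonfirst-data-latest.db') would then spread the work across all cores; the archive file name here is assumed for illustration.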

Fixes #480