pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License

bzip2 to lbzip2 migration to use all CPU cores #480

Closed loadit1 closed 4 years ago

loadit1 commented 4 years ago

Hi!

Line 30 in sqlite_download.sh:

bunzip2 -f "${LOCAL_BZ2_PATH}" > "${LOCAL_DB_PATH}"

Line 53 in download_sqlite_all.js:

extract = bunzip2;

Would you be so kind as to advise me: why not migrate from bzip2 to lbzip2 to increase performance by using all CPU cores?

Thanks!

missinglink commented 4 years ago

Sounds reasonable, I'm not familiar with lbzip2.

The only negative I could see is that it uses more RAM but this is probably a decent trade-off considering the reduction in wall-clock decompression time.

https://vbtechsupport.com/1614/

Could you please open a PR and do some testing to ensure compatibility?

orangejulius commented 4 years ago

Yeah, agreed. Speeding up the process of decompressing massive WOF bzip2 archives sounds good to me :) I'd be happy to see something like this tested out.

Since I imagine lbzip2 is not commonly installed on most systems, perhaps we can use it only if it's present? We could add it to the Pelias docker images if it shows a significant performance benefit.
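In shell, that detection could be as simple as the following sketch (the function name is illustrative, not taken from the importer's scripts):

```shell
# Hypothetical helper: prefer lbzip2 when it is on the PATH,
# otherwise fall back to the stock bunzip2.
pick_decompressor() {
  if command -v lbzip2 >/dev/null 2>&1; then
    echo "lbzip2 -d"
  else
    echo "bunzip2"
  fi
}

# Usage, reading from stdin and writing to stdout:
#   $(pick_decompressor) < archive.bz2 > archive.db
```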

missinglink commented 4 years ago

Hmmm so according to that benchmark I linked lbzip2 used over 100x as much RAM as bzip2.

Definitely worth investigating a little bit more and possibly detecting/logging OOM errors on modest systems.

loadit1 commented 4 years ago

Hi!

The PR has just been created: #481

Summary of changes:

  1. If lbzip2 is installed on the system, we use it. If not, we fall back to the legacy bzip2.
  2. Updated Dockerfile to install lbzip2
  3. Additional dependency: command-exists

I tested the lbzip2 version on my machine with the command npm run download. Results are below:

Using lbzip2

542.68user 101.13system 1:49.16elapsed 589%CPU (0avgtext+0avgdata 210392maxresident)k 201176inputs+15375728outputs (1001major+1221746minor)pagefaults 0swaps

Using bzip2

526.88user 168.91system 3:01.95elapsed 382%CPU (0avgtext+0avgdata 48292maxresident)k 91608inputs+15375736outputs (275major+70098minor)pagefaults 0swaps

Frankly speaking, the speed results are not so different, simply because we already download in parallel and run several instances of bzip2 in parallel (see const simultaneousDownloads in download_data_all.js, for example). Memory consumption is only about 4 times higher for lbzip2 (200MB vs 50MB).
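For reference, the elapsed times above work out to roughly a 1.67x wall-clock speedup; a throwaway sketch to compute it:

```shell
# Convert an m:ss.cc elapsed string from time(1) (e.g. "3:01.95") to seconds.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# Ratio of two durations, rounded to two decimals.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'
}

# Elapsed times from the runs above: bzip2 3:01.95, lbzip2 1:49.16.
speedup "$(to_seconds 3:01.95)" "$(to_seconds 1:49.16)"   # → 1.67
```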

But the reason I started to investigate this is that PiP has a single hardcoded sqlite file, whosonfirst-data-latest.db, which extracts extremely slowly on one CPU core when not using lbzip2.

NickStallman commented 4 years ago

Seems like a reasonable trade-off to me. It's unlikely anyone is running this on an extremely low-memory machine.

@loadit1 how many CPU cores did you have for your test? I just did a test with my 16-core Threadripper test server with tons of RAM on whosonfirst-data-latest.db.bz2. bunzip2 took 10 min 43 sec; lbunzip2 took 1 min 16 sec.

Even with the parallel download, on the Threadripper the CPU was never pushed very hard with bunzip2 so it would make a significantly bigger difference on larger systems.

loadit1 commented 4 years ago

Hi @NickStallman

I don't remember the exact specs of the VM for my tests above, so I decided to test once again with two options:

  1. Single-file extract: whosonfirst-data-latest.db.bz2. This scenario is important when following the pelias/docker instructions, as during the pelias download all step the extraction of whosonfirst-data-latest.db.bz2 consumes a lot of time because bzip2 utilizes only one CPU core.
  2. A "real" scenario with pelias/whosonfirst downloading and extraction. This is not "clean" from the perspective of network-speed influence, and files are downloaded in parallel and then extracted in parallel by bzip2, so multicore systems should perform better here and show less difference in extraction time compared to the single-file extract.

Both options tested on Medium and Tiny VM setups.

HW specs of my host machine: Core i5-8250U, 24GB RAM, Samsung SSD 840 EVO. I use Hyper-V, so below are the specs of the Hyper-V VMs and their vCPUs.

Test 1: Medium VM setup. 8 vCPUs (100% host machine resource allocation), 16GB RAM, Ubuntu Server 19.10

Option 1. Single file whosonfirst-data-latest.db.bz2:

  Download time: 2min 32sec (to account for network influence in the pelias/whosonfirst test)
  bzip2 extract: 13min 30sec
  lbzip2 extract: 4min 0sec

Option 2. pelias/whosonfirst downloading and extraction:

  bzip2 version time: 3min 10sec
  lbzip2 version time: 1min 40sec

Test 2: Tiny VM setup. 2 vCPUs (25% host machine resource allocation), 2GB RAM, Ubuntu Server 19.10

Option 1. Single file whosonfirst-data-latest.db.bz2:

  Download time: 2min 36sec (to account for network influence in the pelias/whosonfirst test)
  bzip2 extract: 13min 42sec
  lbzip2 extract: 5min 33sec

Option 2. pelias/whosonfirst downloading and extraction:

  bzip2 version time: 4min 3sec
  lbzip2 version time: 4min 17sec

My test bash script:

#!/bin/bash
set -x

# This test was created to run on a freshly installed OS. Consider changing or removing the lines below if you already have these packages.
apt-get -y update
apt-get -y upgrade
apt-get -y install lbzip2 bzip2 curl
curl -sL https://deb.nodesource.com/setup_12.x | bash -
apt-get -y install nodejs
apt-get -y autoremove
apt-get -y autoclean

mkdir /tmp/test
cd /tmp/test

# Download whosonfirst-data-latest.db.bz2
SECONDS=0
wget -O whosonfirst-data-latest.db.bz2 https://dist.whosonfirst.org/sqlite/whosonfirst-data-latest.db.bz2
echo "Download time: $(($SECONDS / 3600))hrs $((($SECONDS / 60) % 60))min $(($SECONDS % 60))sec"

SECONDS=0
# test bzip2
bzip2 -dk whosonfirst-data-latest.db.bz2
echo "bzip2 extract: $(($SECONDS / 3600))hrs $((($SECONDS / 60) % 60))min $(($SECONDS % 60))sec"
rm -rf whosonfirst-data-latest.db

SECONDS=0
# test lbzip2
lbzip2 -dk whosonfirst-data-latest.db.bz2
echo "lbzip2 extract: $(($SECONDS / 3600))hrs $((($SECONDS / 60) % 60))min $(($SECONDS % 60))sec"
#remove both files
rm -rf whosonfirst-data-latest.*

#test bzip2 on real pelias/whosonfirst. 
git clone https://github.com/pelias/whosonfirst.git
cd /tmp/test/whosonfirst
npm install
SECONDS=0
npm run download
echo "bzip2 version time: $(($SECONDS / 3600))hrs $((($SECONDS / 60) % 60))min $(($SECONDS % 60))sec"

cd /tmp/test
rm -rf whosonfirst

#test lbzip2 on real pelias/whosonfirst
git clone -b lbzip2 https://github.com/loadit1/whosonfirst
cd /tmp/test/whosonfirst
npm install
SECONDS=0
npm run download
echo "lbzip2 version time: $(($SECONDS / 3600))hrs $((($SECONDS / 60) % 60))min $(($SECONDS % 60))sec"

cd /tmp
rm -rf test

You may use my bash test script to test other HW configs, even with less RAM, if smaller setups are widely used. It makes sense to review the other Pelias repositories and add lbzip2 support to their Dockerfiles and scripts, as using bzip2 and tar with the -j flag (without --use-compress-program=lbzip2) slows down extraction on multicore systems.
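As a sketch of what that tar change could look like (COMPRESSOR, the function names, and the file names are placeholders; set COMPRESSOR=bzip2 to fall back to the stock tool):

```shell
# tar -cjf / -xjf hard-wire the single-threaded bzip2;
# --use-compress-program swaps in the parallel implementation instead.
COMPRESSOR="${COMPRESSOR:-lbzip2}"

# Create archive $1 from directory $2.
pack()   { tar --use-compress-program="$COMPRESSOR" -cf "$1" "$2"; }
# Extract archive $1 into the current directory.
unpack() { tar --use-compress-program="$COMPRESSOR" -xf "$1"; }

# e.g.: pack data.tar.bz2 data/  and later  unpack data.tar.bz2
```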

missinglink commented 4 years ago

Thanks for the benchmarks, they really help to put my mind at ease!

Our docs say the absolute minimum RAM for Pelias is 8GB and we really recommend 16GB.

We just recently suffered a bug where running another program on a 64 core machine caused OOM errors due to each core potentially using >2GB RAM at peak, so I wanted to avoid that here.

It's a fairly recent issue since virtualization/containerization allows for uncommon CPU/RAM ratios these days :man_shrugging:

Looking at your tests it seems that a ratio of 1GB per CPU core is adequate to avoid OOM errors.
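That ratio could even be enforced by capping lbzip2's worker-thread count (its -n flag). A Linux-specific sketch, assuming the 1GB-per-thread rule of thumb from the benchmarks above:

```shell
# Cap lbzip2 worker threads at roughly one per GB of available RAM,
# so unusual CPU/RAM ratios don't trigger OOM kills.

# Smaller of two integers, never below 1.
cap_threads() {
  t=$(( $1 < $2 ? $1 : $2 ))
  [ "$t" -ge 1 ] && echo "$t" || echo 1
}

cores=$(nproc 2>/dev/null || echo 1)
# Available RAM in whole GB, read from /proc/meminfo (Linux only).
mem_gb=$(awk '/MemAvailable/ { printf "%d", $2 / 1048576 }' /proc/meminfo 2>/dev/null || echo 1)
threads=$(cap_threads "$cores" "$mem_gb")

# e.g.: lbzip2 -d -n "$threads" -k whosonfirst-data-latest.db.bz2
```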

I'm happy to merge this, thanks for taking the time to investigate :smiley:

missinglink commented 4 years ago

Thanks @loadit1, I have added you to our @pelias/contributors team which means you can create your own branches within the pelias org repos.

When you do that, please prefix the branch with your username, e.g. loadit1/name-of-branch; the advantage of this is that docker images will be automatically built for every version of your branch like this and one for the tip of the branch like this.

Thanks 🎉

loadit1 commented 4 years ago

@missinglink good to know, thank you!

orangejulius commented 4 years ago

Thanks @loadit1 for all the PRs and the extensive benchmarks :)

The faster extraction should really help!