mmd-osm opened this issue 5 years ago
Test 2019: 1365 OSM days processed in 2.5 days.
Compilation settings:

```sh
../src/configure --enable-lz4 CXXFLAGS="-Werror=implicit-function-declaration -D_FORTIFY_SOURCE=2 -fexceptions -fpie -Wl,-pie -fpic -shared -fstack-protector-strong -pipe -Wl,-z,defs -Wl,-z,now -Wl,-z,relro -flto=1 -fwhole-program -O2 -fopenmp -march=native -fno-omit-frame-pointer -g -ggdb -std=c++11 -I/home/user/overpass/src/third_party/libosmium/include -I/home/user/overpass/src/third_party/protozero/include" --prefix=/home/user/overpass.dest/ LDFLAGS="-lpthread -lbz2 -licuuc -licui18n -flto=1 -fwhole-program" --enable-fastcgi
```
```diff
diff ~/overpass/src/bin/apply_osc_to_db.sh apply_osc_to_db2.sh
92c92
< while [[ ( -s $REPLICATE_DIR/$REPLICATE_FILENAME.state.txt ) && ( $(($START + 1440)) -ge $(($TARGET)) ) && ( `du -m $TEMP_DIR | awk '{ print $1; }'` -le 512 ) ]];
---
> while [[ ( -s $REPLICATE_DIR/$REPLICATE_FILENAME.state.txt ) && ( $(($START + 10)) -ge $(($TARGET)) ) && ( `du -m $TEMP_DIR | awk '{ print $1; }'` -le 2048 ) ]];
97c97
< gunzip <$REPLICATE_DIR/$REPLICATE_FILENAME.osc.gz >$TEMP_DIR/$TARGET_FILE.osc
---
> cp $REPLICATE_DIR/$REPLICATE_FILENAME.osh.pbf $TEMP_DIR/$TARGET_FILE.osh.pbf
108c108
< ./update_from_dir --osc-dir=$1 --version=$DATA_VERSION $META --flush-size=0
---
> ./update_from_dir --osc-dir=$1 --version=$DATA_VERSION $META --flush-size=0 --parallel=8 --use-osmium
114c114
< ./update_from_dir --osc-dir=$1 --version=$DATA_VERSION $META --flush-size=0
---
> ./update_from_dir --osc-dir=$1 --version=$DATA_VERSION $META --flush-size=0 --parallel=8 --use-osmium
```
[chart] Daily diff sizes, 2012-2019.
[chart] Cumulative daily diff sizes.
Convert the daily diffs to PBF format, then use the same processing as above (see the conversion sketch below).
Sep 2012 - April 2019: 4 days (full attic)
Area creation (full): 3h 25m (with vector+binary search on file block index entries)
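Per the script diff above, the osmium-based update path copies `.osh.pbf` files instead of gunzipping `.osc.gz` diffs, so each daily diff needs to be converted first. Below is a minimal sketch of that conversion using libosmium (already on the include path in the compilation settings above); the file names are placeholders, and the `osmium cat` command-line tool can do the same job.

```cpp
// Sketch: copy a gzipped daily change file into a history PBF. libosmium
// picks the input/output formats from the file name suffixes; deleted
// objects keep their visible=false flag, which the .osh format preserves.
// Link against libz, libbz2 and libexpat; file names are placeholders.
#include <osmium/io/any_input.hpp>
#include <osmium/io/any_output.hpp>
#include <osmium/io/reader.hpp>
#include <osmium/io/writer.hpp>
#include <osmium/memory/buffer.hpp>

int main() {
    osmium::io::Reader reader{"2019-04-01.osc.gz"};
    osmium::io::Writer writer{"2019-04-01.osh.pbf",
                              osmium::io::overwrite::allow};
    while (osmium::memory::Buffer buffer = reader.read()) {
        writer(std::move(buffer));  // pass entities through unchanged
    }
    writer.close();
    reader.close();
}
```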
Originally posted here: https://listes.openstreetmap.fr/wws/arc/overpass/2016-05/msg00011.html
As part of the Overpass Performance Project 2016, improving the overall time to create a full attic db was one of the primary focus topics. If you have set up your own instance before, you probably used the existing clone files. That is still the recommended approach for most users. However, when switching to a different compression algorithm, or in case of bugs in the template db implementation, being able to quickly set up a full attic database from scratch is of paramount importance.
Unfortunately, there's very little documentation available on previous run times. Back in 2014, Roland set up a database covering roughly 700 days, which reportedly took less than 1 week. That didn't include compression at the time. For the current v0.7.52 zlib-compressed database, I couldn't find any figures at all. Some GitHub tickets suggest that the current rate of catching up using minutely diffs is about 30 times real time.
It's about time to dig a bit deeper.
Initial tests on the dev instance quickly turned out to be quite time-consuming, with an estimated total runtime of at least 6 weeks. After switching to a more powerful 8-core server with 32 GB memory and SSDs, initial tests with lz4 cut the time down to 13 days. Updates were processed using daily diffs rather than minutely diffs. That still seemed like quite a lot for 1340 days (= all changes since the license change in September 2012). Thanks to the fast SSDs, large parts of the processing were CPU bound; nevertheless, only one core was in use the whole time.
I decided to move dedicated parts of the database update logic to multi-threaded processing (based on C++11 standard mechanisms, no external libs). This affects solely those parts where 8 different files are each read from disk, decompressed, have the changes applied, and are then compressed and written back to disk. I also reorganized the database a few times via db cloning, mainly to cut down disk space. That brought the full attic db setup down to 8-8.5 days.
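Since each of the 8 files can be processed independently, the pattern boils down to one worker thread per file. Here is a minimal sketch using only C++11 standard facilities; `process_file` is a hypothetical stand-in for the actual per-file pipeline, not a function from the Overpass codebase.

```cpp
// Sketch: run the per-file update pipeline on one std::thread per file,
// then wait for all of them. process_file() is a hypothetical stand-in
// for the real read -> decompress -> apply changes -> compress -> write
// sequence.
#include <functional>
#include <string>
#include <thread>
#include <vector>

void process_file(const std::string& filename);  // hypothetical pipeline

void update_files_in_parallel(const std::vector<std::string>& files) {
    std::vector<std::thread> workers;
    workers.reserve(files.size());
    for (const auto& file : files) {
        workers.emplace_back(process_file, std::cref(file));
    }
    for (auto& worker : workers) {
        worker.join();  // block until every file has been rewritten
    }
}
```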
The next step was to increase the number of days handled in one update run. So far I had used update_database, but then switched to update_from_dir and apply_osc_to_db.sh. Usually that script is used to apply several minutely diffs in one go. Well, why not use the same mechanism to apply several days at once, permitting up to 4 GB of uncompressed change files? Depending on the data, this corresponds to 6-12 days' worth of OSM data. Running the update this way seemed to work quite well with 32 GB of main memory, although update_from_dir sometimes needed more than 20 GB. If you're short on main memory, this may not be an option.
Luckily, the total processing time dropped to just 4 days, corresponding to about 330 OSM days processed per day. This should be good enough for the time being.
Two additional points worth noting:

- I put lots of stats on the wiki page [1]. If I find some more time, I'll probably add further comments to that page.
- You can find the full attic db for lz4 on the dev instance [2]. The respective branch is mentioned on the wiki page as well.
Best, mmd
[1] https://wiki.openstreetmap.org/wiki/User:Mmd/Overpass_API/Performance_Project_2016/Full_Attic_DB_Setup
[2] http://dev.overpass-api.de/clone_lz4/