Alternate node store - Githubissues

cldellow commented 9 months ago

I'm slowly remembering how to write C++. This is what #571 probably should have been.

This PR:

extracts a NodeStore interface
renames the old NodeStore to BinarySearchNodeStore
adds SortedNodeStore, a store that exploits PBFs that have the Sort.Type_then_ID feature.

This allows the efficient construction of a store that arranges nodes in a three-level hierarchy:

Level 1 is groups: there are 256K groups
Level 2 is chunks: each group has 256 chunks
Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of only 2M -- the space required for the level 1 pointers. (And 2^35 nodes with fixed overhead of 4M, and so on.)

Groups and chunks store their data sparsely, e.g. if a group has 7 chunks, it only uses storage for 7 chunks.

The precise storage cost varies based on the underlying PBF. Nova Scotia uses ~8.61 bytes per node, Great Britain uses ~9.13 bytes per node.

Measurements for reading Great Britain, master:

peak memory 13,890MB
real 1m39.446s
user 20m21.833s
sys 0m53.539s

This branch:

peak memory 12,900MB
real 1m33.271s
user 18m41.343s
sys 0m58.368s

This store should also perform better in the face of swapping, as lookups will have much better locality than the binary search of the previous store.

It should also be straight-forward to use something like simdcomp to extend this to store delta-encoded, zigzag-encoded, bitpacked data, with each worker keeping a per-thread uncompressed cache of its recently used chunks.

SortedNodeStore was complex enough that I wanted to give it its own files vs ramming it into osm_store -- this entailed refactoring some existing files. I also moved a few things around so that the node stores could build without pulling in the boost geometry templates, as they have a significant impact on compile times. Sorry for the impact that the churn has on reviewability :(

Open questions:

1 - Can we require that everyone only uses PBFs that support Sort.Type_then_ID ? If yes, we can delete BinarySearchNodeStore in favour of SortedNodeStore. If not, I'll do some work to intelligently pick which one to use based on the first PBF we read.

2 - I feel like there must be a more elegant way to do the work distribution in read_pbf. Perhaps the change for ReadPhase::Nodes should just be the default for all phases?

3 - This adds libpopcnt.h, a library to do fast population counts of bitvectors. It's BSD-licensed. How would you like attribution handled? Looks like other libraries are just cited in Tilemaker's README?

cldellow commented 9 months ago

In particular, this code doesn't need to sort any of the mmaped memory, so I'd expect it to perform much better than the old NodeStore for datasets that don't fit in memory.

systemed commented 9 months ago

Wow - super interesting. I'll try and benchmark this tonight on some reasonably sized extracts.

systemed commented 9 months ago

This works really nicely for GB and Germany: it's a 6.5% and 5% (respectively) memory saving over master.

It currently segfaults for Europe using --store:

/usr/bin/time -v tilemaker --input /media/data3/planet/europe-latest.osm.pbf --output /media/data4/europe-omt.mbtiles --store /media/ssd/tmp/
[...]
Reading .shp glacier
Sorting loaded shapes
Reading .pbf /media/data3/planet/europe-latest.osm.pbf
Command terminated by signal 11 ways 0 relations 0

(I'll try with smaller areas using --store later. For now I'm benchmarking Europe with master!)

1 - Can we require that everyone only uses PBFs that support Sort.Type_then_ID ? If yes, we can delete BinarySearchNodeStore in favour of SortedNodeStore. If not, I'll do some work to intelligently pick which one to use based on the first PBF we read.

Experience is that people will try and throw all sorts of wild and wacky PBFs at tilemaker ;) so we should probably support both. But you should just be able to read the option flag from the PBF - PbfReader::ReadPbfFile already does that for LocationsOnWays.

2 - I feel like there must be a more elegant way to do the work distribution in read_pbf. Perhaps the change for ReadPhase::Nodes should just be the default for all phases?

Worth trying?

3 - This adds libpopcnt.h, a library to do fast population counts of bitvectors. It's BSD-licensed. How would you like attribution handled? Looks like other libraries are just cited in Tilemaker's README?

Yes - that should be fine!

cldellow commented 9 months ago

Oops, I'm embarrassed to admit I didn't even test --store, I assumed it would work for free by virtue of the mmap allocators. :(

Thank you for noticing! Will have a look at investigating + fixing that.

systemed commented 9 months ago

With europe-latest.osm.pbf (today's download from Geofabrik) and --store it now says:

terminate called after throwing an instance of 'std::runtime_error'
  what():  SortedNodeStore: scaledOffset too big

Let me know if there's anything that'd be helpful for me to do to debug!

cldellow commented 9 months ago

Oops, yes I see that also happens with the North America extract. I've pushed a fix. (This still isn't ready to merge but should be suitable for testing.)

cldellow commented 9 months ago

Ok! I think this is probably good now. The most recent commit buys a bit more memory savings - ~6 bytes/node or so for most extracts.

cldellow commented 9 months ago

This lays more groundwork for lazy geometries. Once this and refactor_geometries are merged into master, it should be possible to change OsmMemTiles to compute geometries on the fly and get a large memory savings.

Changes since my last comment:

extract a WayStore interface
provide a SortedWayStore that can be used in the same circumstances as SortedNodeStore
- stores node IDs compactly
- omits final node for closed ways, instead stores a bit to track that the way is closed
- packs bits 32..35 into a single nibble, rather than allocating an entire 4-byte integer
- special case of the above: if all nodes in the way have the same bits 32..35, stores them only once
- supports compressing the lower 32-bit values via streamvbyte
- adds minunit for unit testing
- simplistic, but useful for faster dev cycles, available via make test
- only available via Makefile, not via CMake, so doesn't run as part of CI
extract mmap_allocator.cpp so that unit tests don't have to drag in osm_store.cpp

SortedWayStore with --compress-ways stores way information with a cost between 2.3 bytes/node (North America) to 2.9 bytes/node (GB). master achieves about 8.1 bytes/node (plus overhead from std::deque that I'm too lazy to estimate).

In general, its impact is not very meaningful yet, as we currently only store those ways that are later used in relations, but it's an important step on the way to lazy geometries.

Timings and memory use for GB:

alternate-node-store: 13,000MB, 2m48s
alternate-node-store with --compress-nodes --compress-ways: 12,840MB, 2m55s
- I'm a little skeptical of my memory measurements. They show the resident memory of the process as reported by htop...but I know separately that --compress-nodes uses 600MB less memory, so it's confusing to me that the resident memory only decreased by 160MB. Ah, well.
master: 13,950MB, 3m1s

systemed commented 9 months ago

Looks great!

I've just merged refactor_geometries into master so we can iterate on that from now on.

cldellow commented 9 months ago

Great, I've merged master into this branch & will update my other PRs to target master.

New measurements:

master: 10,500 MB

real 2m29.892s
user 35m52.622s
sys 0m24.567s

this branch: 9,500 MB

real 2m14.724s
user 32m8.778s
sys 0m22.840s

this branch, --compress-nodes --compress-ways: 9,400 MB

real 2m17.028s
user 32m43.232s
sys 0m22.265s

cldellow commented 9 months ago

It looks like I'm now able to process North America on my 32GB laptop, which is great. Previously, even Canada was too much, and would cause me to OOM. I say "looks like", because I aborted at 66% done, after 188 minutes.

With --compress-nodes --compress-ways, it needed ~12GB of RAM. Presumably that's mainly the output objects + attributes, which can't spill to disk yet.

During the reading phase, I mostly kept my 16 cores pegged - there was the odd stutter due to running out of I/O.

During the writing phase, my cores were always pegged. Disk reads were mostly in the 5-25MB/sec range, with very occasional spikes to 300-600MB/sec.

The new z6 progress indicator did draw my eye to something fishy... I'll write up my observations in an issue before I forget.

systemed commented 9 months ago

I'll run a few benchmarks for Europe over the next couple of days. I have actually managed to run Europe in-memory with this branch but that peaked at 128GB usage which is probably a bit niche!

systemed commented 9 months ago

As per #595 we're down below 144GB RAM usage for Europe, so whatever I specify it'll be in-memory anyway.

With --compress-nodes --compress-ways it's 3h11m48, 127.3GB RAM used.
Without, it's 3hr10m23, 136GB.

The difference in execution time is sufficiently small, and the memory saving significant enough, I wonder if we should just default to --compress-nodes --compress-ways.

The generated map looks good from a quick scout around and free of geometry issues.

We're 20 minutes slower compared to when #595 was merged (2hr50). That's not the end of the world but I wonder what's causing it. I tried removing the backtracking from #602 but that didn't make any difference.

cldellow commented 9 months ago

I wonder if we should just default to --compress-nodes --compress-ways.

I can make that change and instead offer --no-compress-nodes and --no-compress-ways flags. I think the streamvbyte authors expect that the code should work on any CPU from 2010 or later, which is pretty broad.

We're 20 minutes slower

Hmm, two ideas:

is it possible that some of it is I/O on initial read?

It wouldn't explain the whole 20 minutes, but for me, if I haven't got the PBF entirely in my kernel buffers already, I see a substantial delay on initial startup while it reads through the PBF to find the blocks. For North America, it's ~1 min or so on my pretty fast SSD. You can generally rule this in/out by either forcing the PBF into buffers before running tilemaker (run time cat pbf > /dev/null until it's instantaneous) or forcing the PBF out of buffers by flushing them (on Linux, echo 3 | sudo tee /proc/sys/vm/drop_caches)

(Separately, I notice that for me, Tilemaker is slower at single-threaded reading of an unbuffered file than cat is - so there's probably some room to improve this.)

I vaguely recall your reference system might be somewhat older -- does its CPU support an efficient popcnt instruction?

I assume the slow down is from the node store. I would normally expect this store to be faster than the old store, because the old store does a binary search. Extrapolating from North America, I assume Europe has ~3.8B nodes, so you need approx 32 searches to find any given node. This store should be able to find the node by doing 2 indirections, but each indirection needs a popcnt. As I'm typing this out, I'm skeptical, because even a naive, non-optimized popcnt should be relatively fast.

Maybe a cheap test: do you see a similar slow down with GB? For me, this store was a little faster on GB than master.

cldellow commented 9 months ago

Although, if your CPU is an X5650 (mentioned https://github.com/systemed/tilemaker/issues/315#issue-994322040), I think it supports popcnt. You can confirm with grep popcnt /proc/cpuinfo on Linux.

systemed commented 9 months ago

It is indeed an X5650 and it looks like it does support popcnt. I'm not 100% sure the slowdown is necessarily a result of this branch - it's possible I've got confused between branches and it was another recent commit (it's also possible we've since obsoleted it!). I'll do a bit more digging but it might not be anything to worry about.

Running this branch on Europe with #607 and #608 (plus https://github.com/systemed/tilemaker/commit/0dbbe6ed380a41572661fb575d29d7e0e1ff0020 which speeds up .shp processing, and https://github.com/systemed/tilemaker/commit/4bd28f15ed35882a0df9aa67e35f23f5e3dba0f7 which speeds up geometry output):

wall clock 2:18:46 (down from 3:11:48 in previous comment, 5:54:49 before this whole batch of changes) 🎉
memory 130GB

systemed commented 9 months ago

Merged - thank you. Amazing gains.

We should probably move minunit.h into include/external/ and move sorted_way_store.test.cpp into a new test/ directory but that's not urgent.

systemed / tilemaker

Alternate node store #590