be able to render the planet with 32gb of RAM

cldellow commented 8 months ago

This PR lets Tilemaker build the planet on smaller machines.

On a Vultr 16-core, 32GB, 500GB SSD machine:

$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real    195m7.819s
user    2473m52.322s
sys     73m13.116s

Runtime for non-memory constrained boxes isn't affected, e.g. on a Hetzner 48-core, 192 GB machine:

$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles
real    65m20.082s
user    2570m33.530s
sys     41m15.420s

On a $ basis, if you're renting a machine to do the work, it's cheaper to use a bigger box. But for folks who need to use what they already have, this may be a useful PR.

The changes are a mix of using less memory, spilling more things to disk, and thrashing less when things are backed by disk.

Using less memory:

~1GB: extend --materialize-geometries to points -- points from Layer(...) can be looked up in the NodeStore. LayerAsCentroid(...) still needs the point store
~1.5GB: rejig AttributePair
- eliminate padding
- use a union for the string and float values
- replace std::string with PooledString
~4GB: use a custom container (AppendVector) rather than a vector of vectors for storing OutputObjects
- vector's grow-by-doubling behaviour results in some wasted memory. I initially tried to replace it with a deque, but deque's 512-byte allocation size results in poor locality on the disk
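
To illustrate the idea behind replacing a vector of vectors with a chunked container, here is a minimal sketch. It is not the AppendVector from this PR: the class name `ChunkedAppendVector`, the `ChunkSize` parameter and its default are all assumptions made for illustration. The point is only that growth happens one fixed-size chunk at a time, so unused capacity is bounded by a single chunk rather than by up to half of the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not tilemaker's AppendVector: an append-only
// container that grows in fixed-size chunks instead of doubling.
template <typename T, std::size_t ChunkSize = 8192>
class ChunkedAppendVector {
public:
    void push_back(const T& value) {
        if (count_ % ChunkSize == 0) {
            // No chunk yet, or the last chunk is full: allocate exactly one more.
            chunks_.emplace_back();
            chunks_.back().reserve(ChunkSize);
        }
        chunks_.back().push_back(value);
        ++count_;
    }

    T& operator[](std::size_t i) {
        assert(i < count_);
        return chunks_[i / ChunkSize][i % ChunkSize];
    }

    const T& operator[](std::size_t i) const {
        assert(i < count_);
        return chunks_[i / ChunkSize][i % ChunkSize];
    }

    std::size_t size() const { return count_; }

    // Elements' worth of memory reserved; at most ChunkSize - 1 slots are unused.
    std::size_t capacity() const { return chunks_.size() * ChunkSize; }

private:
    std::vector<std::vector<T>> chunks_;  // only this small outer vector doubles
    std::size_t count_ = 0;
};
```

A larger `ChunkSize` keeps consecutive elements contiguous, which is the property the 512-byte deque blocks lacked once the data was backed by disk.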

Spill more things to disk:

~12GB: the OutputObjects now spill to disk when --store is used

Thrash less:

materialize the list of low-zoom objects, so that we only scan the list of 1.3B output objects a single time, not 1,365 times
compute the set of tiles with objects simultaneously for all zooms, so that we only scan the list of 1.3B output objects a single time, not 15 times
when --shard-stores is set, split the NodeStore and WayStore into 7 stores that cover different parts of the globe
- the idea is to have roughly equal-sized splits in terms of nodes/ways/relations. I started with a best guess, then iterated a couple of times based on memory usage reported by the stores: https://geojson.io/#id=gist:cldellow/00d9d9d627494c522c31fc5a63909749
- in this mode, ReadPhase::Ways will run 7 times, populating a single WayStore on each pass. Only those ways whose first node is in the corresponding NodeStore get populated. Because nodes in ways generally are geographically near each other, we'll mostly be accessing a single NodeStore to process the way. That NodeStore fits into memory for the duration of the pass, avoiding disk I/O.
- ReadPhase::Relations behaves similarly, using the ID of the first way to decide whether to process the relation.
- when writing, since we group by z6 tile, we'll have long runs that use the same stores, which means we'll only need to do new disk I/O when the writer starts a new region
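
The multi-pass way reading described above can be sketched roughly as follows. This is a simplification under stated assumptions, not the code in this PR: `ShardBounds`, `pickShard` and `assignWaysInPasses` are hypothetical names, shards are modelled as plain bounding boxes with the last one acting as a catch-all, and ways carry coordinates directly rather than node IDs that would be resolved through a NodeStore.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LatLon { double lat; double lon; };

// Hypothetical shard description: a bounding box over part of the globe.
struct ShardBounds { double minLat, minLon, maxLat, maxLon; };

// Returns the index of the first shard containing p.
// Assumption: the last shard is a catch-all for everything else.
std::size_t pickShard(const std::vector<ShardBounds>& shards, LatLon p) {
    assert(!shards.empty());
    for (std::size_t i = 0; i + 1 < shards.size(); ++i) {
        const ShardBounds& b = shards[i];
        if (p.lat >= b.minLat && p.lat < b.maxLat &&
            p.lon >= b.minLon && p.lon < b.maxLon) {
            return i;
        }
    }
    return shards.size() - 1;
}

struct Way { std::uint64_t id; std::vector<LatLon> nodes; };

// One pass per shard. Each pass keeps only the ways whose FIRST node falls
// in that shard, so a pass mostly touches a single shard's node data.
std::vector<std::vector<std::uint64_t>> assignWaysInPasses(
        const std::vector<ShardBounds>& shards, const std::vector<Way>& ways) {
    std::vector<std::vector<std::uint64_t>> wayStores(shards.size());
    for (std::size_t pass = 0; pass < shards.size(); ++pass) {
        for (const Way& way : ways) {
            if (way.nodes.empty()) continue;  // nothing to locate the way by
            if (pickShard(shards, way.nodes.front()) == pass) {
                wayStores[pass].push_back(way.id);
            }
        }
    }
    return wayStores;
}
```

The input is scanned once per shard, trading extra reads of the ways for a working set that stays within one shard per pass.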

Potential future improvements:

We still need ~14GB of RAM to read everything. It might be worthwhile to try to account for all of it, since 14GB feels excessive. Possible culprits: protobuf reader (~160MB/core, I think), attribute store and friends, the r-tree index for large items.
The sharding is tuned for the planet on a 32GB box. Being able to dynamically pick the shards based on the bounding box and actual memory available could be useful.
The runtime benefit of multiple passes for relations is thwarted a bit by straggler relations that take an abnormally long time to process (Antarctica, Hudson Bay, etc). If we could cheaply identify the blocks that have such relations, we could start processing them earlier in the hopes that they'd be done by the time we were done with the other relations.
- ...actually, maybe it's the boost thread pool more generally? You see a similar effect when reading ways. I don't know how it works under the covers - maybe threads grab a batch of tasks at once, and you end up with a single thread hoarding some work items while the rest of the threads starve. If that's the case, a task-stealing approach might get better utilization towards the end of the work queue.

These are mostly smaller issues that can be happily ignored forever, just wanted to write them down so I can forget about them.

systemed commented 8 months ago

That's really impressive - thank you!

I haven't had the chance to go through all the source yet but the results look very impressive - I ran my usual Europe extract through it (with shapefiles), and memory consumption was 8GB when reading the .pbf, going up to 9GB when generating tiles. Total time 2hr13. I'll have a go at the planet tomorrow.

systemed commented 8 months ago

Using the old (mid-2021) planet I've run previous tests with, and including shapefiles, memory consumption was 18.2GB - which is amazing. Total time 5hr39. (Before this PR it was 5hr12 and 40.2GB.)

Comparing with Europe, that suggests a very rough estimated RAM requirement of one-third the .osm.pbf size.

systemed commented 8 months ago

Played with this a bit more today and still impressed. Also thanks for the copious comments which help me to understand what's going on!

I think the only suggestion I'd make is that we now have a fairly broad array of performance options (--no-compress-nodes, --no-compress-ways, --materialize-geometries, --shard-stores, plus of course --store and --compact have performance implications). I suspect most users won't understand which to pick.

I guess there are three common scenarios:

Small extract (do everything in memory)
Planet or large extract on expensive hardware (use store and optimise for run-time)
Constrained hardware (use store and optimise for RAM consumption)

These could perhaps be represented by the following run-time options:

(no flags specified)
--store /path/to/ssd --fast (equivalent of --materialize-geometries on, --shard-stores off)
--store /path/to/ssd (equivalent of --materialize-geometries off, --shard-stores on)

We can then simply tell people "if you have lots of memory and are working with a big extract, use the --fast option".

We can still retain the granular controls, but maybe put them in a separate "performance tuning" option group.

cldellow commented 7 months ago

Yes, good call on the flags and de-emphasizing the individual knobs. I'll make that change.

cldellow commented 7 months ago

Hopefully you ignored the noise of my commits during Christmas! :) Please don't feel any urgency to do anything with this or the other PRs I'll open this week -- this is just my version of tinkering with trains in the basement over the holidays.

Since my last comment:

I implemented the memory saving idea in https://github.com/systemed/tilemaker/issues/622#issuecomment-1866813888
reduced number of passes to 6 when running in memory-constrained mode
refactored options parsing and implemented the spirit of your comment in https://github.com/systemed/tilemaker/pull/618#issuecomment-1866692053

I did some benchmarking [1] and observed that the logic should maybe be:

default to everything in memory, materialized geometries
- but let a user override with --lazy-geometries, e.g. in the case where lazy geometries is enough to let you avoid needing --store
if --store is passed, default to lazy geometries
- but let a user override with --materialize-geometries if they have really, really fast SSDs
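
The proposed defaulting logic can be written out as a small decision function. This is a sketch only: the `GeometryFlags` struct and `shouldMaterialize` are hypothetical names rather than the options-parsing code in this branch, and the choice to let --lazy-geometries win when both overrides are passed is an assumption, since that case isn't specified above.

```cpp
// Hypothetical view of the parsed command line; not tilemaker's real options struct.
struct GeometryFlags {
    bool storeGiven = false;             // --store was passed
    bool lazyGeometries = false;         // --lazy-geometries was passed
    bool materializeGeometries = false;  // --materialize-geometries was passed
};

// Returns true if geometries should be materialized.
bool shouldMaterialize(const GeometryFlags& f) {
    // Explicit overrides come first. Assumption: lazy wins if both are given.
    if (f.lazyGeometries) return false;
    if (f.materializeGeometries) return true;
    // No override: materialize when everything is in memory, go lazy with --store.
    return !f.storeGiven;
}
```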

The --help after this commit:

tilemaker v2.4.0
Convert OpenStreetMap .pbf files into vector tiles

Available options:
  --help                       show help message
  --input arg                  source .osm.pbf file
  --output arg                 target directory or .mbtiles/.pmtiles file
  --bbox arg                   bounding box to use if input file does not have 
                               a bbox header set, example: 
                               minlon,minlat,maxlon,maxlat
  --merge                      merge with existing .mbtiles (overwrites 
                               otherwise)
  --config arg (=config.json)  config JSON file
  --process arg (=process.lua) tag-processing Lua file
  --verbose                    verbose error output
  --skip-integrity             don't enforce way/node integrity
  --log-tile-timings           log how long each tile takes

Performance options:
  --store arg                  temporary storage for node/ways/relations data
  --fast                       prefer speed at the expense of memory
  --compact                    use faster data structure for node lookups
                               NOTE: This requires the input to be renumbered 
                               (osmium renumber)
  --no-compress-nodes          store nodes uncompressed
  --no-compress-ways           store ways uncompressed
  --lazy-geometries            generate geometries from the OSM stores; uses 
                               less memory
  --materialize-geometries     materialize geometries; uses more memory
  --shard-stores               use an alternate reading/writing strategy for 
                               low-memory machines
  --threads arg (=0)           number of threads (automatically detected if 0)

[1]: Details in https://github.com/systemed/tilemaker/pull/618/commits/657da1ab92fcf65de3f5adafcceddc064ef5e73d - it wasn't quite this branch, it was this branch + protobuf + lua-interop

systemed commented 7 months ago

All working really well! Ready to merge, do you think?

Running this PR with Great Britain on my usual box:

/usr/bin/time -v tilemaker --input /media/data1/planet/great-britain-latest.osm.pbf --output ~/tm_debug/gb5.mbtiles
    Elapsed (wall clock) time (h:mm:ss or m:ss): 4:59.99
    Maximum resident set size (kbytes): 12275684

/usr/bin/time -v tilemaker --input /media/data1/planet/great-britain-latest.osm.pbf --output ~/tm_debug/gb4.mbtiles --lazy-geometries
    Elapsed (wall clock) time (h:mm:ss or m:ss): 5:16.00
    Maximum resident set size (kbytes): 9155756

It's a big memory saving (25%) for a small time penalty (5%) - so maybe we should default to --lazy-geometries, both for in-memory and --store. But I realise one could probably bikeshed this all day. :)

cldellow commented 7 months ago

Yup, merge away.

I have no strong views on the defaults--let me know if you'd like them changed.

systemed commented 7 months ago

Merged. Thank you again - this is going to make a massive difference to users.

I'll do some experimenting with the defaults before we release 3.0 but it's not crazily urgent.
