systemed / tilemaker

Make OpenStreetMap vector tiles without the stack
https://tilemaker.org/

some memory and concurrency improvements #612

Closed cldellow closed 8 months ago

cldellow commented 8 months ago

This PR has a few things:

The memory reduction comes from leaving most ways in the WayStore and generating the linestring or polygon on demand at writing time. There are a few tricks to mitigate the cost of rebuilding the geometry -- we only generate on demand those ways that didn't need to be corrected, and each thread keeps a thread-local cache of linestrings that it recently constructed.
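
To illustrate the shape of the idea, here's a minimal sketch (the names and the cache bound are illustrative only, not the actual WayStore/OutputObject code): the linestring is rebuilt from the stored nodes when a tile is written, and each worker thread keeps its own small cache, so repeat lookups are cheap and need no locking.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { double x = 0, y = 0; };
using Linestring = std::vector<Point>;

// Stand-in for a WayStore lookup: returns the node coordinates of a way.
// (Dummy body so the sketch compiles; the real store decodes its packed data.)
Linestring lookupWayNodes(uint64_t wayId) {
    double t = double(wayId % 100);
    return { { t, 0.0 }, { t, 1.0 } };
}

// Build the linestring for a way on demand, consulting a thread-local cache.
const Linestring& getOrBuildLinestring(uint64_t wayId) {
    // One cache per worker thread: no locks, and recently written ways stay hot.
    thread_local std::unordered_map<uint64_t, Linestring> cache;

    auto it = cache.find(wayId);
    if (it != cache.end())
        return it->second;

    // Crude bound for the sketch: wipe the cache when it grows too large.
    if (cache.size() > 10000)
        cache.clear();

    return cache.emplace(wayId, lookupWayNodes(wayId)).first->second;
}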

I explored extending this to relations as well, but it seemed like there was much less opportunity with relations: each individual relation is bigger, so reconstructing it is more costly; they're more often in need of correction; and in absolute terms, they're not as big of a memory hog as all the ways. My bag of tricks wasn't able to mitigate the runtime impact. There might be something there, but I'm going to admit defeat for now.

GB, master:

9,682MB RAM

SortedWayStore: 17036 groups, 116656 chunks, 381137 ways, 16588528 nodes, 49565026 bytes
Generated points: 6477007, lines: 7443584, polygons: 15168555

real    2m19.527s
user    31m44.574s
sys 0m25.943s

GB, this PR:

6,484MB RAM

SortedWayStore: 18614 groups, 1636695 chunks, 17554284 ways, 160743637 nodes, 650662740 bytes
Generated points: 6477007, lines: 1846, polygons: 5344375

real 2m20.764s
user 33m1.989s
sys 0m19.444s

GB, this PR with --materialize-geometries:

9,600MB RAM

SortedWayStore: 17036 groups, 116656 chunks, 381137 ways, 16588528 nodes, 49565026 bytes
Generated points: 6477007, lines: 7443584, polygons: 15168555

real    2m11.685s
user    31m1.558s
sys 0m18.525s

planet-latest, this PR on a 48-core Hetzner CCX63:

SortedNodeStore: 173473 groups, 41167239 chunks, 8773657450 nodes, 44932630766 bytes
SortedWayStore: 18746 groups, 4571385 chunks, 714744586 ways, 7912084618 nodes, 21576308504 bytes
Generated points: 204334361, lines: 56549, polygons: 229203193

real    67m51.201s
user    2717m54.920s
sys 36m53.642s

tilemaker used 42GB of RAM and 249GB of virtual memory to generate the planet -- so still a bit out of reach for people on smaller machines.

systemed commented 8 months ago

That's superb!

With North America I get an even bigger memory saving - down from 20.1GB to 13.2GB (using --store and with shapefiles). I think about 4GB of that is ClipCache, so there may be some opportunity to optimise further, e.g. by using an LRU cache as you mention in the comments.

systemed commented 8 months ago

On my usual machine, 40.2GB (with --store) for the full planet including the poles, in 5hr 12m 22s. Really happy with that.

It was sitting on one thread for about 10-15 minutes at the end, which I'm guessing was crunching through a particularly complex set of geometries (very possibly at the poles).

cldellow commented 8 months ago

> I think about 4GB of that is ClipCache

Oh, wow. I hadn't thought too critically about this, but I didn't expect it to use quite that much memory.

The cache size does increase with more cores, so perhaps this is expected - for 24 cores, 4 GB is "only" about 170MB/core.

I don't often test with the shapefiles; I should probably start doing that, especially as I think the runtime overhead is much smaller now (and especially after #614 lands).

> so there may be some opportunity to optimise further, e.g. by using an LRU cache as you mention in the comments.

I imagined the LRU cache would help with runtime, but not with memory usage. The benefit I was hoping for is that we'd more often have relevant things in the cache, but I wasn't planning on shrinking the cache.

The cache might still be oversized, though. I didn't really experiment at all to find where the point of diminishing returns started; I just picked a number, observed an improvement and went with it.
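
For what it's worth, a generic sketch of the kind of size-bounded LRU cache being discussed -- purely illustrative, not the ClipCache itself: a hit moves the entry to the front of a list, and inserting past the capacity evicts whatever is at the back (the least recently used entry).

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename K, typename V>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    V* get(const K& key) {
        auto it = index.find(key);
        if (it == index.end()) return nullptr;
        // Move the hit to the front: it is now the most recently used.
        order.splice(order.begin(), order, it->second);
        return &it->second->second;
    }

    void put(const K& key, V value) {
        if (V* existing = get(key)) { *existing = std::move(value); return; }
        order.emplace_front(key, std::move(value));
        index[key] = order.begin();
        if (index.size() > capacity) {
            // Evict the least recently used entry.
            index.erase(order.back().first);
            order.pop_back();
        }
    }

private:
    std::size_t capacity;
    std::list<std::pair<K, V>> order;  // most recently used at the front
    std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index;
};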

> It was sitting on one thread for about 10-15 minutes at the end

I noticed this too, and GB had a similar straggler on the Hetzner box (Ubuntu 22.04).

Interestingly, the GB straggler doesn't appear when I run locally (Ubuntu 20.04) or when I ran on the Hetzner box prior to the #611 change.

So I think I concluded that something in #611 causes the slowdown, but only on more modern Boost versions than the one in Ubuntu 20.04.

According to https://packages.ubuntu.com/search?keywords=libboost-, 20.04 has 1.71 and 22.04 has 1.74.

Nothing in Boost's changelog jumps out at me, though.

systemed commented 8 months ago

> The cache might still be oversized, though. I didn't really experiment at all to find where the point of diminishing returns started; I just picked a number, observed an improvement and went with it.

I did have a very brief attempt at downsizing the cache and being a bit pickier about what we put in it (e.g. no simple polygons with <20 vertices), but it removed the performance gains. There may still be possibilities along those lines, though.

> So I think I concluded that something in https://github.com/systemed/tilemaker/pull/611 causes the slowdown, but only on more modern Boost versions than the one in Ubuntu 20.04.

In order of likelihood:

systemed commented 8 months ago

This looks good to merge to me - ok with that?

cldellow commented 8 months ago

Yup!

systemed commented 8 months ago

Excellent - thank you again!

systemed commented 7 months ago

Quick issue you might be able to help me with...

I've been playing around with a branch that can load GeoJSON files in the same way that we use shapefiles. (GeoJSON is easy to generate, and I quite often find myself writing little Ruby scripts that turn some data or other into GeoJSON.) At the same time, I've moved the shapefile code into its own class. It's at https://github.com/systemed/tilemaker/compare/master...geojson
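
(For context, a minimal sketch of the sort of GeoJSON read involved, assuming RapidJSON as the branch's comments suggest -- the function name and the layer handling are illustrative, not the branch's actual code.)

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include "rapidjson/document.h"

void readGeoJSONSketch(const std::string& filename) {
    std::ifstream in(filename);
    std::string json((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    rapidjson::Document doc;
    doc.Parse(json.c_str());
    if (doc.HasParseError() || !doc.HasMember("features")) {
        std::cerr << "Invalid GeoJSON: " << filename << std::endl;
        return;
    }

    for (auto& feature : doc["features"].GetArray()) {
        const auto& geometry = feature["geometry"];
        std::string type = geometry["type"].GetString();
        // Dispatch on type (Point / LineString / Polygon / Multi*) and store
        // the coordinates against the layer here.
    }
}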

On shutdown, after the mbtiles/pmtiles file has been written, it terminates (on Mac) with:

Filled the tileset with good things at /Users/richard/Maps/planet/oxfordshire.pmtiles
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

I've run valgrind over it and that shows:

==57979== Warning: set address range perms: large range [0x11fc54028, 0x12fc54058) (noaccess)
==57979== Invalid read of size 8
==57979==    at 0x100F41137: pthread_mutex_lock (in /usr/lib/system/libsystem_pthread.dylib)
==57979==    by 0x100237AAF: LeasedStore<std::__1::vector<boost::geometry::model::multi_linestring<boost::geometry::model::linestring<boost::geometry::model::d2::point_xy<double, boost::geometry::cs::cartesian>, std::__1::vector, mmap_allocator>, std::__1::vector, mmap_allocator>, std::__1::allocator<boost::geometry::model::multi_linestring<boost::geometry::model::linestring<boost::geometry::model::d2::point_xy<double, boost::geometry::cs::cartesian>, std::__1::vector, mmap_allocator>, std::__1::vector, mmap_allocator> > > >::~LeasedStore() (in /usr/local/bin/tilemaker)
==57979==    by 0x100C580D1: tlv_finalize (in /usr/lib/system/libdyld.dylib)
==57979==    by 0x100D3B6AB: exit (in /usr/lib/system/libsystem_c.dylib)
==57979==    by 0x100C583DB: start (in /usr/lib/system/libdyld.dylib)
==57979==    by 0x2: ???
==57979==    by 0x104F018FE: ???
==57979==    by 0x104F01908: ???
==57979==    by 0x104F0193E: ???
==57979==  Address 0x104ef0e18 is on thread 1's stack
==57979==  67808 bytes below stack pointer
[same error two more times]

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
==57979== 
==57979== Process terminating with default action of signal 6 (SIGABRT)
==57979==    at 0x100EF0D46: __pthread_sigmask (in /usr/lib/system/libsystem_kernel.dylib)
[etc.]

So I think it's something to do with what you've described here:

> * remove locks from geometry stores
>
> Rather than locking on each store call, threads lease a range of the
> ID space for points/lines/multilines/polygons. When the thread ends,
> it returns the lease.
>
> This has some implications:
>
> - the IDs are no longer guaranteed to be contiguous
>
> - shapefiles are a bit weird, as they're loaded on the main
>   thread -- so their lease won't be returned until program
>   end. This is fine, just pointing it out.

...and that I've missed something obvious. I can't see what that is (but then it is 1.30am). Any thoughts?

cldellow commented 7 months ago

I gave that a quick whirl and didn't repro a crash (granted, on Linux). If it's easy to replicate the command you're running, I can try that.

Speculation: the LeasedStore destructor is running after the ShpMemTiles destructor.

In the OSM case, the LeasedStore destructor runs when the threads for the various ReadPhases are terminated.

In this case, if the reading of GeoJSON and SHP is happening on the main tilemaker thread, it'll run when the thread ends -- which might be after ShpMemTiles has been destroyed.
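
A minimal sketch of that suspected ordering (illustrative names, a guess at the mechanism rather than the actual LeasedStore/ShpMemTiles code): on a worker thread the thread-local's destructor returns the lease when the thread exits, but on the main thread it only runs at process teardown, after the store it points at may already have been destroyed -- which would produce exactly a "mutex lock failed" style error.

#include <iostream>
#include <mutex>
#include <thread>

struct GeometryStore {
    std::mutex m;
    void returnLease() {
        std::lock_guard<std::mutex> lock(m);  // UB if the store is already gone
        std::cout << "lease returned\n";
    }
};

GeometryStore* activeStore = nullptr;  // points at a store owned elsewhere

struct Lease {
    // Dangles if the store was destroyed before this thread-local is torn down.
    ~Lease() { if (activeStore) activeStore->returnLease(); }
};

void loadSomething() {
    // First use creates the thread-local; its destructor runs at thread exit.
    thread_local Lease lease;
    // ... append geometries via the lease ...
}

int main() {
    GeometryStore store;           // owned by main, destroyed before main returns
    activeStore = &store;

    std::thread t(loadSomething);  // worker thread: lease returned when t exits
    t.join();

    loadSomething();               // main thread: lease returned only at process
                                   // teardown, after `store` has been destroyed
}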

The overall design is a bit fragile/footgun-y at the moment.

If that's the case, possible fixes could include:

cldellow commented 7 months ago

Ah, I repro a different error if I try that branch with some random geojson file:

Reading GeoJSON geojson
Reading shapefile urban_areas
terminate called after throwing an instance of 'std::runtime_error'
  what():  fatal: no available stores to lease

If I then wrap https://github.com/systemed/tilemaker/compare/systemed:ae1981b...systemed:a1976d2# in the world's most pointless thread pool:

void GeoJSONProcessor::read(class LayerDef &layer, uint layerNum) {
    boost::asio::thread_pool pool(1);
    boost::asio::post(pool, [&]() {
        // Parse the JSON file into a RapidJSON document
        ...
    });
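    // join() lets the pool's only thread finish and exit, which runs its
    // thread-local destructors and returns the store lease.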
    pool.join();
}

it runs as expected.

I think the precise behaviour you'd get would depend on the order in which your shapefiles or GeoJSON files are processed (which I guess depends on your layer order).

So I think that confirms that the leases either need to be manually relinquished (if running on the same thread), or the reading needs to happen in a separate thread to ensure the leases get returned.

systemed commented 7 months ago

Thanks! That fixes it - I've used the same thread pool approach as with shapefiles and it's fine now. Will do a bit more testing.