systemed / tilemaker

Make OpenStreetMap vector tiles without the stack
https://tilemaker.org/
Other
1.42k stars 228 forks source link

use protozero #623

Closed cldellow closed 7 months ago

cldellow commented 7 months ago

This is meant to be merged after https://github.com/systemed/tilemaker/pull/618. Until then, you can see the protozero-specific diffs here: https://github.com/cldellow/tilemaker/compare/planet-on-32gb...cldellow:protozero?expand=1

As mentioned in that PR, I'm just queuing up the PRs - no urgency to review/merge them.

Use protozero instead of libprotobuf for reading OSM PBFs. We still use libprotobuf for reading/write vector tiles.

PbfReader exposes an API very similar to the one produced by libprotobuf for osmformat.proto. There are some differences:

  1. Better names than libprotobuf's mechanically produced ones, and exposed via range-based for loops, e.g:

for (auto& group : block.groups()) {
  for (auto& node : group.nodes()) { ... }
  for (auto& way : group.ways()) { ... }
  for (auto& rel : group.relations()) { ... }
}
  1. We avoid memory copies, which means lifetime of objects returned from PbfReader is limited. You cannot nest iteration, so code such as this is invalid:
for (auto& node1: group.nodes()) {
  for (auto& node2: group.nodes()) {
    // compare node1 and node2
  }
}
  1. PbfReader handles delta-decoding, so read_pbf's methods no longer need to maintain the running sums.

Also some other, unrelated things I noticed:

Memory use: I had hoped the memory savings would scale with the input PBF, but it appears to be more like a fixed savings of 250MB. Ah, well.

Timings to read the PBF, planet-on-32gb vs this branch:

france, planet-on-32gb: 2m46s vs 2m28s (-11%) great-britain, planet-on-32gb: 58s vs 52s (-10%)

Questions:

  1. I used PbfReader as the class name for the low-level PBF reader that sits on top of protozero. After implementing it and getting its tests passing, I tried to build the whole project...and realized that read_pbf.cpp uses this name already.

I just renamed read_pbf's class to PbfProcessor, in an attempt to capture that it's a bit higher-level. I'm happy to rename pbf_reader's class, though.

Any preferences? My default would be to keep the names as they are, and rename read_pbf.{h,cpp} to pbf_processor.{h,cpp}

systemed commented 7 months ago

Looks excellent (and the code is neater than before)!

I just renamed read_pbf's class to PbfProcessor, in an attempt to capture that it's a bit higher-level. I'm happy to rename pbf_reader's class, though. Any preferences? My default would be to keep the names as they are, and rename read_pbf.{h,cpp} to pbf_processor.{h,cpp}

100% happy with PbfProcessor.

The NA case is heavily affected by the Hudson Bay relation being a straggler.

I'm wondering about this. Is this relation 9441240? None of the tags match (we no longer look for natural=bay since #611) so we shouldn't be assembling the geometry for Hudson Bay. But I've not traced it through.

cldellow commented 7 months ago

100% happy with PbfProcessor.

Done! (...now I see that this loses the symmetry between read_pbf and read_shp. Hmmm.)

The NA case is heavily affected by the Hudson Bay relation being a straggler.

I'm wondering about this. Is this relation 9441240? None of the tags match (we no longer look for natural=bay since https://github.com/systemed/tilemaker/pull/611) so we shouldn't be assembling the geometry for Hudson Bay. But I've not traced it through.

Oops, this was an assumption & error on my part. You're absolutely right--it's not Hudson Bay, it's Lake Huron (1205151). In another branch, I dug a bit further.

Lake Huron is the long pole for relatons - 120sec. The whole read phase takes only 137s, so it really hurts us.

The thrust of my experiments in that branch were to find out if we could save time by starting Lake Huron (and other slow relations) earlier--we control the order in which we iterate the blocks, after all. But at 120 seconds, it dominates the next-slowest relation (Lake Superior, 4039486) by a factor of like 10x. Thus, even if you immediately start processing Lake Huron in ReadPhase::Relations, you'll still be spending the majority of the phase waiting on a single core to finish Lake Huron.

I'm sure there's something that could be done there (optimize correction code? be able to parallelize correction at the per-relation level? ship a corrected Lake Huron?), but I was hoping for something relatively easy with no messy implications, so I gave up. :)

I suspect the most likely next source of perf wins will be vtzero and figuring out a way to reuse simplification results.

systemed commented 7 months ago

I'm sure there's something that could be done there (optimize correction code? be able to parallelize correction at the per-relation level? ship a corrected Lake Huron?), but I was hoping for something relatively easy with no messy implications, so I gave up. :)

Geometry algos like that are strictly "here be dragons" territory. ;) It would be interesting to profile where it's spending most of its time - but I'd not expect any quick wins.

(One possibility for the scenario where you're running tilemaker repeatedly over the same area, and there are complex but rarely-changing geometries like lakes, could be to cache corrected polygons between runs. But that's a lot of over-complication for a pretty niche use case!)

cldellow commented 7 months ago

could be to cache corrected polygons between runs

That's a little reminiscent of https://github.com/systemed/tilemaker/discussions/588. I agree, it's a big hammer for a relatively minor nit, and likely isn't a good trade off in developer and end-user complexity. I probably overweight the impact of these things due to living in Ontario, which has several bonkers geometries that take > 10 seconds to process. (Contrast the UK, where I think the worst relation is the UK boundary itself, which takes 1 second.)

Geometry algos like that are strictly "here be dragons" territory. ;) It would be interesting to profile where it's spending most of its time - but I'd not expect any quick wins.

I've poked around a little bit. It is indeed very daunting!

I loosely group the opportunities into three areas:

  1. Optimize the simplification process itself, maintaining the current result

This is very daunting to me. It resides at the nexus of knowledge of boost::geometry, spatial indexing strategies, insights about geometry generally, and historic issues with simplification (e.g. issues like #428 and #602). It is a field of rakes waiting to be stepped on.

  1. Apply C++ tricks

The stacks often show that time is spent in allocation/deallocation. Simplification seems to create many temporary objects that get discarded almost immediately. If we made boost use a bump allocator that re-used a fixed, long-lived buffer, I think we'd see some runtime improvement.

Given that the boost geometries work happily with the mmap allocator, I believe this ought to be possible. I think the impact on the code could be quite small--a few lines of code, mostly outside of the simplify code. It "just" needs someone who knows what they're doing when it comes to custom allocators, which isn't yet me.

  1. Cheat

When the problem in front of me is hard, I often try to ignore it and solve a different problem. Here, I think an approach like this could work:

A. For high zooms, write unsimplified geometries as usual. B. For zooms where simplification is required, operate on a minimally pre-simplified geometry (say, a geometry that had been simplified to 0.0001).

The spirit of the idea is that we currently repeatedly simplify the full-fidelity base geometry to generate the low-zoom geometry. Most of the cost is in throwing out the first 80% of the points.

But once you're at, say, z10, maybe it's fine to use a lower-fidelity geometry as your input for simplification.

I did a little prototyping of this idea in https://github.com/cldellow/tilemaker/commit/146b692d238a786b3b0822b2c194a86577627536.

The prototyping was just to test the idea, not yet to integrate it into tilemaker. It seemed to show a dramatic improvement in runtime. It does result in different geometries. They seem similar to the geometries that would have been generated using the current approach. Still, they're different, so it's not yet clear to me if this would be acceptable.

It also means that the geometry gets simplified before being converted to tile coordinates, so the issue fixed in #428 might crop up again.

cldellow commented 7 months ago

Oops, I somehow veered from geometry correction to geometry simplification in that comment.

That comment still stands, but is not relevant to the current discussion. :) (Unless, perhaps, the initial simplification pass could happen in the same pass as the correction pass. This would give the other cores something to do when they would otherwise be sitting idle.)

systemed commented 7 months ago

Another great one - thank you. Duly merged!

On the (side) issue of performance with large geometries:

  1. Optimize the simplification process itself, maintaining the current result This is very daunting to me. It resides at the nexus of knowledge of boost::geometry, spatial indexing strategies, insights about geometry generally, and historic issues with simplification (e.g. issues like https://github.com/systemed/tilemaker/pull/428 and https://github.com/systemed/tilemaker/pull/602). It is a field of rakes waiting to be stepped on.

Yep. I absolutely wouldn't suggest getting deep into the algorithms. There are a few places in correct.hpp where I suspect we may be spending a lot of time: the loop in dissolve_generate_rings, and I have historic suspicions about the most efficient way to call boost::geometry::union_ in result_combine. But even if we do look at this - and it's not an immediate priority - then I'd suggest concentrating on any aspects of the implementation that can be fine-tuned, rather than the algorithms themselves.

  1. Apply C++ tricks The stacks often show that time is spent in allocation/deallocation. Simplification seems to create many temporary objects that get discarded almost immediately. If we made boost use a bump allocator that re-used a fixed, long-lived buffer, I think we'd see some runtime improvement.

👍

  1. Cheat [...] I did a little prototyping of this idea in https://github.com/cldellow/tilemaker/commit/146b692d238a786b3b0822b2c194a86577627536. Oops, I somehow veered from geometry correction to geometry simplification in that comment.

Interesting nonetheless! Broadly I think the principle of simplifying already-simplified geometries is sound: yes, they'll differ slightly in output, but the overall fidelity to the original should be similar.