osmcode / libosmium

Fast and flexible C++ library for working with OpenStreetMap data.
https://osmcode.org/libosmium/
Boost Software License 1.0

What is the best practice for writing CPU-intensive handlers? #370

Closed cldellow closed 9 months ago

cldellow commented 10 months ago

In https://github.com/systemed/tilemaker/discussions/621, I was encouraged to use libosmium. I ultimately decided to go a different route, but am trying to understand how libosmium could have been used.

My understanding is that a common way to use libosmium to read a PBF is to write a Handler. At runtime, this looks like:

If the handler needs to do an expensive operation, e.g. simplifying a complex geometry, what is the recommended approach?

If I naively do the operation inside the handler, I think I must choose between two results:

Am I missing another option, where the handler can run directly on the parser threads as they parse the PBF?

joto commented 10 months ago

Libosmium I/O was designed around the idea that it will hide details of the file format from you. PBF files are block-based, but other OSM file formats are not, so libosmium will hide the blocks from the input file. The idea is that it will give you a stream of OSM objects. For performance reasons though, it gives you blocks (osmium::memory::Buffer) of OSM objects. But those blocks are not necessarily tied to any blocks in the PBF file. And the blocks are really only a way of handling memory management without having to manage memory for each and every OSM object. (An earlier version of Osmium did that and memory management overhead turned out to be too expensive.)

Libosmium I/O is also designed with the idea that writing concurrent code in C++ is hard and that most people don't want to do that. So it gives you that stream of data without you having to think about any concurrency. It will use concurrency behind the scenes for some types of file formats when doing the actual I/O and en/decoding of the data, but hides it from the library user. The library user can write their code without ever having to think about concurrency. They can still use concurrency in their code if they want to, but that would be completely separate.

This design has its limitations of course, and you found some of them. I'd love to have a better interface for all of this that would allow this to be more flexible, but I struggle to come up with a design that would work with all the different file formats and still work better if you want to do some heavy processing in multiple threads. If we wanted to go that route, we'd probably have to use some existing framework for concurrency, because the C++ standard library is just too limited in that regard. But that adds another dependency and every user would have to fit their code into that framework. Maybe all this could be done with the newer concurrency stuff in the C++ standard library, this is certainly something we could look into. But requiring the newest C++ standard library is also something that would need to be considered carefully.

I hope this has made the background a bit clearer. Your analysis is correct, there is no other way. (Well, you don't need to use handlers, sometimes it is easier to just iterate over all objects in a buffer, or iterate over all objects in a file without a handler, but that is beside the point here.)

cldellow commented 10 months ago

Great, thanks for your patience in explaining these things. I now understand that a design decision was to give users of the library a serialized, in-order view of objects, which trades performance for ease-of-use. That makes a lot of sense!

I'd love to have a better interface for all of this that would allow this to be more flexible

I haven't thought too deeply about this, so please poke holes in it. I wonder if, in addition to the current API surface, it would make sense to offer an interface like:

Reader(const osmium::io::File& file, Handler& handler)

In this mode, Reader would understand that the user wants their handler to be dispatched to immediately upon the availability of OSM objects. Objects would not be queued, and could potentially come out of order. When using this mode, users are responsible for writing thread-safe code.

I'd also propose adding two methods to Handler:

class Handler {
public:
  void start_batch() const noexcept {}
  void end_batch() const noexcept {}
};

These would be called on the start and end of dispatching. In the current world, I'm imagining these would be called:

(In a future world where the coordinator could distribute work as described in https://github.com/osmcode/libosmium/issues/151#issuecomment-1868051334, they'd be called once per batch of sequential PrimitiveBlocks handled by a given thread).

Then, the programmer who is willing to be concurrency aware in exchange for performance could write a handler like:

thread_local uint64_t tlsNodes = 0;

struct MyHandler : public Handler {
  std::atomic<uint64_t> nodes{0};

  void node(const osmium::Node&) {
    tlsNodes++;
  }

  void start_batch() {
    tlsNodes = 0;
  }

  void end_batch() {
    nodes += tlsNodes;
  }
};

And in their main code:

MyHandler handler;
Reader reader{"somefile.pbf", handler};
std::cout << "there are " << handler.nodes.load() << " nodes" << std::endl;

I use the count example for simplicity of illustration, but the benefit should be the same whether the workload is small or large -- the tool is always doing useful work, and objects don't need to be buffered.

I believe this interface still hides the details of the underlying file format, and so can apply to PBF, XML, etc. It uses thread_local, but that's from C++11, which libosmium already requires.

There are other approaches - in the example above, I passed a single instance of a handler, and used a thread local to manage local state. Reader could instead take an object that is able to construct instances of handlers. The overall idea is the same - C++ is not my area of expertise, so I'm not sure how best to structure it to give good ergonomics to the user.

joto commented 10 months ago

The problem with your proposed interface is that it would depend on the actual file format used whether you get any kind of parallel processing out of it. For PBF files this would (sort of) work, but for other file formats it doesn't, because they are not block-based. You would end up with a single thread running the handler because there is only a single thread doing the parsing (plus possibly another thread for gzip decompression). And for the PBF it depends on whether it makes sense to have the "unit of processing" to be the same as the PrimitiveBlocks. That might or might not be the case for different types of workloads.

Reader could instead take an object that is able to construct instances of handlers.

It sounds reasonably easy, but doing something like this is quite difficult in C++. The Reader has to push this handler factory down into the PBF implementation and into another thread, which mixes up a lot of code that should be independent of each other. We are getting awfully close to rewriting a thread handling library like TBB. And I certainly don't want to do that.

I think if we can solve the problem that moving the Buffers from one thread to another through that single queue seems to be blocking too much, we can decouple all of this. We'd still move everything through a single thread, but that thread doesn't have to do more than push the buffers on to other threads as it sees fit, if the developer wants this. There would still be some overhead compared to your solution, but it would be much more generic.

cldellow commented 9 months ago

Makes sense, thanks for taking the time to respond! I'll close this to avoid cluttering up the repo.