google / marl

A hybrid thread / fiber task scheduler written in C++11
Apache License 2.0

Optimize scheduler queue data structures to reduce cache misses and fragmentation #129

Open · benvanik opened this issue 4 years ago

benvanik commented 4 years ago

From the TODO here: https://github.com/google/marl/blob/0591f0ef0811eff9e87c22048abc048c0399862d/include/marl/scheduler.h#L283-L287

and this: https://github.com/google/marl/blob/0591f0ef0811eff9e87c22048abc048c0399862d/include/marl/scheduler.h#L436

unordered_map and deque aren't great for memory locality and heap fragmentation, and currently they don't use the marl Allocator, so their allocations come straight from the system allocator. It'd be good to evaluate some alternative data structures (adding to, or reusing, those already in https://github.com/google/marl/blob/master/include/marl/containers.h) and pooling.
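For illustration only, here's a minimal sketch (not marl code) of the kind of contiguous, pooled structure the TODO is hinting at: a growable ring buffer that keeps queued items in one flat allocation rather than deque's scattered chunks. The RingQueue name and the doubling growth policy are invented for the example, and it ignores the marl::Allocator integration entirely:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Fixed-capacity ring buffer holding items in one contiguous block, so
// pushes/pops touch adjacent cache lines instead of chasing deque chunks.
// It grows by doubling only when full, which also bounds heap traffic.
template <typename T>
class RingQueue {
 public:
  explicit RingQueue(size_t initialCapacity = 64) : buffer(initialCapacity) {
    assert(initialCapacity > 0);
  }

  void push(T&& item) {
    if (count == buffer.size()) { grow(); }
    buffer[(head + count) % buffer.size()] = std::move(item);
    count++;
  }

  bool pop(T& out) {
    if (count == 0) { return false; }
    out = std::move(buffer[head]);
    head = (head + 1) % buffer.size();
    count--;
    return true;
  }

  size_t size() const { return count; }

 private:
  void grow() {
    // Copy the live range into a larger buffer, re-packed from index 0.
    std::vector<T> next(buffer.size() * 2);
    for (size_t i = 0; i < count; i++) {
      next[i] = std::move(buffer[(head + i) % buffer.size()]);
    }
    buffer = std::move(next);
    head = 0;
  }

  std::vector<T> buffer;
  size_t head = 0;
  size_t count = 0;
};
```

Whether something like this actually beats std::deque for the scheduler's queues would need to be shown in the benchmarks, per the discussion below.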

ben-clayton commented 4 years ago

This could be broken down into two issues:

  1. We're performing 'system' heap allocations outside of the Allocator interface.
  2. unordered_map and deque aren't great for memory locality and heap fragmentation.

I definitely agree with 1, and it could be fixed quickly by implementing a std::allocator that wraps the marl::Allocator. I've created #131 to track this. I'd like to get to the point where we can run all the unit tests and examples without a single heap allocation that wasn't performed by the marl::Allocator.
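An adapter along those lines could look something like the sketch below. This is just an illustration of the approach, not the change tracked in #131: it assumes marl::Allocator::allocate()/free() take a marl::Allocation::Request / marl::Allocation as declared in marl/memory.h, and the StlAllocator name is made up for the example:

```cpp
#include <cstddef>
#include "marl/memory.h"

// C++11 std::allocator adapter so std::unordered_map / std::deque route
// their allocations through a marl::Allocator instead of the system heap.
// Sketch only: assumes the allocate()/free() shapes from marl/memory.h.
template <typename T>
class StlAllocator {
 public:
  using value_type = T;

  explicit StlAllocator(marl::Allocator* allocator) : allocator(allocator) {}

  template <typename U>
  StlAllocator(const StlAllocator<U>& other) : allocator(other.allocator) {}

  T* allocate(size_t n) {
    marl::Allocation::Request request;
    request.size = sizeof(T) * n;
    request.alignment = alignof(T);
    auto alloc = allocator->allocate(request);
    return reinterpret_cast<T*>(alloc.ptr);
  }

  void deallocate(T* p, size_t n) {
    // Rebuild the allocation record so the allocator knows what to release.
    marl::Allocation alloc;
    alloc.ptr = p;
    alloc.request.size = sizeof(T) * n;
    alloc.request.alignment = alignof(T);
    allocator->free(alloc);
  }

  marl::Allocator* allocator;
};

template <typename T, typename U>
bool operator==(const StlAllocator<T>& a, const StlAllocator<U>& b) {
  return a.allocator == b.allocator;
}
template <typename T, typename U>
bool operator!=(const StlAllocator<T>& a, const StlAllocator<U>& b) {
  return !(a == b);
}
```

Usage would then be something like std::deque<Task, StlAllocator<Task>> work{StlAllocator<Task>(allocator)}, keeping the container types unchanged while making every allocation visible to marl.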

As for 2, I don't disagree, but I've been using various profilers and benchmarks for guidance here. The gradual replacement of std containers with those in marl/containers.h has only been done where they appeared hot in profiles. I suspect you'll find other, more significant optimisation opportunities before these become the bottleneck. If you can show they are a major performance sponge, I'm more than happy to try replacing them.

benvanik commented 4 years ago

For 2 I was mainly just going off the TODO in the code and have no empirical results yet to show why anything different would be worth it, besides the general rule that "doing zero work and not thrashing the cache is always cheaper than doing any work" :)

Part of this would probably involve adding more test cases to scheduler_bench.cpp so that we can stress those structures a bit more (e.g. with a large number of waiters).
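Something along these lines, perhaps (just a sketch, not a proposed patch): park a configurable number of tasks on a single manual-reset event so the scheduler's waiter bookkeeping gets exercised, then wake them all at once. The scheduler setup here assumes the setWorkerThreadCount()/bind() API rather than reusing the existing benchmark fixture:

```cpp
#include "benchmark/benchmark.h"

#include "marl/event.h"
#include "marl/scheduler.h"
#include "marl/waitgroup.h"

// Benchmark idea: block many tasks on one event so the scheduler's waiter
// data structures grow large, then release them all in one burst.
static void ManyWaiters(benchmark::State& state) {
  marl::Scheduler scheduler;
  scheduler.setWorkerThreadCount(4);
  scheduler.bind();

  const int numWaiters = static_cast<int>(state.range(0));
  for (auto _ : state) {
    marl::Event wake(marl::Event::Mode::Manual);
    marl::WaitGroup done(numWaiters);
    for (int i = 0; i < numWaiters; i++) {
      marl::schedule([=] {
        wake.wait();  // each task parks itself, adding to the waiter list
        done.done();
      });
    }
    wake.signal();  // release every waiter at once
    done.wait();
  }

  scheduler.unbind();
}
BENCHMARK(ManyWaiters)->Range(8, 8 << 10);
```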