mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions
GNU General Public License v2.0

Switching the threading component to C++20's atomic synchronization? #24

Open mreineck opened 6 months ago

mreineck commented 6 months ago

As part of an experiment I recently switched from ducc's latch class to C++20's std::latch, and to my surprise I noticed a significant speedup when benchmarking the overhead of submitting work items to the thread pool.

Looking at the source code, it seems that std::latch (at least the one from libstdc++) uses atomic synchronization (see, e.g., https://developers.redhat.com/articles/2022/12/06/implementing-c20-atomic-waiting-libstdc#). I wonder if it might be useful to switch the rest of ducc's multithreading component to this mechanism as well.

What's your opinion, @peterbell10 ?

peterbell10 commented 6 months ago

That makes sense to me. std::latch likely has higher power overhead while waiting (i.e. less time spent asleep), but when the waiting thread also has work to do, you don't spend much time waiting, so it should be worth it.

mreineck commented 6 months ago

Thanks! You are certainly right: any approach with (partial) busy waiting will require more power and is likely only useful in scenarios where each thread is guaranteed a dedicated core to run on (e.g. large non-interactive scientific calculations run on a batch system). Since the whole thing would require switching ducc to C++20, it probably won't go into the main branch very soon, but it's nice to explore what can be achieved with it. My current version is on the crazy_threading branch.

Using this I managed to reduce the time spent executing an empty parallel region from roughly 30 microseconds to 1.6 microseconds on the 16 hardware threads of my laptop. The whole thing was prompted by the sub-optimal performance of the ducc FFT for small 2D and 3D transforms shown at https://github.com/blackwer/fft_bench. The results look much better now (in my local tests), but I'm not sure whether this warrants such a fundamental change.

mreineck commented 6 months ago

There is one problematic aspect of the change: for some reason, the CI runs on macOS become extremely slow and are cancelled before the tests can finish. I watched the run, and I'm fairly certain there are no deadlocks; still, the whole test script slows down to a crawl (see, e.g., https://github.com/mreineck/ducc/actions/runs/7931217750). I have no idea what's causing this, or whether it's a property of the testing environment...