Closed mfbalin closed 1 year ago
I think the current example has another drawback. Each thread copies the same number of bytes, but if the OS schedules the threads unfairly, the load becomes imbalanced. That can skew the timing measurements and make it seem as though more threads hurt performance.
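To make the imbalance concern concrete, here is a minimal sketch of one alternative to fixed per-thread slices: workers claim chunks from a shared atomic cursor, so a thread that gets less CPU time simply copies fewer chunks. This is not the example's code and not work stealing; `copy_dynamic` and its parameters are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper: copies src into dst with n_threads workers.
// Workers repeatedly claim the next chunk from a shared cursor instead of
// owning a fixed slice. Returns the number of bytes each worker copied.
std::vector<std::size_t> copy_dynamic(const std::vector<char>& src,
                                      std::vector<char>& dst,
                                      unsigned n_threads,
                                      std::size_t chunk_size) {
  assert(dst.size() == src.size() && n_threads > 0 && chunk_size > 0);
  std::atomic<std::size_t> next{0};  // offset of the next unclaimed chunk
  std::vector<std::size_t> copied(n_threads, 0);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t] {
      std::size_t local = 0;
      for (;;) {
        // Claim one chunk; relaxed ordering suffices because chunks are
        // disjoint and join() publishes the results.
        const std::size_t begin =
            next.fetch_add(chunk_size, std::memory_order_relaxed);
        if (begin >= src.size()) break;
        const std::size_t len = std::min(chunk_size, src.size() - begin);
        std::memcpy(dst.data() + begin, src.data() + begin, len);
        local += len;
      }
      copied[t] = local;  // written once per worker
    });
  }
  for (auto& w : workers) w.join();
  return copied;
}
```

The returned per-worker byte counts also show how unevenly the work ended up being distributed in a given run.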
This enters the realm of real multithreaded schedulers, where you have some form of work stealing, as in static_thread_pool. Would you like to contact me via Discord? My username is maikel.nadolski, and there is an executor channel on the #include server for discussing P2300-related topics and asking for help with the stdexec framework.
I've added the memory pool to constrain the number of submitted read operations to each context. That dramatically improves the initial performance since there is no O(N) allocation (and iteration) anymore.
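As a rough illustration of the idea (not the actual change), a fixed-capacity pool can both avoid per-operation allocation and cap the number of reads in flight. Everything below is an assumption for the sketch: `read_op`, `op_pool`, and `run_reads` are hypothetical names, completions are simulated inline, and the pool is assumed to be used from a single context/thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the state of one read operation.
struct read_op {
  std::size_t offset = 0;
  std::size_t length = 0;
};

// Fixed-capacity pool: slots are allocated once, and the capacity bounds
// how many reads can be in flight. Not thread-safe (one pool per context).
class op_pool {
 public:
  explicit op_pool(std::size_t capacity) : slots_(capacity) {
    assert(capacity > 0);
    free_.reserve(capacity);
    for (auto& s : slots_) free_.push_back(&s);
  }
  op_pool(const op_pool&) = delete;  // free_ holds pointers into slots_
  op_pool& operator=(const op_pool&) = delete;

  // Returns nullptr when the in-flight limit is reached.
  read_op* try_acquire() {
    if (free_.empty()) return nullptr;
    read_op* op = free_.back();
    free_.pop_back();
    return op;
  }
  // Called on completion; makes the slot available again.
  void release(read_op* op) { free_.push_back(op); }
  std::size_t in_flight() const { return slots_.size() - free_.size(); }

 private:
  std::vector<read_op> slots_;
  std::vector<read_op*> free_;
};

// Hypothetical driver: issues `total` reads, never more than the pool's
// capacity at once, and returns the peak number in flight.
std::size_t run_reads(op_pool& pool, std::size_t total) {
  std::vector<read_op*> pending;
  std::size_t submitted = 0, completed = 0, peak = 0;
  while (completed < total) {
    // Submit until the pool is exhausted or nothing is left to submit.
    while (submitted < total) {
      read_op* op = pool.try_acquire();
      if (op == nullptr) break;
      op->offset = submitted * 4096;
      op->length = 4096;
      pending.push_back(op);
      ++submitted;
    }
    peak = std::max(peak, pool.in_flight());
    // Simulate one completion, which frees a slot for the next submit.
    pool.release(pending.back());
    pending.pop_back();
    ++completed;
  }
  return peak;
}
```

With a pool of 8 slots, `run_reads(pool, 1000)` keeps at most 8 operations pending, so the up-front cost no longer scales with the total number of reads.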
Move thread_state construction into each thread and use a barrier, so that only the read time is measured and initialization is parallelized.
Eliminate the contention in the counters struct.
These two modifications together seem to make the multithreaded time measurement much more stable and performant.