mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions
GNU General Public License v2.0

Multithreading: let calling thread do its own share of work #15

Closed: mreineck closed this 1 year ago

mreineck commented 1 year ago

Up to now, in all parallel sections we distribute the work over nthreads waiting threads in a thread pool, while the calling thread idles until the parallel section finishes. This change uses only nthreads-1 threads from the pool and lets the calling thread do a share of the work as well. I expect this to make life easier for the kernel's thread scheduler and to avoid unnecessary thread migrations. In fact, I observe that the test suite runs considerably faster (about 10%) with this change on my laptop.
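
For illustration, a minimal sketch of the idea (not ducc's actual code; plain std::thread stands in for the pool):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// nthreads-1 helpers take chunks 1..nthreads-1, and the caller runs
// chunk 0 itself instead of idling until everybody is done.
void parallel_for(std::size_t nwork, std::size_t nthreads,
                  const std::function<void(std::size_t, std::size_t)> &func)
{
  std::size_t chunk = (nwork + nthreads - 1) / nthreads;
  std::vector<std::thread> helpers;
  for (std::size_t i = 1; i < nthreads; ++i)
    helpers.emplace_back([&, i] {
      func(std::min(nwork, i * chunk), std::min(nwork, (i + 1) * chunk));
    });
  func(0, std::min(nwork, chunk));   // the calling thread does its own share
  for (auto &t : helpers) t.join();  // only afterwards does it wait
}
```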

@peterbell10, do you think this is a useful (and safe) change?

mreineck commented 1 year ago

@cantonios can you please also have a look at whether this interferes with the philosophy of other thread pools? In principle it shouldn't, since we already use the calling thread to do the work when only one thread is requested. I'd just rather be safe than sorry :-)

mreineck commented 1 year ago

One more thing: currently the calling thread notifies all workers sequentially at the beginning of a parallel region. Since this can apparently be a rather slow operation (roughly 1 to 30 microseconds per notification), could it make sense for every woken-up thread to notify further threads itself, so that the startup time drops from O(nthreads) to O(log(nthreads))?

Obviously this doesn't matter much on small machines, but things may be different on, for example, 256-core AMD nodes ...
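
As a sketch of that idea (invented Worker type, not ducc's internals), the wake-up could walk a binary tree over the thread indices, so the caller only notifies worker 0 and the wake-up front doubles at every step:

```cpp
#include <cstddef>
#include <vector>

// Invented stand-in; notify() represents waking that worker's
// condition variable.
struct Worker { void notify() { /* wake this worker */ } };

// On waking, worker `me` first wakes its two "children" in a binary
// tree over the thread indices, then starts on its own chunk.
void on_wake(std::vector<Worker> &workers, std::size_t me)
{
  std::size_t l = 2 * me + 1, r = 2 * me + 2;
  if (l < workers.size()) workers[l].notify();
  if (r < workers.size()) workers[r].notify();
  // ... then proceed with this worker's share of the parallel region ...
}
```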

mreineck commented 1 year ago

... or maybe even simpler: what would happen if ducc_thread_pool::submit() simply pushed the passed work items onto the overflow queue? Then the worker threads would fetch them "in their own time", and the submitting thread would not have to wait.
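
A minimal sketch of what such a non-blocking submit() might look like (the member names are invented, not ducc's actual internals):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// submit() only enqueues the work item; workers pop from the queue
// whenever they are ready, so the submitter never waits for a hand-over.
struct OverflowPool {
  std::mutex mtx;
  std::condition_variable cv;
  std::queue<std::function<void()>> overflow;

  void submit(std::function<void()> work) {
    {
      std::lock_guard<std::mutex> lock(mtx);
      overflow.push(std::move(work));
    }
    cv.notify_one();  // wake one idle worker, if any
  }
};
```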

cantonios commented 1 year ago

> @cantonios can you please also have a look at whether this interferes with the philosophy of other thread pools? In principle it shouldn't, since we already use the calling thread to do the work when only one thread is requested. I'd just rather be safe than sorry :-)

What you have looks fine to me. The main thread is going to be waiting anyway, so it may as well do work.

cantonios commented 1 year ago

> One more thing: currently the calling thread notifies all workers sequentially at the beginning of a parallel region. Since this can apparently be a rather slow operation (roughly 1 to 30 microseconds per notification), could it make sense for every woken-up thread to notify further threads itself, so that the startup time drops from O(nthreads) to O(log(nthreads))?

We often find a binary divide-and-conquer approach works best. Submit the first job to the pool for [0, n]. When a thread picks up the job from the pool, it splits it into [0, n/2] and [n/2, n], submits one half back to the pool, and executes the other half itself (which it may then again divide in half, and so on). Only the main thread waits.

Eigen's thread pool is optimized for this type of division of work though, so your results may vary.
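
A minimal sketch of this divide-and-conquer scheme, with std::thread::detach standing in for a pool submission (a real pool would reuse idle workers) and an atomic counter for completion tracking:

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

// A job covering [lo, hi) keeps splitting in half, resubmitting the
// upper half and continuing with the lower half itself, until the
// remaining range is at most `grain` elements.
void run_range(std::atomic<std::size_t> &remaining,
               std::size_t lo, std::size_t hi, std::size_t grain,
               const std::function<void(std::size_t, std::size_t)> &func)
{
  while (hi - lo > grain) {
    std::size_t mid = lo + (hi - lo) / 2;
    // hand the upper half back to the "pool"; any idle worker may take it
    std::thread([&, mid, hi] { run_range(remaining, mid, hi, grain, func); })
        .detach();
    hi = mid;  // keep halving the lower part ourselves
  }
  func(lo, hi);
  remaining.fetch_sub(hi - lo);  // report this leaf as finished
}

// Only the caller waits:
//   std::atomic<std::size_t> remaining{n};
//   run_range(remaining, 0, n, grain, func);   // first job for [0, n)
//   while (remaining.load() != 0) std::this_thread::yield();
```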

mreineck commented 1 year ago

Thanks, @cantonios!

I just pushed something along these lines: with 8 threads, for example, the calling thread submits work to threads 4, 2, and 1 and then executes chunk 0; thread 4 in turn submits to threads 6 and 5 and then executes chunk 4; and so on. I'll have to measure this on a big node; with nthreads=8 I'm not seeing much difference. I have to admit, though, that the code structure is quite adventurous ...
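
For concreteness, a sketch of this submission pattern; `submit_to(i, job)` and `chunk(i)` are invented stand-ins for handing a job to worker i and for worker i's share of the work:

```cpp
#include <cstddef>
#include <functional>

// The owner of the index range [lo, hi) repeatedly splits off the upper
// half, hands it to the worker at its start index, and finally runs its
// own chunk.
void spawn_range(
    std::size_t lo, std::size_t hi,
    const std::function<void(std::size_t, std::function<void()>)> &submit_to,
    const std::function<void(std::size_t)> &chunk)
{
  while (hi - lo > 1) {
    std::size_t mid = lo + (hi - lo) / 2;
    // with [0,8): hand [4,8) to worker 4, then [2,4) to 2, then [1,2) to 1
    submit_to(mid, [=, &submit_to, &chunk] {
      spawn_range(mid, hi, submit_to, chunk);
    });
    hi = mid;
  }
  chunk(lo);  // finally execute our own chunk
}
```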

peterbell10 commented 1 year ago

Maybe I'm misunderstanding, but @cantonios, are you describing a dynamic work-sharing algorithm? The recursive structure doesn't make sense to me for a static distribution, since the initial thread is going to have the lowest latency to submit the tasks, so it might as well submit all of them if the distribution is statically known.

mreineck commented 1 year ago

> Maybe I'm misunderstanding, but @cantonios, are you describing a dynamic work-sharing algorithm? The recursive structure doesn't make sense to me for a static distribution, since the initial thread is going to have the lowest latency to submit the tasks, so it might as well submit all of them if the distribution is statically known.

But even for a static distribution, if the initial thread submits work to all n threads, that's an overhead of O(n) on that single thread; doing it recursively bounds the overhead on any single thread to O(log(n)) (waiting time plus submission time).

... or is there a way of preparing the work functions of all threads and then doing a single notify_all()? That might be very nice, but I think it would require more extensive changes.
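
A sketch of how that could look (all names invented): the caller installs every worker's work function and bumps a generation counter under the lock, and only then issues one notify_all, so no worker can wake up before its work is in place.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

struct Pool {
  std::mutex mtx;
  std::condition_variable cv;
  std::vector<std::function<void()>> slot;  // one work function per worker
  std::size_t generation = 0;

  void start_parallel(const std::vector<std::function<void()>> &work) {
    {
      std::lock_guard<std::mutex> lock(mtx);
      slot = work;    // all work functions are in place ...
      ++generation;   // ... before the generation changes
    }
    cv.notify_all();  // a single broadcast wakes everybody
  }

  void worker_wait(std::size_t my_gen) {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [&] { return generation != my_gen; });
  }
};
```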

mreineck commented 1 year ago

It seems there is consensus about the first part of the PR, and since we can easily revert the second part, I'm going to merge this now. Nevertheless, please let's continue the discussion here!

cantonios commented 1 year ago

> Maybe I'm misunderstanding, but @cantonios, are you describing a dynamic work-sharing algorithm? The recursive structure doesn't make sense to me for a static distribution, since the initial thread is going to have the lowest latency to submit the tasks, so it might as well submit all of them if the distribution is statically known.

We do the same for statically known work distributions as well. The reason has to do with how work is distributed and with the use of shared thread pools. In TensorFlow, each thread in the pool has its own work queue that it pushes work to and pulls work from first. The main thread (or any thread outside the pool) distributes work to random threads' queues. If a thread has no more work in its own queue, it tries to steal work from another queue. This helps reduce contention and maximize cache re-use.
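
A rough sketch of this scheme (illustrative names and plain mutexes; the real Eigen/TensorFlow pool uses lock-free deques and more careful victim selection):

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <vector>

using Task = std::function<void()>;

struct WorkerQueue {
  std::mutex mtx;
  std::deque<Task> tasks;  // owner uses the front, thieves use the back
};

Task next_task(std::vector<WorkerQueue> &queues, std::size_t me,
               std::mt19937 &rng)
{
  {  // fast path: our own queue first, for cache locality
    std::lock_guard<std::mutex> lock(queues[me].mtx);
    if (!queues[me].tasks.empty()) {
      Task t = std::move(queues[me].tasks.front());
      queues[me].tasks.pop_front();
      return t;
    }
  }
  // slow path: try to steal from the back of a random victim's queue
  std::size_t victim = rng() % queues.size();
  std::lock_guard<std::mutex> lock(queues[victim].mtx);
  if (!queues[victim].tasks.empty()) {
    Task t = std::move(queues[victim].tasks.back());
    queues[victim].tasks.pop_back();
    return t;
  }
  return Task{};  // nothing found; the caller retries or goes to sleep
}
```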