ocaml-multicore / domainslib

Parallel Programming over Domains
ISC License

Task stealing with CL deques #29

Closed ctk21 closed 3 years ago

ctk21 commented 3 years ago

This PR is an implementation of Chase-Lev deques organised for task stealing, as is traditional in Cilk, concurrent bags, etc.

The basic idea is:

Early benchmarking results (detuned Zen2 with sandmark fb2a38) are encouraging: 20210519_big_deque_time 20210519_big_deque_speedup

I wouldn't say the implementation is fully finished:

_The CL deque implementation was built on the queue in lockfree, but I would recommend looking at the C code in Correct and efficient work-stealing for weak memory models._
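For readers unfamiliar with the structure, here is a much-simplified, fixed-capacity sketch of a Chase-Lev deque using OCaml 5's `Atomic` module. All names are illustrative and this is not the PR's code: the real implementation resizes the buffer and is far more careful. The owner pushes and pops at the bottom; thieves steal from the top, racing via a CAS on `top` only for the last remaining task.

```ocaml
(* Illustrative fixed-capacity Chase-Lev deque sketch (not the PR's code).
   Uses sequentially consistent atomics from OCaml 5's Atomic module. *)
type 'a t = {
  top : int Atomic.t;        (* next slot thieves steal from *)
  bottom : int Atomic.t;     (* next free slot for the owner *)
  tasks : 'a option array;   (* circular buffer; size a power of two *)
  mask : int;
}

let create n = {
  top = Atomic.make 0;
  bottom = Atomic.make 0;
  tasks = Array.make n None;
  mask = n - 1;
}

(* Owner only: push a task at the bottom. *)
let push q x =
  let b = Atomic.get q.bottom in
  q.tasks.(b land q.mask) <- Some x;
  Atomic.set q.bottom (b + 1)

(* Owner only: pop from the bottom; races with thieves on the last task. *)
let pop q =
  let b = Atomic.get q.bottom - 1 in
  Atomic.set q.bottom b;
  let t = Atomic.get q.top in
  if b < t then (Atomic.set q.bottom t; None)
  else
    let x = q.tasks.(b land q.mask) in
    if b > t then x
    else begin
      (* exactly one task left: win or lose it via CAS on top *)
      let won = Atomic.compare_and_set q.top t (t + 1) in
      Atomic.set q.bottom (t + 1);
      if won then x else None
    end

(* Any thief: steal from the top. *)
let steal q =
  let t = Atomic.get q.top in
  let b = Atomic.get q.bottom in
  if t >= b then None
  else
    let x = q.tasks.(t land q.mask) in
    if Atomic.compare_and_set q.top t (t + 1) then x else None
```

Note the LIFO/FIFO asymmetry: the owner works depth-first off the bottom (good locality), while thieves take the oldest tasks off the top (usually the largest pieces of work).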

ctk21 commented 3 years ago

(rebased to pick up #30)

ctk21 commented 3 years ago

I had a think about formalizing how multi_channel turns a non-blocking structure into a blocking one. If Thread A is enqueuing data and Thread B is dequeuing, then:

Thread A:                                    Thread B:
A1 - enqueue data (non-blocking)             B1 - poll for data (non-blocking)
A2 - check waiter queue and signal           B2 - enqueue ourselves on the waiter queue
     any blocking waiter                     B3 - poll for data again (non-blocking) 
                                                  and block with waiter structure

Each of A1, A2, B1, B2, B3 is ordered in the sense that, for any pair, completion is either Ax < By or Bx < Ay. I think we can then construct the following argument to show that Thread B will always pick up Thread A's enqueue:

if A1 < B1 then B will pick up the data in B1
else (B1 < A1):
   if B2 < A1 then A will signal B in A2 and the data will be collected by B
   else (A1 < B2): B3 will execute after A1 and so B will pick up the data
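The protocol above can be sketched as follows. This is a hypothetical illustration, not domainslib's actual multi_channel: a plain `Queue` stands in for the non-blocking structure, and a single mutex/condvar plays the role of the waiter queue, so the poll / register / re-poll steps collapse into one critical section. The point is only the ordering of A1/A2 against B1/B2/B3.

```ocaml
(* Hypothetical sketch of the blocking wrapper (not domainslib's code).
   send = A1 (enqueue) then A2 (signal a waiter);
   recv = B1 (poll), B2 (register as waiter), B3 (re-poll, then block). *)
type 'a chan = {
  items : 'a Queue.t;
  mutable n_waiters : int;   (* consumers registered as waiting (B2) *)
  m : Mutex.t;
  c : Condition.t;
}

let make () =
  { items = Queue.create (); n_waiters = 0;
    m = Mutex.create (); c = Condition.create () }

let send ch x =
  Mutex.lock ch.m;
  Queue.push x ch.items;               (* A1: enqueue data *)
  if ch.n_waiters > 0 then
    Condition.signal ch.c;             (* A2: signal any blocked waiter *)
  Mutex.unlock ch.m

let recv ch =
  Mutex.lock ch.m;
  while Queue.is_empty ch.items do     (* B1 (and B3 on re-entry): poll *)
    ch.n_waiters <- ch.n_waiters + 1;  (* B2: register as a waiter *)
    Condition.wait ch.c ch.m;          (* block until A2 signals *)
    ch.n_waiters <- ch.n_waiters - 1
  done;
  let x = Queue.pop ch.items in
  Mutex.unlock ch.m;
  x
```

The case analysis in the argument is exactly why B must re-poll (B3) after registering: if A1 lands between B1 and B2, the signal in A2 covers it; if A1 lands after B2, the re-poll before blocking covers it.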
ctk21 commented 3 years ago

I've implemented two improvements:

The big benchmarks now look like this; Zen2 (not isolated) sandmark ed00c5, baseline is bc7c44, this pr is e69db8: 20210526_sherwood_big_domainslib_pr29_time 20210526_sherwood_big_domainslib_pr29_speedup

Broad brush: if you want to scale, task stealing is the way to go.

The benchmarks for the standard sizes show: 20210526_sherwood_pr29_time 20210526_sherwood_pr29_speedup

There is a single regression for our evolutionary_algorithm case. I spent some time with this one; there are two things to note:

I think we could do something about the first: we could throttle how we wake up blocked waiters and also limit the number of stealing domains. I'm in two minds about whether we should pursue that in this PR or as follow-ups.

ctk21 commented 3 years ago

So I did a bit more investigation of the evolutionary_algorithm results and found a memory bug. Tasks in the deque arrays will eventually be overwritten, but all the memory held in a task's closures is kept alive until its slot in the deque array is overwritten.

I have implemented a fix where the deque array elements hold a reference to the item; this reference is clobbered when taking tasks. You can't touch the array itself in steal because the producer may reuse that slot.
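The shape of the fix can be sketched like this (names are illustrative, not the PR's code): each slot holds a mutable cell rather than the task itself, so the taker can clobber the cell's contents and release the closure to the GC without ever writing to the array slot, which the producer may concurrently reuse.

```ocaml
(* Hypothetical sketch of the indirection fix (not the PR's code).
   A slot holds a reference cell; taking a task empties the cell,
   dropping the closure, while the array slot itself is left alone. *)
type 'a cell = 'a option ref

(* Producer: install a fresh cell, so a stale taker who still holds the
   old cell cannot clobber the new task. *)
let put (slots : 'a cell array) i (x : 'a) =
  slots.(i) <- ref (Some x)

(* Taker (pop or steal): clobber the reference, not the array slot. *)
let take (slots : 'a cell array) i =
  let cell = slots.(i) in
  let x = !cell in
  cell := None;
  x
```

Installing a fresh cell on every `put` is the key design point: it keeps the taker's write confined to a cell that only that task ever lived in.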

A rerun of the big benchmarks got me: 20210527_ws_deque_gc_fix_big_time 20210527_ws_deque_gc_fix_big_speedup

A rerun of the standard size benchmarks got me: 20210527_ws_deque_gc_fix_time 20210527_ws_deque_gc_fix_speedup

There remains something a bit odd about evolutionary algorithm that I don't fully understand - it looks like we do differing amounts of GC work with the old vs new code. The deques are causing about 20% more minor GCs, which is odd as I can't see significant extra allocations in the domainslib structures themselves: 20210527_ws_deque_ea_mgc

kayceesrk commented 3 years ago

I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.

Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

kayceesrk commented 3 years ago

Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

ctk21 commented 3 years ago

> Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

Rebased to master and pushed!

ctk21 commented 3 years ago

> I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.
>
> Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

To expand a bit: we can implement the CL deque using the Atomic module and the existing OCaml multicore memory model. However, if you look at Figure 1 in Correct and efficient work-stealing for weak memory models, we aren't able to express the acquire-release components of that implementation in OCaml multicore today.

In other parallel data structures, for example an SPSC ring buffer, using acquire-release can be important to get a really low-overhead implementation. However, there is an open problem: whether the OCaml multicore memory model can be extended to accommodate acquire-release atomics.
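To make the SPSC example concrete, here is an illustrative single-producer single-consumer ring buffer (not from domainslib; all names are hypothetical). In today's OCaml every `Atomic` operation here is sequentially consistent; in the C version the `tail` publication would be a release store and the matching reads acquire loads, which is exactly the strength the current model cannot express.

```ocaml
(* Illustrative SPSC ring buffer sketch. The Atomic operations below are
   SC in today's OCaml; acquire/release would suffice in a C version. *)
type 'a spsc = {
  buf : 'a option array;   (* size a power of two *)
  head : int Atomic.t;     (* consumer position *)
  tail : int Atomic.t;     (* producer position *)
  mask : int;
}

let create n =
  { buf = Array.make n None; head = Atomic.make 0;
    tail = Atomic.make 0; mask = n - 1 }

(* Producer only. *)
let try_push q x =
  let t = Atomic.get q.tail in
  if t - Atomic.get q.head > q.mask then false   (* full *)
  else begin
    q.buf.(t land q.mask) <- Some x;
    Atomic.set q.tail (t + 1);   (* would be a release store in C *)
    true
  end

(* Consumer only. *)
let try_pop q =
  let h = Atomic.get q.head in
  if h = Atomic.get q.tail then None   (* would be an acquire load in C *)
  else begin
    let x = q.buf.(h land q.mask) in
    q.buf.(h land q.mask) <- None;
    Atomic.set q.head (h + 1);
    x
  end
```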