ocaml-multicore / domainslib

Parallel Programming over Domains
ISC License

Task stealing with CL deques #29

Closed ctk21 closed 3 years ago

ctk21 commented 3 years ago

This PR is an implementation of Chase-Lev deques organised for task stealing, as is traditional in Cilk, concurrent bags, etc.

The basic idea is:

Early benchmarking results (detuned Zen2 with sandmark fb2a38) are encouraging: 20210519_big_deque_time 20210519_big_deque_speedup

I wouldn't say the implementation is fully finished:

_The CL deque implementation was built on the queue in lockfree, but I would recommend looking at the C code in Correct and efficient work-stealing for weak memory models._
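For readers unfamiliar with the structure, here is a much-simplified, fixed-capacity sketch of a Chase-Lev deque using OCaml 5's `Atomic` module. All names are illustrative and this is not the PR's code: the real implementation resizes the buffer and is far more careful. The owner pushes and pops at the bottom; thieves steal from the top, racing via a CAS on `top` only for the last remaining task.

```ocaml
(* Illustrative fixed-capacity Chase-Lev deque sketch (not the PR's code).
   Uses sequentially consistent atomics from OCaml 5's Atomic module. *)
type 'a t = {
  top : int Atomic.t;        (* next slot thieves steal from *)
  bottom : int Atomic.t;     (* next free slot for the owner *)
  tasks : 'a option array;   (* circular buffer; size a power of two *)
  mask : int;
}

let create n = {
  top = Atomic.make 0;
  bottom = Atomic.make 0;
  tasks = Array.make n None;
  mask = n - 1;
}

(* Owner only: push a task at the bottom. *)
let push q x =
  let b = Atomic.get q.bottom in
  q.tasks.(b land q.mask) <- Some x;
  Atomic.set q.bottom (b + 1)

(* Owner only: pop from the bottom; races with thieves on the last task. *)
let pop q =
  let b = Atomic.get q.bottom - 1 in
  Atomic.set q.bottom b;
  let t = Atomic.get q.top in
  if b < t then (Atomic.set q.bottom t; None)
  else
    let x = q.tasks.(b land q.mask) in
    if b > t then x
    else begin
      (* exactly one task left: win or lose it via CAS on top *)
      let won = Atomic.compare_and_set q.top t (t + 1) in
      Atomic.set q.bottom (t + 1);
      if won then x else None
    end

(* Any thief: steal from the top. *)
let steal q =
  let t = Atomic.get q.top in
  let b = Atomic.get q.bottom in
  if t >= b then None
  else
    let x = q.tasks.(t land q.mask) in
    if Atomic.compare_and_set q.top t (t + 1) then x else None
```

Note the LIFO/FIFO asymmetry: the owner works depth-first off the bottom (good locality), while thieves take the oldest tasks off the top (usually the largest pieces of work).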

ctk21 commented 3 years ago

(rebased to pick up #30)

ctk21 commented 3 years ago

I had a think about formalizing how multi_channel turns a non-blocking structure into a blocking one. If Thread A is enqueuing data and Thread B is dequeuing, then:

Thread A:                                    Thread B:
A1 - enqueue data (non-blocking)             B1 - poll for data (non-blocking)
A2 - check waiter queue and signal           B2 - enqueue ourselves on the waiter queue
     any blocking waiter                     B3 - poll for data again (non-blocking) 
                                                  and block with waiter structure

Each of A1, A2, B1, B2, B3 is ordered in the sense that, for any pair, completion is either Ax < By or Bx < Ay. I think we can then construct the following argument to show that Thread B will always pick up Thread A's enqueue:

if A1 < B1 then B will pick up the data in B1
else (B1 < A1):
   if B2 < A1 then A will signal B in A2 and the data will be collected by B
   else (A1 < B2): B3 will execute after A1 and so B will pick up the data
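The protocol above can be sketched as follows. This is a hypothetical illustration, not domainslib's actual multi_channel: a plain `Queue` stands in for the non-blocking structure, and a single mutex/condvar plays the role of the waiter queue, so the poll / register / re-poll steps collapse into one critical section. The point is only the ordering of A1/A2 against B1/B2/B3.

```ocaml
(* Hypothetical sketch of the blocking wrapper (not domainslib's code).
   send = A1 (enqueue) then A2 (signal a waiter);
   recv = B1 (poll), B2 (register as waiter), B3 (re-poll, then block). *)
type 'a chan = {
  items : 'a Queue.t;
  mutable n_waiters : int;   (* consumers registered as waiting (B2) *)
  m : Mutex.t;
  c : Condition.t;
}

let make () =
  { items = Queue.create (); n_waiters = 0;
    m = Mutex.create (); c = Condition.create () }

let send ch x =
  Mutex.lock ch.m;
  Queue.push x ch.items;               (* A1: enqueue data *)
  if ch.n_waiters > 0 then
    Condition.signal ch.c;             (* A2: signal any blocked waiter *)
  Mutex.unlock ch.m

let recv ch =
  Mutex.lock ch.m;
  while Queue.is_empty ch.items do     (* B1 (and B3 on re-entry): poll *)
    ch.n_waiters <- ch.n_waiters + 1;  (* B2: register as a waiter *)
    Condition.wait ch.c ch.m;          (* block until A2 signals *)
    ch.n_waiters <- ch.n_waiters - 1
  done;
  let x = Queue.pop ch.items in
  Mutex.unlock ch.m;
  x
```

The case analysis in the argument is exactly why B must re-poll (B3) after registering: if A1 lands between B1 and B2, the signal in A2 covers it; if A1 lands after B2, the re-poll before blocking covers it.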
ctk21 commented 3 years ago

I've implemented two improvements:

The big benchmarks now look like this; Zen2 (not isolated) sandmark ed00c5, baseline is bc7c44, this pr is e69db8: 20210526_sherwood_big_domainslib_pr29_time 20210526_sherwood_big_domainslib_pr29_speedup

Broad brush: if you want to scale, task stealing is the way to go.

The benchmarks for the standard sizes show: 20210526_sherwood_pr29_time 20210526_sherwood_pr29_speedup

There is a single regression for our evolutionary_algorithm case. I spent some time with this one; there are two things to note:

I think we could do something about the first: we could throttle how we wake up blocked waiters and also limit the number of stealing domains. I'm in two minds about whether we should pursue that in this PR or as follow-ups.

ctk21 commented 3 years ago

So I did a bit more investigation of the evolutionary_algorithm results and found a memory bug. Tasks in the deque arrays will eventually be overwritten, but all the memory held in a task's closures is kept alive until its slot in the deque array is overwritten.

I have implemented a fix where the deque array elements hold a reference to the item; this reference is clobbered when taking tasks. You can't touch the array itself in steal because the producer may reuse that slot.
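The shape of the fix can be sketched like this (names are illustrative, not the PR's code): each slot holds a mutable cell rather than the task itself, so the taker can clobber the cell's contents and release the closure to the GC without ever writing to the array slot, which the producer may concurrently reuse.

```ocaml
(* Hypothetical sketch of the indirection fix (not the PR's code).
   A slot holds a reference cell; taking a task empties the cell,
   dropping the closure, while the array slot itself is left alone. *)
type 'a cell = 'a option ref

(* Producer: install a fresh cell, so a stale taker who still holds the
   old cell cannot clobber the new task. *)
let put (slots : 'a cell array) i (x : 'a) =
  slots.(i) <- ref (Some x)

(* Taker (pop or steal): clobber the reference, not the array slot. *)
let take (slots : 'a cell array) i =
  let cell = slots.(i) in
  let x = !cell in
  cell := None;
  x
```

Installing a fresh cell on every `put` is the key design point: it keeps the taker's write confined to a cell that only that task ever lived in.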

A rerun of the big benchmarks got me: 20210527_ws_deque_gc_fix_big_time 20210527_ws_deque_gc_fix_big_speedup

A rerun of the standard size benchmarks got me: 20210527_ws_deque_gc_fix_time 20210527_ws_deque_gc_fix_speedup

There remains something a bit odd about evolutionary algorithm that I don't fully understand - it looks like we do differing amounts of GC work with the old vs new code. The deques are causing about 20% more minor GCs, which is odd as I can't see significant extra allocations in the domainslib structures themselves: 20210527_ws_deque_ea_mgc

kayceesrk commented 3 years ago

I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.

Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

kayceesrk commented 3 years ago

Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

ctk21 commented 3 years ago

> Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

Rebased to master and pushed!

ctk21 commented 3 years ago

> I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.
>
> Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

To expand a bit: we can implement the CL deque using the Atomic module and the existing OCaml multicore memory model. However, if you look at Figure 1 in Correct and efficient work-stealing for weak memory models, we aren't able to express the acquire-release components of that implementation in OCaml multicore today.

In other parallel data structures, for example an SPSC ring buffer, using acquire-release can be important to get a really low-overhead implementation. However, there is an open problem: whether the OCaml multicore memory model can be extended to accommodate acquire-release atomics.
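To make the SPSC example concrete, here is an illustrative single-producer single-consumer ring buffer (not from domainslib; all names are hypothetical). In today's OCaml every `Atomic` operation here is sequentially consistent; in the C version the `tail` publication would be a release store and the matching reads acquire loads, which is exactly the strength the current model cannot express.

```ocaml
(* Illustrative SPSC ring buffer sketch. The Atomic operations below are
   SC in today's OCaml; acquire/release would suffice in a C version. *)
type 'a spsc = {
  buf : 'a option array;   (* size a power of two *)
  head : int Atomic.t;     (* consumer position *)
  tail : int Atomic.t;     (* producer position *)
  mask : int;
}

let create n =
  { buf = Array.make n None; head = Atomic.make 0;
    tail = Atomic.make 0; mask = n - 1 }

(* Producer only. *)
let try_push q x =
  let t = Atomic.get q.tail in
  if t - Atomic.get q.head > q.mask then false   (* full *)
  else begin
    q.buf.(t land q.mask) <- Some x;
    Atomic.set q.tail (t + 1);   (* would be a release store in C *)
    true
  end

(* Consumer only. *)
let try_pop q =
  let h = Atomic.get q.head in
  if h = Atomic.get q.tail then None   (* would be an acquire load in C *)
  else begin
    let x = q.buf.(h land q.mask) in
    q.buf.(h land q.mask) <- None;
    Atomic.set q.head (h + 1);
    x
  end
```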