ocaml-multicore / domainslib

Parallel Programming over Domains
ISC License

Utilise effect handlers #51

Closed · kayceesrk closed this 2 years ago

kayceesrk commented 2 years ago

This PR uses effect handlers to create tasks. The use of effect handlers will fix the issue described in https://github.com/ocaml-multicore/ocaml-multicore/issues/670#issuecomment-938436861. The new test test_deadlock.ml precisely captures the scenario explained in the comment. This test program (after removal of T.run) will deadlock on domainslib.0.3.1 but runs to completion on this branch.

Unfortunately, this introduces a breaking change: computations now need to be enclosed in a run function due to the use of effect handlers. I have also yet to benchmark the performance impact of this change.
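For illustration, here is a minimal sketch of what the new shape looks like (fib, the pool size, and the setup_pool parameter name are placeholders taken from domainslib 0.3.x, not part of this PR):

(* Minimal sketch: async/await must now run inside Task.run, which installs
   the effect handler. *)
module T = Domainslib.Task

let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

let () =
  let pool = T.setup_pool ~num_additional_domains:3 () in
  let result =
    T.run pool (fun () ->
      let a = T.async pool (fun () -> fib 30) in
      let b = T.async pool (fun () -> fib 31) in
      T.await pool a + T.await pool b)
  in
  T.teardown_pool pool;
  Printf.printf "%d\n" result

Without the enclosing T.run, the async/await effects would have no handler installed.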

This PR also makes domainslib work only with 4.12+domains (and OCaml 5.00); the stdlib EffectHandlers module is not made available on 4.12.0+domains+effects. In order to address this, the PR https://github.com/ocaml-multicore/ocaml-multicore/pull/689 adds the EffectHandlers module to 4.12.0+domains+effects.

kayceesrk commented 2 years ago

CC @ctk21 @edwintorok @Sudha247.

kayceesrk commented 2 years ago

Here are performance results from turing (24 isolated x86 cores in a 28-core machine). The variant 4.12+domains uses the domainslib from this PR, and 4.12+domains+effects uses domainslib.0.3.0. The results show that there is negligible performance impact from using effect handlers in domainslib.

[image: performance comparison table for the two variants]

kayceesrk commented 2 years ago

The commit https://github.com/ocaml-multicore/domainslib/pull/51/commits/d023dc21ca2e609df797caf16c4ddbbb2b05fec6 depends on https://github.com/ocaml-multicore/ocaml-multicore/pull/693.

With this feature, a program in which an exception is thrown from an awaited task prints a backtrace that includes both the awaiting and the awaited tasks:

Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
Called from Backtrace.foo in file "test/backtrace.ml", line 6, characters 11-27 
Called from Backtrace.bar in file "test/backtrace.ml", line 14, characters 4-9
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Stdlib__effectHandlers.Deep.discontinue_with_backtrace.(fun) in file "effectHandlers.ml", line 41, characters 4-38
Called from Domainslib__Task.await in file "lib/task.ml", line 61, characters 6-44
Called from Backtrace.main.(fun) in file "test/backtrace.ml", line 23, characters 4-18
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Domainslib__Task.run.loop in file "lib/task.ml", line 101, characters 23-57
Called from Backtrace.main in file "test/backtrace.ml", line 21, characters 2-135
Called from Backtrace in file "test/backtrace.ml", line 28, characters 6-13

"test/backtrace.ml", line 23 is the await call.

kayceesrk commented 2 years ago

A few valid concerns were raised by @ctk21 about the current implementation:

  1. Work-first or help-first: Under a work-first policy, a newly spawned async is executed by the domain immediately, leaving its continuation to be stolen. Under a help-first policy, a newly spawned async is pushed to the work-stealing queue, and the domain continues executing the original task. Cilk goes for a work-first policy, whereas this PR is help-first. It is easy to support both policies (by taking an optional policy parameter in async), but we should decide which policy should be the default. Previous studies indicate that no single policy is best in all cases [1]. (A toy sketch contrasting the two policies follows this list.)
  2. Task hopping between domains: When a task blocks on an await and is woken up, it is put back into the work-stealing queue. This means that the task may switch to a different domain on the same pool at await. This will make it difficult to program with domain-local state. Should an async task be allowed to be pinned to a domain?
  3. Top-level run function: The need to enclose computations in a top-level run function is unwieldy, but it looks necessary.

[1] https://www.cs.rice.edu/~yguo/pubs/PID824943.pdf
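To make the difference in point 1 concrete, here is a toy single-domain sketch. It is written against the OCaml 5 Effect module for brevity (the PR itself targets the stdlib EffectHandlers module), and the Fork effect, the queue, and run_with are illustrative assumptions, not domainslib internals:

open Effect
open Effect.Deep

type _ Effect.t += Fork : (unit -> unit) -> unit Effect.t

let run_with ~work_first main =
  let queue = Queue.create () in
  let rec spawn f =
    match_with f ()
      { retc = (fun () -> next ());
        exnc = raise;
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Fork g ->
              Some (fun (k : (a, unit) continuation) ->
                if work_first then begin
                  (* work-first: run the child now; the parent's continuation
                     becomes work that can be picked up (stolen) later *)
                  Queue.push (fun () -> continue k ()) queue;
                  spawn g
                end else begin
                  (* help-first: park the child as new work; keep executing
                     the parent *)
                  Queue.push (fun () -> spawn g) queue;
                  continue k ()
                end)
          | _ -> None) }
  and next () =
    match Queue.pop queue with
    | f -> f ()
    | exception Queue.Empty -> ()
  in
  spawn main

(* With ~work_first:false (help-first, as in this PR) this prints "parent"
   before "child"; with ~work_first:true the order is reversed. *)
let () =
  run_with ~work_first:false (fun () ->
    perform (Fork (fun () -> print_endline "child"));
    print_endline "parent")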

edwintorok commented 2 years ago

I assume that once a future version of OCaml gets an effect type system, we'd be able to determine at compile time when a handler is missing.

Meanwhile, I'm not sure whether there is a type-level solution that is not too invasive (I don't want to force ending up with a monad or another solution that requires plumbing a parameter through everything). Perhaps a compromise would be to include lightweight (let*) and (and*) operators for more easily converting from Lwt-style code (I think (let*) can be a simple unit -> 'a delay, and only (and*) would actually need to launch things asynchronously); a way to 'run' that monad would then be to call it with the function that installs the effect handler. But that is probably best done as a separate PR.
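For concreteness, here is a rough sketch of what such operators could look like on top of this PR's API; the functor, the names, and the 'a t = unit -> 'a representation are all assumptions for illustration, not an existing domainslib interface:

(* 'a t is a delayed computation; only (and*) launches parallel tasks. *)
module Par (P : sig val pool : Domainslib.Task.pool end) = struct
  module T = Domainslib.Task

  type 'a t = unit -> 'a

  let return x : 'a t = fun () -> x

  (* "running" the monad is just Task.run, which installs the effect handler *)
  let run (x : 'a t) : 'a = T.run P.pool x

  (* (let*) only sequences; nothing runs in parallel here *)
  let ( let* ) (x : 'a t) (f : 'a -> 'b t) : 'b t = fun () -> f (x ()) ()

  (* (and*) is where the two sides become parallel tasks *)
  let ( and* ) (a : 'a t) (b : 'b t) : ('a * 'b) t = fun () ->
    let pa = T.async P.pool a and pb = T.async P.pool b in
    (T.await P.pool pa, T.await P.pool pb)
end

(* usage, assuming a pool is in scope:
     let open Par (struct let pool = pool end) in
     run (let* x = (fun () -> fib 30)
          and* y = (fun () -> fib 31) in
          return (x + y))
*)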

edwintorok commented 2 years ago

I've tested this on my original program that failed previously, and it works now.

It is an implementation of find PATH -size +1c -type f, but using parallel for loops and openat/fstatat to walk the tree. The Rust implementation in fd is quite fast too, although at times it is outperformed by the single-threaded find on something like /usr, because find uses openat/fstatat while fd uses only opendir and lstat and therefore has to look up path components over and over again on deep trees. My OCaml implementation uses openat/fstatat, and it seems to beat both find and fd when walking my home directory with a warm cache (about 13317085 non-empty files in it). It looks like quite a good test for IO/syscall-intensive activity:

dune exec --profile=release ./testme.exe /home/edwin > /dev/null  199.19s user 52.84s system 2318% cpu 10.870 total
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  193.58s user 52.49s system 2318% cpu 10.612 total
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  59.72s user 264.94s system 2182% cpu 14.876 total
find /home/edwin -size +1c -type f 2> /dev/null  4.02s user 19.68s system 99% cpu 23.846 total
13317085

Interestingly the OCaml implementation spends a lot more time in userspace and a lot less time in kernelspace (and less time overall).

With cold caches the benefits of parallelism become more obvious:

sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  59.28s user 111.25s system 1104% cpu 15.436 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  49.85s user 209.28s system 975% cpu 26.564 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
find /home/edwin -size +1c -type f 2> /dev/null  5.62s user 52.82s system 33% cpu 2:52.89 total

The speed of Lwt and Luv is rather disappointing, though. With warm caches, and without even printing the filenames but just counting them, I get something that uses 185% CPU but barely achieves a speedup over the sequential find (this is even if I implement Lwt.Unix.openat):

OCAMLRUNPARAM=b dune exec --profile=release ./testmelwt.exe /home/edwin  65.79s user 176.13s system 185% cpu 2:10.39 total

(I don't have a working Luv version at hand though).

Once I've cleaned up my code I'll post it online for comparison; I was trying to keep the multicore and Lwt/Luv versions quite similar in implementation. It looks like being able to execute OCaml code on the same thread as soon as we get a result from the syscall makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them up again).
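Not the actual program, but a much simplified sketch of the shape of such a walk on top of this PR (plain opendir/lstat instead of openat/fstatat, no error handling, and the pool size is arbitrary):

(* Count non-empty regular files, spawning one task per subdirectory. *)
module T = Domainslib.Task

let read_entries path =
  let d = Unix.opendir path in
  Fun.protect ~finally:(fun () -> Unix.closedir d) @@ fun () ->
  let rec loop acc =
    match Unix.readdir d with
    | "." | ".." -> loop acc
    | e -> loop (e :: acc)
    | exception End_of_file -> acc
  in
  loop []

let rec count pool path =
  let dirs, files =
    read_entries path
    |> List.partition_map (fun e ->
         let p = Filename.concat path e in
         let st = Unix.lstat p in
         match st.Unix.st_kind with
         | Unix.S_DIR -> Either.Left p
         | Unix.S_REG when st.Unix.st_size > 0 -> Either.Right 1
         | _ -> Either.Right 0)
  in
  let promises = List.map (fun p -> T.async pool (fun () -> count pool p)) dirs in
  List.fold_left (+) 0 files
  + List.fold_left (fun acc pr -> acc + T.await pool pr) 0 promises

let () =
  let pool = T.setup_pool ~num_additional_domains:23 () in
  let n = T.run pool (fun () -> count pool Sys.argv.(1)) in
  T.teardown_pool pool;
  Printf.printf "%d\n" n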

kayceesrk commented 2 years ago

@ctk21 wanted the results of the test/task_throughput.ml benchmark. Here are the results.

On master (which doesn't use effect handlers):

$ dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768   n_units: 32768  n_domains: 8
Timings (ns): n=32768  mean=26277.7
 (       0,       32):      0
 [      32,       64):      0
 [      64,      128):      0
 [     128,      256):      0
 [     256,      512):      0
 [     512,     1024):      0
 [    1024,     2048):      0
 [    2048,     4096):      0
 [    4096,     8192):      0
 [    8192,    16384):      0
 [   16384,    32768):  31811
 [   32768,    65536):    931
 [   65536,   131072):     23
 [  131072,   262144):      2
 [  262144,   524288):      0
 [  524288,  1048576):      1
 [ 1048576,  2097152):      0
 [ 2097152,  4194304):      0
 [ 4194304,  8388608):      0
 [ 8388608,      Inf):      0

with GC stats:

allocated_words: 39466833
minor_words: 39464751
promoted_words: 20624
major_words: 22706
minor_collections: 239
major_collections: 6
forced_major_collections: 0
heap_words: 190448
top_heap_words: 210928
mean_space_overhead: 2528.876316

On this branch:

$ dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768   n_units: 32768  n_domains: 8
Timings (ns): n=32768  mean=32750.8
 (       0,       32):      0
 [      32,       64):      0
 [      64,      128):      0
 [     128,      256):      0
 [     256,      512):      0
 [     512,     1024):      0
 [    1024,     2048):      0
 [    2048,     4096):      0
 [    4096,     8192):      0
 [    8192,    16384):      0
 [   16384,    32768):  21733
 [   32768,    65536):  10877
 [   65536,   131072):    151
 [  131072,   262144):      5
 [  262144,   524288):      0
 [  524288,  1048576):      2
 [ 1048576,  2097152):      0
 [ 2097152,  4194304):      0
 [ 4194304,  8388608):      0
 [ 8388608,      Inf):      0

with GC stats:

allocated_words: 154131213
minor_words: 154129199
promoted_words: 79934
major_words: 81948
minor_collections: 779
major_collections: 11
forced_major_collections: 0
heap_words: 227312
top_heap_words: 231408
mean_space_overhead: 1517.998989

The effects version has somewhat higher latency than master (mean ~32.8 µs vs ~26.3 µs per task). It also tends to allocate ~4x more than the master version, almost all of it in the minor heap. It would be worthwhile going through the code to reduce allocations.

jberdine commented 2 years ago

2. Task hopping between domains: When a task blocks on an await and is woken up, it is put back into the work-stealing queue. This means that the task may switch to a different domain on the same pool at await. This will make it difficult to program with domain-local state. Should an async task be allowed to be pinned to a domain?

Is the following a valid way of thinking about this question: if a program needs "task-local" state, then:

  1. with tasks that are not pinned to domains, the program needs to implement task-local state itself;
  2. with tasks that can be pinned to domains, domain-local state will also be task-local, and the program has an easy job?

I guess that supporting pinning is somewhat complex, and that it adds scheduling constraints that can reduce performance. On the other hand, if task-local state is needed, would you expect that it might be the most efficient implementation strategy?
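As a concrete illustration of the hazard (hypothetical code; it assumes the Domain.DLS domain-local storage API and this PR's Task module):

(* With unpinned tasks, domain-local state read before and after an await
   may belong to two different domains. *)
module T = Domainslib.Task

let scratch : Buffer.t Domain.DLS.key =
  Domain.DLS.new_key (fun () -> Buffer.create 64)

let uses_domain_local_state pool promise =
  let before = Domain.DLS.get scratch in
  let _ = T.await pool promise in        (* the task may resume elsewhere *)
  let after = Domain.DLS.get scratch in
  (* [before] and [after] are not guaranteed to be the same buffer *)
  before == after

If tasks could be pinned, before == after would be guaranteed and the domain-local buffer would effectively be task-local.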

avsm commented 2 years ago

@edwintorok wrote:

It looks like being able to execute OCaml code on the same thread as soon as we get a result from the syscall makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them up again).

Does your reimplementation of find use the eio library? That will use io_uring on Linux and should have very little syscall activity indeed. I'd be interested in seeing that vs a direct syscall implementation.

edwintorok commented 2 years ago


Not yet, but it looks like io_uring has evolved and can now do openat2 and statx calls too. It would be interesting to patch eio to support those and then try an eio/domainslib hybrid that uses domainslib tasks only for readdir (which has no io_uring equivalent yet).


avsm commented 2 years ago

Resolved the conflicts against trunk and fixed the deprecation warnings.

talex5 commented 2 years ago

Not yet, but it looks like io_uring has evolved and can now do openat2 and statx calls too. Would be interesting to patch eio to support those

Note that ocaml-uring and eio already support openat2: https://github.com/ocaml-multicore/ocaml-uring/blob/main/lib/uring/uring.mli#L106

Patches adding statx are welcome too...

kayceesrk commented 2 years ago

Thanks for the review @ctk21!

@Sudha247, what are your thoughts on the proposal to bump the major version? What are the downstream packages that need to be updated?

Sudha247 commented 2 years ago

Agreed, we should bump the major version owing to the significance of this change.

You're right, Sandmark and Lwt_domain need to be updated at the very least; we can do it immediately after the release. The docs need to be updated too!

kayceesrk commented 2 years ago

The docs need to be updated too!

The domainslib API docs are already updated. Which docs are you referring to @Sudha247?

Sudha247 commented 2 years ago

Ah, I meant https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml :)

Also, after testing this PR with Lwt_domain, it is fine as is and doesn't require any changes, i.e. it works the same with domainslib.0.3.2 and this branch. So we'll need to update only Sandmark and the above tutorial, both of which can be done after the new release. I think we are good to merge this one.

kayceesrk commented 2 years ago

Thanks. Merging now.