kayceesrk closed this pull request 2 years ago.
CC @ctk21 @edwintorok @Sudha247.
Here are performance results from turing (24 isolated x86 cores in a 28-core machine). The variant `4.12+domains` uses the domainslib from this PR, and `4.12+domains+effects` uses `domainslib.0.3.0`. The results show that there is negligible performance impact to using effect handlers in domainslib.
The commit https://github.com/ocaml-multicore/domainslib/pull/51/commits/d023dc21ca2e609df797caf16c4ddbbb2b05fec6 depends on https://github.com/ocaml-multicore/ocaml-multicore/pull/693.
With this feature, this program, in which an exception is thrown from `await`, prints a backtrace that includes both the awaiting and the awaited task:
```
Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
Called from Backtrace.foo in file "test/backtrace.ml", line 6, characters 11-27
Called from Backtrace.bar in file "test/backtrace.ml", line 14, characters 4-9
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Stdlib__effectHandlers.Deep.discontinue_with_backtrace.(fun) in file "effectHandlers.ml", line 41, characters 4-38
Called from Domainslib__Task.await in file "lib/task.ml", line 61, characters 6-44
Called from Backtrace.main.(fun) in file "test/backtrace.ml", line 23, characters 4-18
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Domainslib__Task.run.loop in file "lib/task.ml", line 101, characters 23-57
Called from Backtrace.main in file "test/backtrace.ml", line 21, characters 2-135
Called from Backtrace in file "test/backtrace.ml", line 28, characters 6-13
```
"test/backtrace.ml", line 23
is the await
call.
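For context, a minimal program exercising this behaviour might look like the following. This is a hedged sketch: `foo`, `bar`, and the pool size are illustrative, not the actual contents of `test/backtrace.ml`.

```ocaml
(* Illustrative sketch: an exception raised inside an awaited task.
   With backtrace propagation, running this under OCAMLRUNPARAM=b
   prints frames from both the awaited task and the awaiting one. *)
let foo () = failwith "boom"
let bar () = foo ()

let () =
  let module T = Domainslib.Task in
  let pool = T.setup_pool ~num_domains:1 () in
  T.run pool (fun () ->
      let p = T.async pool bar in
      (* The failure inside [bar] is re-raised here, at the await. *)
      T.await pool p);
  T.teardown_pool pool
```

The exception escapes `T.run` and terminates the program, at which point the combined backtrace is printed.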
There were a couple of valid concerns raised by @ctk21 about the current implementation:
(`async`), but we should decide which policy is the one to go for by default. Previous studies indicate that no one policy is better for all cases [1]. I assume that once a future version of OCaml gets an effect type system we'd be able to determine at compile time when a handler is missing. Meanwhile, I'm not sure whether there is a type-level solution that is not too invasive (we don't want to force ending up with a monad or another solution that requires plumbing a parameter through everything). Perhaps a compromise would be to include lightweight `(let*)` and `(and*)` operators for more easily converting from Lwt-style code (I think that `let*` can be a simple `unit -> 'a` delay, and only `and*` would actually need to launch things async), and then a way to 'run' that monad would be to call it with the function that installs the effect handler. But that is probably best done as a separate PR.
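As a rough illustration of the compromise described above, the binding operators could look something like this. This is only a sketch, not part of this PR; the functor shape and names are invented for the example.

```ocaml
(* Sketch of lightweight Lwt-style syntax over domainslib promises.
   A computation is a thunk: [let*] merely forces it (no spawning);
   only [and*] actually launches work in parallel. *)
open Domainslib

module Syntax (P : sig
  val pool : Task.pool
end) =
struct
  (* [let*] is a simple delay: force the thunk on the current task. *)
  let ( let* ) thunk f = f (thunk ())

  (* [and*] pairs two delayed computations, running them via
     async/await so both sides proceed in parallel. *)
  let ( and* ) a b () =
    let pa = Task.async P.pool a and pb = Task.async P.pool b in
    (Task.await P.pool pa, Task.await P.pool pb)
end
```

With this, `let* x = (fun () -> f ()) and* y = (fun () -> g ()) in ...` would run `f` and `g` in parallel, and 'running' the monad amounts to evaluating the whole expression under `Task.run`, which installs the effect handler.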
I've tested this on my original program that failed previously, and it works now.
(It is an implementation of `find PATH -size +1c -type f`, but using parallel for loops and `openat`/`fstatat` to walk the tree. The Rust implementation in `fd` is quite fast too, although at times it is outperformed by the single-threaded `find` on something like `/usr`: `find` uses `openat`/`fstatat`, whereas `fd` uses only `opendir` and `lstat` and thus has to look up path components over and over again on deep trees.)
My OCaml implementation uses `openat`/`fstatat`, and it seems to beat both `find` and `fd` at walking my homedir with a warm cache (with about 13317085 non-empty files in it); it looks like quite a good test for IO/syscall-intensive activity:
```
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  199.19s user 52.84s system 2318% cpu 10.870 total
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  193.58s user 52.49s system 2318% cpu 10.612 total
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  59.72s user 264.94s system 2182% cpu 14.876 total
find /home/edwin -size +1c -type f 2> /dev/null  4.02s user 19.68s system 99% cpu 23.846 total
13317085
```
Interestingly the OCaml implementation spends a lot more time in userspace and a lot less time in kernelspace (and less time overall).
With cold caches the benefits of parallelism become more obvious:
```
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  59.28s user 111.25s system 1104% cpu 15.436 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  49.85s user 209.28s system 975% cpu 26.564 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
find /home/edwin -size +1c -type f 2> /dev/null  5.62s user 52.82s system 33% cpu 2:52.89 total
```
The speed of Lwt and Luv is rather disappointing though: with warm caches, and not even printing the filenames but just counting them, I get something that uses 185% CPU but barely achieves a speedup over the sequential `find` (this is even if I implement `Lwt.Unix.openat`):
```
OCAMLRUNPARAM=b dune exec --profile=release ./testmelwt.exe /home/edwin  65.79s user 176.13s system 185% cpu 2:10.39 total
```
(I don't have a working Luv version at hand though).
Once I've cleaned up my code I'll post it online for comparison; I was trying to keep the multicore and Lwt/Luv versions quite similar in implementation. It looks like having the ability to execute OCaml code on the same thread as soon as we get a result from the syscall makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them again).
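As a rough illustration of the kind of parallel walk described above, here is a hedged sketch with domainslib. It uses portable `Sys.readdir` and `Unix.lstat` rather than `openat`/`fstatat` (which the stdlib `Unix` module does not expose), and the pool size is illustrative; it is not the author's actual code.

```ocaml
(* Parallel tree walk counting regular files larger than one byte.
   Requires the unix library. Each subdirectory becomes a task, and
   domainslib's work-stealing scheduler balances the irregular tree. *)
let rec walk pool dir =
  let promises = ref [] in
  let here = ref 0 in
  Array.iter
    (fun name ->
      let path = Filename.concat dir name in
      match Unix.lstat path with
      | { Unix.st_kind = Unix.S_DIR; _ } ->
          (* Spawn a task per subdirectory. *)
          promises :=
            Domainslib.Task.async pool (fun () -> walk pool path)
            :: !promises
      | { Unix.st_kind = Unix.S_REG; st_size; _ } when st_size > 1 ->
          incr here
      | _ -> ()
      | exception Unix.Unix_error _ -> ())
    (try Sys.readdir dir with Sys_error _ -> [||]);
  (* Sum this directory's count with the awaited subtree counts. *)
  List.fold_left
    (fun acc p -> acc + Domainslib.Task.await pool p)
    !here !promises

let () =
  let pool = Domainslib.Task.setup_pool ~num_domains:23 () in
  let n = Domainslib.Task.run pool (fun () -> walk pool Sys.argv.(1)) in
  Printf.printf "%d\n" n;
  Domainslib.Task.teardown_pool pool
```

Because `await` suspends the task rather than blocking a domain, the traversal keeps all domains busy even on deeply nested trees.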
@ctk21 wanted the results of the `test/task_throughput.ml` benchmark. Here are the results. On `master` (which doesn't use effect handlers):
```
$ dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768 n_units: 32768 n_domains: 8
Timings (ns): n=32768 mean=26277.7
(       0,      32): 0
[      32,      64): 0
[      64,     128): 0
[     128,     256): 0
[     256,     512): 0
[     512,    1024): 0
[    1024,    2048): 0
[    2048,    4096): 0
[    4096,    8192): 0
[    8192,   16384): 0
[   16384,   32768): 31811
[   32768,   65536): 931
[   65536,  131072): 23
[  131072,  262144): 2
[  262144,  524288): 0
[  524288, 1048576): 1
[ 1048576, 2097152): 0
[ 2097152, 4194304): 0
[ 4194304, 8388608): 0
[ 8388608,     Inf): 0
```
with GC stats:
```
allocated_words: 39466833
minor_words: 39464751
promoted_words: 20624
major_words: 22706
minor_collections: 239
major_collections: 6
forced_major_collections: 0
heap_words: 190448
top_heap_words: 210928
mean_space_overhead: 2528.876316
```
On this branch:
```
$ dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768 n_units: 32768 n_domains: 8
Timings (ns): n=32768 mean=32750.8
(       0,      32): 0
[      32,      64): 0
[      64,     128): 0
[     128,     256): 0
[     256,     512): 0
[     512,    1024): 0
[    1024,    2048): 0
[    2048,    4096): 0
[    4096,    8192): 0
[    8192,   16384): 0
[   16384,   32768): 21733
[   32768,   65536): 10877
[   65536,  131072): 151
[  131072,  262144): 5
[  262144,  524288): 0
[  524288, 1048576): 2
[ 1048576, 2097152): 0
[ 2097152, 4194304): 0
[ 4194304, 8388608): 0
[ 8388608,     Inf): 0
```
with GC stats:
```
allocated_words: 154131213
minor_words: 154129199
promoted_words: 79934
major_words: 81948
minor_collections: 779
major_collections: 11
forced_major_collections: 0
heap_words: 227312
top_heap_words: 231408
mean_space_overhead: 1517.998989
```
The effects version has somewhat higher mean latency than `master` (32.8 μs vs 26.3 μs). The effects version also tends to allocate ~4x more than the `master` version, almost all of it in the minor heap. It would be worthwhile going through the code to reduce allocations.
> 2. Task hopping between domains: When a task blocks on an await and is woken up, it is put back into the work-stealing queue. This means that the task may switch to a different domain on the same pool at await. This will make it difficult to program with domain-local state. Should an async task be allowed to be pinned to a domain?
Is a valid way of thinking about this question that if a program needs "task-local" state, then:
I guess that supporting pinning is somewhat complex, and that it adds scheduling constraints that can reduce performance. On the other hand, if task-local state is needed, would you expect that it might be the most efficient implementation strategy?
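To make the hazard concrete, here is a hedged sketch of how task hopping interacts with `Domain.DLS`-based domain-local state (the OCaml 5 stdlib mechanism); the names and buffer use are illustrative only.

```ocaml
(* Domain-local state: each domain gets its own buffer. *)
let buf_key = Domain.DLS.new_key (fun () -> Buffer.create 64)

let task pool p =
  Buffer.add_string (Domain.DLS.get buf_key) "before await; ";
  (* Blocking here re-queues the task, which may then be stolen by a
     different domain in the pool... *)
  let _ = Domainslib.Task.await pool p in
  (* ...so this [get] may return a *different* domain's buffer, and
     the text written before the await sits in another domain's state. *)
  Buffer.add_string (Domain.DLS.get buf_key) "after await"
```

Pinning would make the two `get` calls agree, at the cost of constraining the scheduler; a task-local storage mechanism would sidestep the problem without pinning.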
@edwintorok wrote:
It looks like having the ability to execute OCaml code as soon as we get a result from the syscal on the same thread makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them again).
Does your reimplementation of find use the `eio` library? That will use io_uring on Linux and should have very little syscall activity indeed. I'd be interested in seeing that vs a direct syscall implementation.
On 1 November 2021 15:56:56 GMT, Anil Madhavapeddy @.***> wrote:
Not yet, but it looks like `io_uring` has evolved and can now do `openat2` and `statx` calls too. It would be interesting to patch `eio` to support those and then try an eio/domainslib hybrid that uses domainslib tasks only for `readdir` (which has no uring equivalent yet).
Resolved the conflicts against trunk and the deprecation warnings.
> Not yet, but it looks like `io_uring` has evolved and can now do `openat2` and `statx` calls too. Would be interesting to patch `eio` to support those
Note that ocaml-uring and eio already support openat2: https://github.com/ocaml-multicore/ocaml-uring/blob/main/lib/uring/uring.mli#L106
Patches adding `statx` are welcome too...
Thanks for the review @ctk21!
@Sudha247, what are your thoughts on the proposal to bump the major version? What are the downstream packages that need to be updated?
Agreed, we should bump the major version owing to the significance of this change.
You're right, Sandmark and Lwt_domain need to be updated at the very least; we can do it immediately after the release. The docs need to be updated too!
> The docs need to be updated too!
The domainslib API docs are already updated. Which docs are you referring to @Sudha247?
Ah, I meant https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml :)
Also, after testing this PR with `Lwt_domain`, it is fine as is and doesn't require any changes, i.e. it works the same with domainslib.0.3.2 and this branch. So we'll need to update only Sandmark and the above tutorial, both of which can be done after the new release. I think we are good to merge this one.
Thanks. Merging now.
This PR uses effect handlers to create tasks. The use of effect handlers will fix the issue described in https://github.com/ocaml-multicore/ocaml-multicore/issues/670#issuecomment-938436861. The new test `test_deadlock.ml` precisely captures the scenario explained in the comment. This test program (after removal of `T.run`) will deadlock on `domainslib.0.3.1` but runs to completion on this branch.

Unfortunately, this introduces a breaking change; the computations need to be enclosed in a `run` function due to the use of effect handlers. I am also yet to do performance benchmarking of this change.

This PR also makes domainslib work only with `4.12+domains` (and OCaml 5.00); the stdlib `EffectHandlers` module is not made available on `4.12.0+domains+effects`. In order to address this, the PR https://github.com/ocaml-multicore/ocaml-multicore/pull/689 adds the `EffectHandlers` module to `4.12.0+domains+effects`.
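To illustrate the breaking change, here is a hedged sketch using the Task API (the pool size and computation are illustrative): the `async`/`await` calls, not just the pool setup, must now be enclosed in `run`, which installs the effect handler that services the task effects.

```ocaml
let () =
  let module T = Domainslib.Task in
  let pool = T.setup_pool ~num_domains:3 () in
  (* Previously async/await could be invoked directly; with
     effect-handler-based tasks they must execute under [T.run pool],
     otherwise the task effects have no handler. *)
  let result =
    T.run pool (fun () ->
        let p = T.async pool (fun () -> 21 * 2) in
        T.await pool p)
  in
  Printf.printf "%d\n" result;
  T.teardown_pool pool
```

Code written against the pre-effects API only needs its top-level computation wrapped in `run`; the `async`/`await` call sites themselves are unchanged.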