Closed talex5 closed 12 months ago
I tested a deployment going via the real scheduler (using the solver-test
pool) on arm64-jade-1 (thanks to @mtelvers for help setting it up!). It's a little hard to get accurate benchmarks because the machine is also being used for builds at the same time, but here are my results.
All tests are using the new stress.exe:
dune exec -- stress/stress.exe cluster talex5.cap --count=50 --solver-pool=solver-test
Old solver (/usr/local/bin/solver-worker -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --state-dir=/var/cache/solver --capacity=40 --internal-workers=160 --verbosity=info
):
Solved warm-up requests in: 65.90s
Running another 50 solves...
50/50 complete
Solved 50 requests in 17.43s (0.35s/iter)
The new solver can't use 160 cores (OCaml 5 is limited to 128 domains), but in practice anything above 6 seems to give rapidly diminishing returns anyway. However, we can run multiple instances of the new version. With three instances using 6 worker domains each, I get (/usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-{1,1b,1c}.equinix.ci.dev' --cache-dir=/var/cache/solver{,2,3} --capacity=12 --internal-workers=6 --verbosity=info
):
Solved warm-up requests in: 24.80s
Running another 50 solves...
50/50 complete
Solved 50 requests in 17.84s (0.36s/iter)
For comparison, here's the new solver with a single 18-core instance (--capacity=36 --internal-workers=18
):
Solved warm-up requests in: 37.22s
Running another 50 solves...
50/50 complete
Solved 50 requests in 38.85s (0.78s/iter)
Cold-start time (loading a new opam-repository Git commit) is always better with the new version.
In summary: the old version scales better with more cores, but the new version can get the same performance with fewer cores in total if we run more than one instance. Note: I didn't try optimising the old version (e.g. running with fewer workers); it's possible that's not optimal either.
Given that we only have one solver service in the prod cluster at the moment (but using a large number of cores), it would probably already be a benefit to deploy several instances of the new solver, using just a few cores on each machine.
Updated benchmarks, with the machine otherwise idle:
The old server in its original configuration (up to 160 worker processes, though probably only 80 being used at once due to capacity 40 and 2 platforms per stress request):
# /usr/local/bin/old/solver-worker -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --state-dir=/var/cache/solver --capacity=40 --internal-workers=160 --verbosity=info
Solved warm-up requests in: 65.90s
Running another 50 solves...
50/50 complete
Solved 50 requests in 17.06s (0.34s/iter)
A single new solver with 6 worker domains:
# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=12 --internal-workers=6 --verbosity=info
Solved warm-up requests in: 44.86s
Running another 50 solves...
50/50 complete
Solved 50 requests in 45.44s (0.91s/iter)
A single new solver with 5 worker domains is faster:
# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=10 --internal-workers=5 --verbosity=info
Solved warm-up requests in: 38.07s
Running another 50 solves...
50/50 complete
Solved 50 requests in 35.43s (0.71s/iter)
A single new solver with 4 worker domains is faster still:
# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=8 --internal-workers=4 --verbosity=info
Solved warm-up requests in: 36.42s
Running another 50 solves...
50/50 complete
Solved 50 requests in 33.51s (0.67s/iter)
3 worker domains is slower, though:
# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=6 --internal-workers=3 --verbosity=info
Solved warm-up requests in: 43.55s
Running another 50 solves...
50/50 complete
Solved 50 requests in 41.76s (0.84s/iter)
2 instances of the new solver, 4 worker domains each:
Solved warm-up requests in: 23.63s
Running another 50 solves...
50/50 complete
Solved 50 requests in 18.21s (0.36s/iter)
3 instances of the new solver, 4 worker domains each, is faster than the old solver and using fewer cores:
Solved warm-up requests in: 20.90s
Running another 50 solves...
50/50 complete
Solved 50 requests in 14.79s (0.30s/iter)
Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool. With that restriction removed and lower cpu requirements, could we deploy a solver work to many machines in the cluster along side the normal worker?
Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool.
I'm not aware of any such restriction (which would seem to defeat the purpose of using the cluster). Perhaps @moyodiallo remembers what the problem was?
Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool.
@mtelvers are you talking about the pool of process behind the service ? But with the old version it possible to have more than one solver-worker
connected to the scheduler.
could we deploy a solver work to many machines in the cluster along side the normal worker?
Yes we can.
@talex5 Thanks for the port, I read the code it's wonderful. Easier to read than the old version.
There's just one last thing, after that we can deploy it on the prod. It is about the switch when a job is canceled.
Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool.
@mtelvers are you talking about the pool of process behind the service ? But with the old version it possible to have more than one
solver-worker
connected to the scheduler.
OK. It is different to what I remember, I originally wanted to use both jade-1 and jade-2 to distribute the solver service, but you advised that it was not possible. However, these are excellent developments!
OK. It is different to what I remember, I originally wanted to use both jade-1 and jade-2 to distribute the solver service, but you advised that it was not possible. However, these are excellent developments!
Yeah my bad, this is because we was hitting on some bugs, more easier to track with one machine.
There's just one last thing, after that we can deploy it on the prod. It is about the switch when a job is canceled.
Since solve jobs only take about a second to run, I just ignored cancellation. The only thing that takes much time is doing a git-fetch, and it's probably safest not to cancel that anyway.
There's just one last thing, after that we can deploy it on the prod. It is about the switch when a job is canceled.
Since solve jobs only take about a second to run, I just ignored cancellation. The only thing that takes much time is doing a git-fetch, and it's probably safest not to cancel that anyway.
Yes, in the case of domains there's no way to cancel, we could avoid running a request if its switch is off.
To put it in context: Let say we're at the limit, the solver-worker receive 200 jobs from the scheduler and each jobs has 10 platforms (could be more). Suddenly the CI cancel all those jobs (that could happen when there's two consecutive push on opam-repository), now ending up solving 2000 requests per platform, before starting the new jobs.
Let say we're at the limit, the solver-worker receive 200 jobs from the scheduler
Ideally, the solver should only be pulling jobs from the scheduler when it's ready to run them. Otherwise, it might be taking jobs that another worker could be starting immediately. So, it should only have a backlog of a few seconds.
In the stress tests above, for example, I used a capacity of 8, so there should be a maximum of 7 unnecessary jobs run if they all get cancelled.
With that said, I think it would be perfectly fine to add support to the solver for this. But there's always a risk of adding bugs this way. It looks like the current code responds to a cancelled job by killing the entire worker process:
However, starting a replacement worker usually takes 30s or more, so this is probably a bad trade-off!
However, starting a replacement worker usually takes 30s or more, so this is probably a bad trade-off!
I get it, it could be a bad trade-off, we're going to do some test about that situation to see how it behave with the new version.
I've pushed an extra commit adding support for cancellation now.
@kayceesrk wanted some GC stats, so I've pushed a commit adding a stress local
mode, to run the solver in-process. That makes it easier to benchmark things without having to keep restarting the service. Here are the results on my 8-core x86_64 machine:
olly gc-stats './_build/default/stress/stress.exe local --cache-dir=./cache --count=10 --internal-workers=...'
1 worker : GC overhead (% of wall time): 15.58% (1.27s/iter)
2 workers : GC overhead (% of wall time): 30.87% (0.75s/iter)
3 workers : GC overhead (% of wall time): 43.98% (0.60s/iter)
4 workers : GC overhead (% of wall time): 56.23% (0.50s/iter)
5 workers : GC overhead (% of wall time): 67.93% (0.47s/iter)
6 workers : GC overhead (% of wall time): 77.34% (0.48s/iter)
7 workers : GC overhead (% of wall time): 88.53% (0.45s/iter)
Thanks, @talex5. The GC overheads are excessive and keep doubling with cores, which seems odd since the GC work should be parallelisable. So the % overhead should ideally remain flat with an increasing number of cores.
Might be of interest to @sadiqj.
The GC percent is computed against wall-time, which is wrong when number of workers is more than 1. I should've added this disclaimer to olly, sorry. The GC numbers for one worker doesn't look particularly alarming to me. Anyways, let me run some local tests on this and report back (and fix the bug in olly!).
Discovering bugs is the way we make progress :-) Glad to have found it ourselves. Looking forward to the fix.
FWIW, separately I've been discussing with @sadiqj reg including more docs for the runtime events to explain when the events are emitted and the data fields. He may want your help to do this @Sudha247 as you are a user of runtime events.
We tested with ocaml-ci today. It was mostly working, but I saw this once in the logs (Jul 24 10:47:05 arm64-jade-1
):
Need to update https://github.com/ocaml/opam-repository.git to get new commit 0266ae4df7d3eb2e456178a5f3165d99678a246b
From https://github.com/ocaml/opam-repository
* branch HEAD -> FETCH_HEAD
Raised by primitive operation at Unix.map_file in file "unix.ml", line 384, characters 2-45
Called from Git_unix.Major_heap.map in file "src/git-unix/git_unix.ml", line 384, characters 6-75
Called from Carton__Dec.W.heavy_load in file "src/carton/dec.ml", line 592, characters 18-55
Called from Carton__Dec.W.load in file "src/carton/dec.ml", line 621, characters 6-25
Called from Carton__Dec.weight_of_offset in file "src/carton/dec.ml", line 924, characters 8-31
Called from Carton_git.Make.with_resources.(fun) in file "src/carton-git/carton_git.ml", line 150, characters 10-72
Called from Lwt.Sequential_composition.catch in file "src/core/lwt.ml", line 1990, characters 16-20
2023-07-24 10:47.34 solver-worker [WARNING] Build failed: Cancelled: Unix.Unix_error(Unix.EBADF, "map_file", "")
The service was automatically restarted, which fixed it, but I've paused it while I investigate.
The GC percent is computed against wall-time, which is wrong when number of workers is more than 1. I should've added this disclaimer to olly, sorry. The GC numbers for one worker doesn't look particularly alarming to me. Anyways, let me run some local tests on this and report back (and fix the bug in olly!).
There is a PR in progress for the fix: https://github.com/tarides/runtime_events_tools/pull/19
I've added some more locking to avoid the problem above and started the service again on arm64-jade-1
.
It doesn't look like GC is the issue. This is running:
for i in 1 2 4 8 12 16 20 24 28 32; do olly gc-stats OCAMLRUNPARAM="v=0x400" "./_build/default/stress/stress.exe local --cache-dir ./cache --count=10 --internal-workers=$i" >> op 2>&1; done
on a machine with 16 cores and 32 hw threads. Something odd is going on with heap words though.
Internal workers | Allocated words | Minor words | Promoted words | Major words | Minor Collections | Major collections | Heap words | Top heap words | GC overhead (%) |
---|---|---|---|---|---|---|---|---|---|
1 | 41269478700 | 41254670221 | 854179642 | 868988121 | 157219 | 145 | 22709031 | 30940204 | 8.19 |
2 | 41269353013 | 41254541846 | 864377483 | 879188650 | 96966 | 120 | 25339318 | 31089532 | 12.51 |
4 | 41269471338 | 41254661969 | 867621809 | 882431178 | 53045 | 86 | 45524069 | 50378130 | 16.56 |
8 | 41269574198 | 41254762765 | 870674229 | 885485662 | 29568 | 60 | 81483517 | 87401394 | 18.28 |
12 | 41274350994 | 41259540213 | 873924394 | 888735175 | 21704 | 50 | 113651507 | 118256060 | 19.18 |
16 | 41274668096 | 41259857131 | 880376307 | 895187272 | 18413 | 46 | 141591008 | 146173236 | 19 |
20 | 41275384744 | 41260573456 | 885979002 | 900790290 | 16063 | 42 | 165394370 | 169711772 | 20.9 |
24 | 41275582949 | 41260770292 | 890406066 | 905218723 | 14648 | 40 | 209939555 | 214781312 | 18.58 |
28 | 41275669513 | 41260857701 | 886699819 | 901511631 | 12509 | 38 | 223300581 | 227967667 | 16.38 |
32 | 41275970071 | 41261156136 | 867731304 | 882545239 | 9426 | 37 | 190727588 | 196603130 | 8.54 |
In order to debug the heap words things, I tried to build this on OCaml 5.0.0, and got the following error:
#=== ERROR while compiling lwt_eio.0.4 ========================================#
# context 2.1.2 | linux/x86_64 | ocaml-base-compiler.5.0.0 | pinned(git+https://github.com/talex5/lwt_eio.git#27688ec6aa551b9e510cd6a13747880d97451dc4#27688ec6)
# path ~/.opam/5.0.0/.opam-switch/build/lwt_eio.0.4
# command ~/.opam/opam-init/hooks/sandbox.sh build dune build -p lwt_eio -j 4 @install
# exit-code 1
# env-file ~/.opam/log/lwt_eio-488846-7e7dfb.env
# output-file ~/.opam/log/lwt_eio-488846-7e7dfb.out
### output ###
# (cd _build/default && /home/kc/.opam/5.0.0/bin/ocamlc.opt -w -40 -g -bin-annot -I lib/.lwt_eio.objs/byte -I /home/kc/.opam/5.0.0/lib/bigstringaf -I /home/kc/.opam/5.0.0/lib/bytes -I /home/kc/.opam/5.0.0/lib/cstruct -I /home/kc/.opam/5.0.0/lib/domain-local-await -I /home/kc/.opam/5.0.0/lib/eio -I /home/kc/.opam/5.0.0/lib/eio/core -I /home/kc/.opam/5.0.0/lib/eio/unix -I /home/kc/.opam/5.0.0/lib[...]
# File "lib/lwt_eio.ml", line 69, characters 6-28:
# 69 | Lwt_unix.handle_signal Sys.sigchld
# ^^^^^^^^^^^^^^^^^^^^^^
# Error: Unbound value Lwt_unix.handle_signal
# (cd _build/default && /home/kc/.opam/5.0.0/bin/ocamlopt.opt -w -40 -g -I lib/.lwt_eio.objs/byte -I lib/.lwt_eio.objs/native -I /home/kc/.opam/5.0.0/lib/bigstringaf -I /home/kc/.opam/5.0.0/lib/bytes -I /home/kc/.opam/5.0.0/lib/cstruct -I /home/kc/.opam/5.0.0/lib/domain-local-await -I /home/kc/.opam/5.0.0/lib/eio -I /home/kc/.opam/5.0.0/lib/eio/core -I /home/kc/.opam/5.0.0/lib/eio/unix -I /home[...]
# File "lib/lwt_eio.ml", line 69, characters 6-28:
# 69 | Lwt_unix.handle_signal Sys.sigchld
# ^^^^^^^^^^^^^^^^^^^^^^
# Error: Unbound value Lwt_unix.handle_signal
It would be useful to check whether the heap words keep going up on 5.0 as well.
Here's a small test that just allocates some bytes in a loop:
let dom0_worker = false
module Usage = struct
type t = {
wall : float;
user : float;
system : float;
}
let now () =
let wall = Unix.gettimeofday () in
let process = Unix.times () in
{ wall; user = process.tms_utime; system = process.tms_stime }
let ( - ) a b =
{
wall = a.wall -. b.wall;
user = a.user -. b.user;
system = a.system -. b.system;
}
let pp f { wall; user; system } =
Format.fprintf f "wall:%.2fs user:%.2fs system:%.2fs" wall user system
end
let run_worker () =
for _ = 1 to 100_000_000 do
ignore @@ Bytes.create 500
done
let test workers =
let u0 = Usage.now () in
let domain_workers = if dom0_worker then workers - 1 else workers in
let domains = List.init domain_workers (fun _ -> Domain.spawn run_worker) in
if dom0_worker then
run_worker ();
List.iter Domain.join domains;
let u1 = Usage.now () in
Usage.(u1 - u0)
let () =
Format.printf "%s / dom0_worker=%b@." Sys.ocaml_version dom0_worker;
ignore @@ test 1;
let t1 = test 1 in
Format.printf "1 : %a@." Usage.pp t1;
[2; 4; 8; 16; 32; 64; (if dom0_worker then 128 else 127)] |> List.iter (fun i ->
let ti = test i in
let speedup = t1.wall /. ti.wall in
Format.printf "%d : %a (%.2f times faster)@." i Usage.pp ti speedup
)
5.1.0-beta1 seems better than 5.0, but with 128 cores it still takes 30 times longer!
Results from the 160-core arm machine (wall time, lower is better):
Oddly, a previous run had 5.0.0/false being really fast (4.08s for 127 cores), but I can't reproduce that now! Might be something odd going on with CPU frequency scaling when it gets hot or something?
The numbers are curious. So I did a number of things to remove potential noise. First, I run each configuration as a separate run of the program with the aim of avoiding GC impact from the previous runs. Next, I run each program multiple times using hyperfine
, which removes the possibility of a noisy run affecting the result.
The code is here: https://github.com/kayceesrk/code-snippets/tree/master/500_501_regression
The core of the benchmark remains the same as the earlier one. https://github.com/kayceesrk/code-snippets/blob/d3d1838c4199f0aed85add42d686d7a83e2b69c6/500_501_regression/test2.ml#L6-L22
I ran the benchmark on an x86 machine with 16 physical cores, and 2 hardware threads per core. The machine has 2 NUMA domains:
$ lscpu | grep -E "^Thread|^Core|^Socket|^NUMA node\("
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Here are the results:
x-axis is wall time in milliseconds. Lower is better. Ideally, the lines will be flat.
The results don't show the regression between 5.0 and 5.1 that you observe on the ARM machine. There is a drop in performance at 8 cores due to NUMA effects. Beyond 16, hyperthreading effects causes further slowdown.
The results don't show the regression between 5.0 and 5.1 that you observe on the ARM machine
It might be just a typo, but all your series are labelled "5.0.0" in that graph - is it showing the 5.1 results?
It was a typo. Fixed here:
I've updated this for Eio 0.12 (which removed all the pins).
I stopped the old solver-worker
and sent a lot of requests, one of the stress test is 250 requests at same time to the 5 new solver-worker
deployed at the moment, it's working very well.
Looks good to me, we can merge.
Updated based on comments. Should be ready to merge now.
This is a port of the solver service to OCaml 5 and Eio. Since most of the code in the old version was managing the child processes, which no longer exist, a lot of the code has changed. In particular, we had some bugs with the parent-child communication channel recently, and that code is now gone completely :-)
There are a few other advantages to the new version:
First, it loads opam files lazily, so start up is much faster. Running the original stress test (solving 3 packages in parallel) gives:
That should help with development, and also if we have multiple clients wanting different opam-repository commits (this can be optimised further; we only cache the most recently used commit for now).
Also, the code is much simpler now.
wc -l service/* worker/*
went from 2261 lines before to 1152 lines at present (I only checked these because the tests have changed).But the main reason is because we want an OCaml 5 service in production to shake out remaining bugs in OCaml, Eio, etc.
As well as getting the stress test to do more solves, and check that they succeed, I added
test/test.ml
with some unit-tests.Notes:
In the steady state it is a bit slower. Doing another 10 rounds of solves with the same three packages, I got 1.19s/iter before and 1.26s/iter afterwards. This is likely due to more GC work from using a single process. Might be able to tune things a bit to improve this.
Spawning processes (e.g. to run
git
oropam
) results in SIGCHLD signals, which can cause Lwt to crash (see https://github.com/ocsigen/lwt/issues/981 and https://github.com/ocsigen/lwt/issues/994), even if Lwt wasn't the one that spawned the process. I've made a PR on Lwt to fix these issues and I've pinned in in the solver's opam file.OCaml 5.1 doesn't have PPC support (but it will be in 5.2), so the deployment will need updating to avoid that.
This version still has some Lwt code (e.g. git_unix), via the lwt_eio bridge. I think this is fine; we want to test the bridge too! We can port more of it to Eio later (in particular, porting git-unix will be useful to complete Eio's FS API).