Here is another result from running ps aux and then sample on one pid, from a different case on a different agent, in case it helps. Multiple processes can be seen due to different retries (it seems they don't get killed properly on timeout either):
ps aux | grep dune
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
user 51794 54.8 0.0 4369616 140 ?? R Wed12PM 714:10.99 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p extunix -j 7 @install
user 63680 49.1 0.0 4381524 140 ?? R Wed12PM 713:24.35 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen -j 7 @install
user 82840 49.0 0.0 4389020 128 ?? R Tue08PM 1571:55.95 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p base_quickcheck -j 7
user 70042 48.9 0.0 4413384 128 ?? R Tue08PM 1574:25.93 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p odoc -j 7
user 65598 48.9 0.0 4385204 140 ?? R Tue08PM 1576:08.01 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlgraph -j 7 @install
user 43954 48.9 0.0 4417832 128 ?? R Tue08PM 1575:22.95 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p merlin-lib -j 7
user 60861 48.8 0.0 4384232 140 ?? R Wed12PM 704:43.24 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p base_quickcheck -j 7
user 73751 48.7 0.0 4420104 416 s000 R+ 8:32AM 185:55.48 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 --promote-install-files=false @install
user 55107 40.5 0.0 4410496 424 ?? R Wed03PM 603:09.13 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user 27690 40.3 0.0 4414336 428 ?? R Wed03PM 612:21.65 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user 69702 40.2 0.0 4376748 140 ?? R Tue08PM 1570:39.99 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p extunix -j 7 @install
user 37138 40.0 0.0 4377368 384 ?? R Wed03PM 608:47.87 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen-www -j 7
user 63794 39.6 0.0 4377456 140 ?? R Wed12PM 713:28.93 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen-www -j 7
user 60278 39.5 0.0 4377096 132 ?? R Wed01PM 647:21.11 /Users/user/builds/macvm202/ahrefs/dune-trim/_opam/.opam-switch/build/dune.20230627.git/./_boot/dune.exe build dune.install --release --profile dune-bootstrap -j 7
user 79056 39.0 0.0 4383864 128 ?? R Tue08PM 1573:04.25 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ppx_deriving -j 7
user 42147 37.3 0.0 4418300 140 ?? R Tue08PM 1577:02.57 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 @install
user 40195 36.7 0.0 4381080 424 ?? R Wed03PM 608:59.14 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p devkit -j 7 @install
user 43079 36.6 0.0 4410240 384 ?? R 5:35PM 550:35.02 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user 83921 36.3 0.0 4377224 400 ?? R Wed03PM 592:26.90 /Users/user/builds/macvm202/ahrefs/dune-trim/_opam/.opam-switch/build/dune.20230627.git/./_boot/dune.exe build dune.install --release --profile dune-bootstrap -j 7
user 44929 35.5 0.0 4419452 140 ?? R Wed12PM 714:35.21 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 @install
user 73733 0.1 0.8 4419836 64652 s000 S+ 8:32AM 0:34.97 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 --promote-install-files=false @install
user 10302 0.0 0.0 4334020 736 s001 R+ 4:29PM 0:00.00 grep dune
Bisecting is probably your best bet. Before doing that you may want to test commit 886ff3b14adfb876c1b25c5b88c9637b59bd2c37 that caused some issues (on Windows, but you never know).
Ok, I am trying to bisect now with older commits; it will take a while, as the issue doesn't always happen.
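(For reference, the bisect setup might look something like this — a sketch, using the known-good and known-bad commits from the report:)
$ git bisect start
$ git bisect bad 71024ef2e27491753f8790b236eb6041d5947b5f
$ git bisect good 7bb6de7087e1a0a50e8cc9836d2a0e8bc200fc25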
Hm, I tested this commit and its parent, and both seemed to work fine. Then I noticed the commit is not actually on the main branch, is that right?
Yes, it is, but it is not the same commit because it was cherry-picked to the release 3.8 branch.
Are these builds hanging in watch mode?
They happen in regular, non-watch mode, mostly when installing many packages with opam (where dune is invoked a lot). I am not sure whether it can happen in watch mode; I haven't been able to reproduce it there, and I have not received any reports from other developers about that mode.
Could you run dtruss -p to see which syscalls are being invoked?
$ sudo dtruss -p 24983
dtrace: system integrity protection is on, some features will not be available
SYSCALL(args) = return
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
....
This continues forever (around 10-15 new lines per second).
I don't even understand how this is possible:
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0
From the manual:
select() returns the number of ready descriptors that are contained in the descriptor sets, or -1 if an error occurred. If the time limit expires, select() returns 0
How can the time limit expire if I set it to block indefinitely?
I also don't understand why they stay running forever. If select is constantly returning, it should eventually be shut off by the finalizer.
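(For reference, OCaml's Unix.select documents a negative timeout as an unbounded wait, which is what makes the traced calls above so puzzling. A minimal illustration, not dune code:)
(* With a negative timeout, Unix.select should block until one of the
   watched descriptors becomes ready; it should never report a timeout. *)
let () =
  let readable, _, _ = Unix.select [ Unix.stdin ] [] [] (-1.0) in
  Printf.printf "%d descriptor(s) ready\n" (List.length readable)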
Try this patch, see if it makes a difference:
diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml
index 501a5d2f5..c8f465b70 100644
--- a/src/dune_async_io/async_io.ml
+++ b/src/dune_async_io/async_io.ml
@@ -242,13 +242,7 @@ let with_io scheduler f =
         ~f:(fun () -> select_loop t)
         ~finally:(fun () -> Mutex.unlock t.mutex))
   in
-  Fiber.Var.set t_var t (fun () ->
-      Fiber.finalize f ~finally:(fun () ->
-          Mutex.lock t.mutex;
-          t.running <- false;
-          interrupt t;
-          Mutex.unlock t.mutex;
-          Fiber.return ()))
+  Fiber.Var.set t_var t (fun () -> f)
 
 let with_ f =
   let+ t = Fiber.Var.get_exn t_var in
I can confirm that I've been able to run multiple successful installations of packages (1000+ packages are installed each time, most of them using dune) without issues by pinning the version of Dune to e1f3e19cb309ffebc3a7517cd8991f4582ead896 (the commit before 12a0268165d198f8cff12191fe895e1f28d4a109).
So, while we can't confirm that 12a0268165d198f8cff12191fe895e1f28d4a109 is the offender, I am fairly confident that something changed between that commit and 71024ef2e27491753f8790b236eb6041d5947b5f that caused the regression.
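(The pinning can be done along these lines — a sketch, not the exact command used:)
$ opam pin add dune.dev git+https://github.com/ocaml/dune#e1f3e19cb309ffebc3a7517cd8991f4582ead896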
Try this patch, see if it makes a difference:
I will try this next.
Maybe related? https://beesbuzz.biz/code/5739-The-problem-with-select-vs-poll
Fwiw, we set ulimit -n to a very high value (65536) on the CI agents where we are seeing the problems.
OCaml's select will error if we go over the fd limit.
Also, dune doesn't use that many fds concurrently anyway, so it's unlikely we'll ever run into the issues in that article.
Finally, if that was the problem, it wouldn't be macOS-specific.
Another piece of advice would be to print all the fds returned by select when it's looping. You can print them as integers after doing Obj.magic.
You mean something like this?
diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml
index bbfa01011..a7d6aa4c0 100644
--- a/src/dune_async_io/async_io.ml
+++ b/src/dune_async_io/async_io.ml
@@ -190,6 +190,10 @@ let rec select_loop t =
          [t.pipe_write] will interrupt this select *)
       Unix.select read write [] (-1.0)
     in
+    print_endline
+      (Printf.sprintf "Readers: %s\nWriters: %s"
+         (String.concat ~sep:", " (List.map ~f:Obj.magic readers))
+         (String.concat ~sep:", " (List.map ~f:Obj.magic writers)));
     assert (ex = []);
     (* Before we acquire the lock, it's possible that new tasks were added.
        This is fine. *)
I want to be sure I log the right things, as it takes me ~1h for each one of these experiments.
Also, dune doesn't use that many fds concurrently anyway, so it's unlikely we'll ever run into the issues in that article.
If I understand correctly, the problem with select is not only that you can't use it with many fds; it's also that you can't use it with fds whose numeric value is high (at or above FD_SETSIZE, typically 1024). So it can be an issue even if there's no fd leak and not many fds are open concurrently.
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
Another odd thing: there are two uses of Unix.select in Dune, and both have a non-empty set of read descriptors, which is inconsistent with these calls.
(it could be sleep implemented via empty select)
You mean something like this?
I want to be sure I log the right things, as it takes me ~1h for each one of these experiments.
Yes that's correct. I would add a similar log message before the select as well.
(it could be sleep implemented via empty select)
I think that's correct. We have a background thread to handle timers that polls at 10 Hz.
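(For illustration, a sleep implemented via an empty select looks like this — a minimal sketch, not dune's actual timer code:)
(* select with no descriptors acts as a portable sub-second sleep; in
   dtruss it shows up as select(0x0, 0x0, 0x0, 0x0, <timeout>) = 0. *)
let sleep_100ms () = ignore (Unix.select [] [] [] 0.1)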
That solves the mystery of why we're calling select so much.
Still no idea what's burning all the CPU, though.
@jchavarri can you try attaching lldb and giving us all the backtraces?
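(For reference, lldb can dump every thread's backtrace at once:)
(lldb) thread backtrace all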
There were two dune processes installing melange at the point when things got stuck, so I took info from both. If there are other commands I can run with lldb, please let me know; I essentially have no idea what I'm doing:
$ ps aux | grep dune
j 14097 99.1 0.0 34353444 748 s000 R+ 5:33PM 6:52.41 dune build -p melange -j 15 @install
j 30372 99.0 0.0 34258868 120 ?? R 10:33AM 364:30.91 dune build -p melange-json -j 15 @install
j 30102 0.0 0.0 34256820 4772 ?? S 10:33AM 0:05.25 dune build -p melange-json -j 15 @install
j 46493 0.0 0.0 34129560 588 s008 R+ 5:43PM 0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox dune
j 13495 0.0 0.0 34353444 5004 s000 S+ 5:33PM 0:01.94 dune build -p melange -j 15 @install
$ lldb -p 14097
(lldb) process attach --pid 14097
dune was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 14097 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x68)
frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
Target 0: (dune) stopped.
Executable module set to "/Users/j/Development/github/ahrefs/monorepo/_opam/bin/dune".
Architecture set to: x86_64h-apple-macosx-.
(lldb) frame info
frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x68)
* frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
frame #1: 0x00007ff81a5f002f libsystem_pthread.dylib`_pthread_atfork_child_handlers + 69
frame #2: 0x00007ff81a4df645 libsystem_c.dylib`fork + 84
frame #3: 0x0000000108619926 dune`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=17, v_use_vfork=1, v_setpgid=4440459752) at spawn_stubs.c:439:43 [opt]
frame #4: 0x0000000108535f20 dune`camlSpawn__spawn_inner_993 + 176
frame #5: 0x00000001084a982d dune`camlDune_engine__Process__fun_4561 + 1501
frame #6: 0x0000000108189c48 dune`caml_apply2 + 40
frame #7: 0x000000010852a465 dune`camlFiber__Scheduler__exec_815 + 53
frame #8: 0x000000010852b02f dune`camlFiber__Scheduler__fun_1357 + 79
frame #9: 0x00000001085253ba dune`camlFiber__loop_386 + 74
frame #10: 0x00000001084b1fbb dune`camlDune_engine__Scheduler__run_3164 + 171
frame #11: 0x00000001084b243d dune`camlDune_engine__Scheduler__run_and_cleanup_3346 + 13
frame #12: 0x00000001084b3e78 dune`camlDune_engine__Scheduler__fun_5457 + 408
frame #13: 0x00000001085934f8 dune`camlCmdliner_term__fun_633 + 104
frame #14: 0x000000010858e677 dune`camlCmdliner_eval__run_parser_537 + 39
frame #15: 0x000000010858f729 dune`camlCmdliner_eval__eval_value_inner_1711 + 601
frame #16: 0x000000010818a338 dune`camlMain__entry + 1592
frame #17: 0x0000000108180259 dune`caml_program + 11913
frame #18: 0x000000010864c7e1 dune`caml_start_program + 73
frame #19: 0x0000000108623ea2 dune`caml_startup_common(argv=0x00007ff7b7d833c8, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
frame #20: 0x0000000108623f1b dune`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
frame #21: 0x0000000108623f14 dune`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172 [opt]
frame #22: 0x0000000108623f14 dune`caml_main(argv=<unavailable>) at startup_nat.c:179 [opt]
frame #23: 0x0000000108623f8c dune`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
frame #24: 0x00007ff81a2ba310 dyld`start + 2432
$ lldb -p 13495
(lldb) process attach --pid 13495
Process 13495 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
libsystem_kernel.dylib`read:
-> 0x7ff81a5aefe6 <+10>: jae 0x7ff81a5aeff0 ; <+20>
0x7ff81a5aefe8 <+12>: movq %rax, %rdi
0x7ff81a5aefeb <+15>: jmp 0x7ff81a5af9ef ; cerror
0x7ff81a5aeff0 <+20>: retq
Target 0: (dune) stopped.
Executable module set to "/Users/j/Development/github/ahrefs/monorepo/_opam/bin/dune".
Architecture set to: x86_64h-apple-macosx-.
(lldb) frame info
frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
frame #1: 0x00000001086199a4 dune`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=17, v_use_vfork=1, v_setpgid=4440459752) at spawn_stubs.c:456:18 [opt]
frame #2: 0x0000000108535f20 dune`camlSpawn__spawn_inner_993 + 176
frame #3: 0x00000001084a982d dune`camlDune_engine__Process__fun_4561 + 1501
frame #4: 0x0000000108189c48 dune`caml_apply2 + 40
frame #5: 0x000000010852a465 dune`camlFiber__Scheduler__exec_815 + 53
frame #6: 0x000000010852b02f dune`camlFiber__Scheduler__fun_1357 + 79
frame #7: 0x00000001085253ba dune`camlFiber__loop_386 + 74
frame #8: 0x00000001084b1fbb dune`camlDune_engine__Scheduler__run_3164 + 171
frame #9: 0x00000001084b243d dune`camlDune_engine__Scheduler__run_and_cleanup_3346 + 13
frame #10: 0x00000001084b3e78 dune`camlDune_engine__Scheduler__fun_5457 + 408
frame #11: 0x00000001085934f8 dune`camlCmdliner_term__fun_633 + 104
frame #12: 0x000000010858e677 dune`camlCmdliner_eval__run_parser_537 + 39
frame #13: 0x000000010858f729 dune`camlCmdliner_eval__eval_value_inner_1711 + 601
frame #14: 0x000000010818a338 dune`camlMain__entry + 1592
frame #15: 0x0000000108180259 dune`caml_program + 11913
frame #16: 0x000000010864c7e1 dune`caml_start_program + 73
frame #17: 0x0000000108623ea2 dune`caml_startup_common(argv=0x00007ff7b7d833c8, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
frame #18: 0x0000000108623f1b dune`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
frame #19: 0x0000000108623f14 dune`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172 [opt]
frame #20: 0x0000000108623f14 dune`caml_main(argv=<unavailable>) at startup_nat.c:179 [opt]
frame #21: 0x0000000108623f8c dune`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
frame #22: 0x00007ff81a2ba310 dyld`start + 2432
The process got stuck with the patch suggested in https://github.com/ocaml/dune/issues/8083#issuecomment-1614560109, so that doesn't seem to fix it.
This is reproducible inside the dune codebase on macOS + OCaml 4.14:
$ make dev
$ git clone git@github.com:Gbury/dolmen.git
*edit the root dune file to include dolmen in the excluded dirs (see the sketch after these steps)*
$ cd dolmen
$ ../_build/install/default/bin/dune build -p dolmen
... hangs
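(The dune-file edit above might be expressed with dune's dirs stanza — a sketch; the actual root dune file may already contain a dirs stanza to extend:)
(dirs (:standard \ dolmen))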
Yes that's correct. I would add a similar log message before the select as well.
I added both of these log messages, and the second one (after the select call) never prints. It looks like the select call never returns.
Adding the print before the select call yields just one reader, with an fd of 5 (which, looking at the code, is t.pipe_read):
Readers: 5
Writers:
Next, I tried adding a 2 second wait to select (instead of the unbounded -1.0), which causes the recursive function to be called again, but Dune still doesn't make progress in the build.
You mean something like this? (the print patch above)
Almost. You need string_of_int after calling Obj.magic.
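(So the logging becomes something like this — a sketch against the same patch, keeping stdune's labelled List.map and String.concat:)
print_endline
  (Printf.sprintf "Readers: %s\nWriters: %s"
     (String.concat ~sep:", "
        (List.map readers ~f:(fun fd -> string_of_int (Obj.magic fd))))
     (String.concat ~sep:", "
        (List.map writers ~f:(fun fd -> string_of_int (Obj.magic fd)))));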
I just reverted #7418 and the hang is still there, so I'm ruling that out as the cause.
90f1d2428b89a8053beaa152e435a5b0c758880f is the first bad commit
commit 90f1d2428b89a8053beaa152e435a5b0c758880f
Author: Rudi Grinberg <me@rgrinberg.com>
Date: Tue Jun 13 10:56:28 2023 +0100
feature: enable async sandboxing/digests (#7947)
These seem to speed up builds in practice
Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
CHANGES.md | 3 +++
src/dune_config/config.ml | 4 ++--
test/blackbox-tests/test-cases/dune | 3 +++
3 files changed, 8 insertions(+), 2 deletions(-)
Seems like the culprit is #7947.
@jchavarri can you try your repro with 3.9 and setting the following environment variable?
DUNE_CONFIG__BACKGROUND_SANDBOXES=disabled
This seems to be the root cause.
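(e.g., with the dolmen repro from above — a sketch:)
$ DUNE_CONFIG__BACKGROUND_SANDBOXES=disabled ../_build/install/default/bin/dune build -p dolmen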
Setting this variable worked for me.
I don't think this was fixed.
Correct. There are mitigations in 3.9.1 but the right fix will come from #8090 probably.
I'm following from a distance, but is the issue understood? If yes, can you say a few words about it? Thanks!
Still not fixed :)
Wait, why did GitHub close the issue? I just synced the Jane Street opam repository, which is not even under the OCaml organization.
Maybe if you have rights on this repo it's enough to close it, I'm not sure.
Recently, we updated the Ahrefs codebase to the latest version of Dune (to be more specific, commit 71024ef2e27491753f8790b236eb6041d5947b5f) from the previous version we were using from back in April (7bb6de7087e1a0a50e8cc9836d2a0e8bc200fc25), which was working fine.
After upgrading, all our Linux CI agents work perfectly fine, but on the macOS agents we have noticed that dune processes stay running forever, with CPU usage over 90%.
Here is an example of a dune process running for 55+ minutes and with CPU at 100%:
Looking at different occurrences of the issue, I couldn't find any pattern in the packages where it appears. What I can say is that it happens on different versions of macOS. In particular, the one in the agent used in the command above is 12.3.1, but it also happens on 12.6.1 and 13.0.1.
I have tried to gather some information about what the hanging dune process is exactly doing. Calling sample 3759 with the process id gathered with ps aux above shows a lot of nested callbacks with camlDune_engine__Process__fun_4514; I'm not sure if this is expected. You can find the full output of that command here: sample-3759.txt
Is there anything I could do to help diagnose the problem?