ocaml / dune

A composable build system for OCaml.
https://dune.build/
MIT License
1.63k stars 401 forks source link

`dune` process hangs on macos #8083

Closed jchavarri closed 1 year ago

jchavarri commented 1 year ago

Recently, we updated Ahrefs codebase to the latest version of Dune (to be more specific, commit 71024ef2e27491753f8790b236eb6041d5947b5f) from the previous version we were using from back in April (7bb6de7087e1a0a50e8cc9836d2a0e8bc200fc25), which was working fine.

After upgrading, all our Linux CI agents work perfectly fine, but on the macOS agents we have noticed that dune processes stay running forever, with CPU usage over 90%.

Here is some example of a dune process running for 55+minutes and with CPU at 100%:

$ ps aux | grep dune
USER              PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
user              3759 100.0  0.0 34302304    448   ??  R     3:20PM  55:12.57 /Users/user/builds/macvm100/ahrefs/monorepo-extra/_opam/bin/dune build -p dune-private-libs -j 7 @install
user              3737   0.0  1.1 34302560  93844   ??  S     3:19PM   0:04.48 /Users/user/builds/macvm100/ahrefs/monorepo-extra/_opam/bin/dune build -p dune-private-libs -j 7 @install
user             36569   0.0  0.0 34122844    856 s000  S+    4:15PM   0:00.00 grep dune

Looking at different occurrences of the issue, I couldn't find any pattern, on the packages where it appears. What I can say is that it happens on different versions of macos. In particular, the one in the agent used in the command above is 12.3.1, but it also happens on 12.6.1 and 13.0.1.

I have tried to gather some information about what the hanging dune process is exactly doing. Calling sample 3759 with the process id gathered with ps aux above shows a lot of nested callbacks with camlDune_engine__Process__fun_4514, not sure if this is expected. You can find the full output of that command here:

sample-3759.txt

Is there anything I could do to help diagnose the problem?

jchavarri commented 1 year ago

Here is another result from running ps aux and then sample on one pid from a different case, in a different agent, in case it helps, multiple processes can be seen due to different retries (seems they don't get killed properly on timeout either):

ps aux | grep dune
USER              PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
user            51794  54.8  0.0  4369616    140   ??  R    Wed12PM 714:10.99 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p extunix -j 7 @install
user            63680  49.1  0.0  4381524    140   ??  R    Wed12PM 713:24.35 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen -j 7 @install
user            82840  49.0  0.0  4389020    128   ??  R    Tue08PM 1571:55.95 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p base_quickcheck -j 7
user            70042  48.9  0.0  4413384    128   ??  R    Tue08PM 1574:25.93 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p odoc -j 7
user            65598  48.9  0.0  4385204    140   ??  R    Tue08PM 1576:08.01 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlgraph -j 7 @install
user            43954  48.9  0.0  4417832    128   ??  R    Tue08PM 1575:22.95 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p merlin-lib -j 7
user            60861  48.8  0.0  4384232    140   ??  R    Wed12PM 704:43.24 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p base_quickcheck -j 7
user            73751  48.7  0.0  4420104    416 s000  R+    8:32AM 185:55.48 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 --promote-install-files=false @install
user            55107  40.5  0.0  4410496    424   ??  R    Wed03PM 603:09.13 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user            27690  40.3  0.0  4414336    428   ??  R    Wed03PM 612:21.65 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user            69702  40.2  0.0  4376748    140   ??  R    Tue08PM 1570:39.99 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p extunix -j 7 @install
user            37138  40.0  0.0  4377368    384   ??  R    Wed03PM 608:47.87 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen-www -j 7
user            63794  39.6  0.0  4377456    140   ??  R    Wed12PM 713:28.93 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen-www -j 7
user            60278  39.5  0.0  4377096    132   ??  R    Wed01PM 647:21.11 /Users/user/builds/macvm202/ahrefs/dune-trim/_opam/.opam-switch/build/dune.20230627.git/./_boot/dune.exe build dune.install --release --profile dune-bootstrap -j 7
user            79056  39.0  0.0  4383864    128   ??  R    Tue08PM 1573:04.25 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ppx_deriving -j 7
user            42147  37.3  0.0  4418300    140   ??  R    Tue08PM 1577:02.57 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 @install
user            40195  36.7  0.0  4381080    424   ??  R    Wed03PM 608:59.14 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p devkit -j 7 @install
user            43079  36.6  0.0  4410240    384   ??  R     5:35PM 550:35.02 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user            83921  36.3  0.0  4377224    400   ??  R    Wed03PM 592:26.90 /Users/user/builds/macvm202/ahrefs/dune-trim/_opam/.opam-switch/build/dune.20230627.git/./_boot/dune.exe build dune.install --release --profile dune-bootstrap -j 7
user            44929  35.5  0.0  4419452    140   ??  R    Wed12PM 714:35.21 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 @install
user            73733   0.1  0.8  4419836  64652 s000  S+    8:32AM   0:34.97 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 --promote-install-files=false @install
user            10302   0.0  0.0  4334020    736 s001  R+    4:29PM   0:00.00 grep dune

sample-27690.txt

nojb commented 1 year ago

Bisecting is probably your best bet. Before doing that you may want to test commit 886ff3b14adfb876c1b25c5b88c9637b59bd2c37 that caused some issues (on Windows, but you never know).

jchavarri commented 1 year ago

Ok, I am trying to bisect now with older commits, it will take a while as the issue doesn't happen always.

Before doing that you may want to test commit https://github.com/ocaml/dune/commit/886ff3b14adfb876c1b25c5b88c9637b59bd2c37 that caused some issues (on Windows, but you never know).

Hm, I tested this commit and its parent, and both seemed to work fine. Then I noticed the commit is not actually on the main branch, is that right?

nojb commented 1 year ago

Hm, I tested this commit and its parent, and both seemed to work fine. Then I noticed the commit is not actually on the main branch, is that right?

Yes, it is, but it is not the same commit because it was cherry-picked to the release 3.8 branch.

rgrinberg commented 1 year ago

Are these builds hanging in watch mode?

Could you run dtruss -p to see which syscalls are being invoked?

jchavarri commented 1 year ago

Are these builds hanging in watch mode?

They happen in regular, non-watch mode, mostly when installing many packages with opam (where dune is invoked a lot). I am not sure it might happen in watch mode, but I haven't been able to reproduce in watch mode, and I have not received any reports from other developers about that mode.

Could you run dtruss -p to see which syscalls are being invoked?

$ sudo dtruss -p 24983
dtrace: system integrity protection is on, some features will not be available

SYSCALL(args)        = return
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90)       = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90)       = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90)       = 0 0
....

continues forever (around 10-15 new lines / second).

rgrinberg commented 1 year ago

I don't even understand how this is possible:

select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0

From the manual:

select() returns the number of ready descriptors that are contained in the descriptor sets, or -1 if an error occurred. If the time limit expires, select() returns 0

How can the time limit expire if I set it to block indefinitely.

I also don't understand why they stay running forever. If select is constantly returning, it should eventually be shut off by the finalizer.

rgrinberg commented 1 year ago

Try this patch, see if it makes a difference:

diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml
index 501a5d2f5..c8f465b70 100644
--- a/src/dune_async_io/async_io.ml
+++ b/src/dune_async_io/async_io.ml
@@ -242,13 +242,7 @@ let with_io scheduler f =
           ~f:(fun () -> select_loop t)
           ~finally:(fun () -> Mutex.unlock t.mutex))
   in
-  Fiber.Var.set t_var t (fun () ->
-      Fiber.finalize f ~finally:(fun () ->
-          Mutex.lock t.mutex;
-          t.running <- false;
-          interrupt t;
-          Mutex.unlock t.mutex;
-          Fiber.return ()))
+  Fiber.Var.set t_var t (fun () -> f)

 let with_ f =
   let+ t = Fiber.Var.get_exn t_var in
jchavarri commented 1 year ago

I can confirm that I've been able to run multiple successful installations of packages (1000+ pkgs are installed each time, most of them using dune) without issues by pinning the version of Dune to e1f3e19cb309ffebc3a7517cd8991f4582ead896 (the commit before 12a0268165d198f8cff12191fe895e1f28d4a109).

So, while we can't confirm that 12a0268165d198f8cff12191fe895e1f28d4a109 is the offender, I think it's quite sure that something changed between that commit and 71024ef2e27491753f8790b236eb6041d5947b5f that caused the regression.

Try this patch, see if it makes a difference:

I will try this next.

jchavarri commented 1 year ago

Maybe related? https://beesbuzz.biz/code/5739-The-problem-with-select-vs-poll

Fwiw we set ulimit -n to a very high value (65536) on the CI agents where we are seeing the problems.

rgrinberg commented 1 year ago

OCaml's select will error if we go over the fd limit.

Also, dune doesn't use that many fd's concurrently anyway so it's unlikely we'll ever run in the issues in that article.

Finally, if that was the problem, it wouldn't be macos specific.

Another piece of advice would be to print all the fd's returned by select when it's looping. You can print them as integers after doing Obj.magic.

jchavarri commented 1 year ago

Another piece of advice would be to print all the fd's returned by select when it's looping. You can print them as integers after doing Obj.magic.

You mean something like this?

diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml
index bbfa01011..a7d6aa4c0 100644
--- a/src/dune_async_io/async_io.ml
+++ b/src/dune_async_io/async_io.ml
@@ -190,6 +190,10 @@ let rec select_loop t =
          [t.pipe_write] will interrupt this select *)
       Unix.select read write [] (-1.0)
     in
+    print_endline
+      (Printf.sprintf "Readers: %s\nWriters: %s"
+         (String.concat ~sep:", " (List.map ~f:Obj.magic readers))
+         (String.concat ~sep:", " (List.map ~f:Obj.magic writers)));
     assert (ex = []);
     (* Before we acquire the lock, it's possible that new tasks were added.
        This is fine. *)

I want to be sure I log the right things, as it takes me ~1h for each one of these experiments.

emillon commented 1 year ago

Also, dune doesn't use that many fd's concurrently anyway so it's unlikely we'll ever run in the issues in that article.

If I understand correctly, the problem with select is not only that you can't use it with many fds; it's also that you can't use it with fds that have a high value. So it can be an issue even if there's no fd leak or many concurrently open fds.

nojb commented 1 year ago

select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0 select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0 select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0

Another odd thing: there are two uses of Unix.select in Dune, and both have a non-empty set of read descriptors, which is inconsistent with these calls.

rr0gi commented 1 year ago

(it could be sleep implemented via empty select)

rgrinberg commented 1 year ago

You mean something like this?


diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml

index bbfa01011..a7d6aa4c0 100644

--- a/src/dune_async_io/async_io.ml

+++ b/src/dune_async_io/async_io.ml

@@ -190,6 +190,10 @@ let rec select_loop t =

          [t.pipe_write] will interrupt this select *)

       Unix.select read write [] (-1.0)

     in

+    print_endline

+      (Printf.sprintf "Readers: %s\nWriters: %s"

+         (String.concat ~sep:", " (List.map ~f:Obj.magic readers))

+         (String.concat ~sep:", " (List.map ~f:Obj.magic writers)));

     assert (ex = []);

     (* Before we acquire the lock, it's possible that new tasks were added.

        This is fine. *)

I want to be sure I log the right things, as it takes me ~1h for each one of these experiments.

Yes that's correct. I would add a similar log message before the select as well.

rgrinberg commented 1 year ago

(it could be sleep implemented via empty select)

I think that's correct. We have a background thread to handle timers that polls at 10hz.

Mystery solved why we're calling select so much.

Still no idea what's burning all the cpu though.

@jchavarri can you try attaching lldb and giving us all the backtraces

jchavarri commented 1 year ago

There were two dune processes installing melange at the point when things got stuck, so I took info from both. If there's some other commands I can run with lldb please lmk, i have no idea what i'm doing essentially:

$ ps aux | grep dune
j                14097  99.1  0.0 34353444    748 s000  R+    5:33PM   6:52.41 dune build -p melange -j 15 @install
j                30372  99.0  0.0 34258868    120   ??  R    10:33AM 364:30.91 dune build -p melange-json -j 15 @install
j                30102   0.0  0.0 34256820   4772   ??  S    10:33AM   0:05.25 dune build -p melange-json -j 15 @install
j                46493   0.0  0.0 34129560    588 s008  R+    5:43PM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox dune
j                13495   0.0  0.0 34353444   5004 s000  S+    5:33PM   0:01.94 dune build -p melange -j 15 @install

$ lldb -p 14097
(lldb) process attach --pid 14097
dune was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 14097 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x68)
    frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
Target 0: (dune) stopped.

Executable module set to "/Users/j/Development/github/ahrefs/monorepo/_opam/bin/dune".
Architecture set to: x86_64h-apple-macosx-.
(lldb) frame info
frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x68)
  * frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
    frame #1: 0x00007ff81a5f002f libsystem_pthread.dylib`_pthread_atfork_child_handlers + 69
    frame #2: 0x00007ff81a4df645 libsystem_c.dylib`fork + 84
    frame #3: 0x0000000108619926 dune`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=17, v_use_vfork=1, v_setpgid=4440459752) at spawn_stubs.c:439:43 [opt]
    frame #4: 0x0000000108535f20 dune`camlSpawn__spawn_inner_993 + 176
    frame #5: 0x00000001084a982d dune`camlDune_engine__Process__fun_4561 + 1501
    frame #6: 0x0000000108189c48 dune`caml_apply2 + 40
    frame #7: 0x000000010852a465 dune`camlFiber__Scheduler__exec_815 + 53
    frame #8: 0x000000010852b02f dune`camlFiber__Scheduler__fun_1357 + 79
    frame #9: 0x00000001085253ba dune`camlFiber__loop_386 + 74
    frame #10: 0x00000001084b1fbb dune`camlDune_engine__Scheduler__run_3164 + 171
    frame #11: 0x00000001084b243d dune`camlDune_engine__Scheduler__run_and_cleanup_3346 + 13
    frame #12: 0x00000001084b3e78 dune`camlDune_engine__Scheduler__fun_5457 + 408
    frame #13: 0x00000001085934f8 dune`camlCmdliner_term__fun_633 + 104
    frame #14: 0x000000010858e677 dune`camlCmdliner_eval__run_parser_537 + 39
    frame #15: 0x000000010858f729 dune`camlCmdliner_eval__eval_value_inner_1711 + 601
    frame #16: 0x000000010818a338 dune`camlMain__entry + 1592
    frame #17: 0x0000000108180259 dune`caml_program + 11913
    frame #18: 0x000000010864c7e1 dune`caml_start_program + 73
    frame #19: 0x0000000108623ea2 dune`caml_startup_common(argv=0x00007ff7b7d833c8, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
    frame #20: 0x0000000108623f1b dune`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
    frame #21: 0x0000000108623f14 dune`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172 [opt]
    frame #22: 0x0000000108623f14 dune`caml_main(argv=<unavailable>) at startup_nat.c:179 [opt]
    frame #23: 0x0000000108623f8c dune`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #24: 0x00007ff81a2ba310 dyld`start + 2432

$ lldb -p 13495
(lldb) process attach --pid 13495
Process 13495 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
libsystem_kernel.dylib`read:
->  0x7ff81a5aefe6 <+10>: jae    0x7ff81a5aeff0            ; <+20>
    0x7ff81a5aefe8 <+12>: movq   %rax, %rdi
    0x7ff81a5aefeb <+15>: jmp    0x7ff81a5af9ef            ; cerror
    0x7ff81a5aeff0 <+20>: retq
Target 0: (dune) stopped.

Executable module set to "/Users/j/Development/github/ahrefs/monorepo/_opam/bin/dune".
Architecture set to: x86_64h-apple-macosx-.
(lldb) frame info
frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
    frame #1: 0x00000001086199a4 dune`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=17, v_use_vfork=1, v_setpgid=4440459752) at spawn_stubs.c:456:18 [opt]
    frame #2: 0x0000000108535f20 dune`camlSpawn__spawn_inner_993 + 176
    frame #3: 0x00000001084a982d dune`camlDune_engine__Process__fun_4561 + 1501
    frame #4: 0x0000000108189c48 dune`caml_apply2 + 40
    frame #5: 0x000000010852a465 dune`camlFiber__Scheduler__exec_815 + 53
    frame #6: 0x000000010852b02f dune`camlFiber__Scheduler__fun_1357 + 79
    frame #7: 0x00000001085253ba dune`camlFiber__loop_386 + 74
    frame #8: 0x00000001084b1fbb dune`camlDune_engine__Scheduler__run_3164 + 171
    frame #9: 0x00000001084b243d dune`camlDune_engine__Scheduler__run_and_cleanup_3346 + 13
    frame #10: 0x00000001084b3e78 dune`camlDune_engine__Scheduler__fun_5457 + 408
    frame #11: 0x00000001085934f8 dune`camlCmdliner_term__fun_633 + 104
    frame #12: 0x000000010858e677 dune`camlCmdliner_eval__run_parser_537 + 39
    frame #13: 0x000000010858f729 dune`camlCmdliner_eval__eval_value_inner_1711 + 601
    frame #14: 0x000000010818a338 dune`camlMain__entry + 1592
    frame #15: 0x0000000108180259 dune`caml_program + 11913
    frame #16: 0x000000010864c7e1 dune`caml_start_program + 73
    frame #17: 0x0000000108623ea2 dune`caml_startup_common(argv=0x00007ff7b7d833c8, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
    frame #18: 0x0000000108623f1b dune`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
    frame #19: 0x0000000108623f14 dune`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172 [opt]
    frame #20: 0x0000000108623f14 dune`caml_main(argv=<unavailable>) at startup_nat.c:179 [opt]
    frame #21: 0x0000000108623f8c dune`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #22: 0x00007ff81a2ba310 dyld`start + 2432
jchavarri commented 1 year ago

the process got stuck with the patch suggested in https://github.com/ocaml/dune/issues/8083#issuecomment-1614560109, so that doesn't seem to fix it.

jchavarri commented 1 year ago

Here's the info with bt all:

lldb.txt

anmonteiro commented 1 year ago

This is reproducible inside the dune codebase on macOS + OCaml 4.14

$ make dev
$ git clone git@github.com:Gbury/dolmen.git

*edit dune-file to include dolmen in excluded dirs*

$ cd dolmen
$ ../_build/install/default/bin/dune build -p dolmen
... hangs
anmonteiro commented 1 year ago

Yes that's correct. I would add a similar log message before the select as well.

I added both these logs messages, and the 2nd one (after the select call) never prints. It looks like the select call never returns.

Adding the print before the select call yields just one reader with a fd of 5 (which, looking at the code, is t.pipe_read):

Readers: 5
Writers:

Next, I tried adding a 2 second (instead of -1.0, unbounded) wait to select, which causes the recursive function to be called again, but Dune still doesn't make progress in the build.

anmonteiro commented 1 year ago

Another piece of advice would be to print all the fd's returned by select when it's looping. You can print them as integers after doing Obj.magic.

You mean something like this?

diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml
index bbfa01011..a7d6aa4c0 100644
--- a/src/dune_async_io/async_io.ml
+++ b/src/dune_async_io/async_io.ml
@@ -190,6 +190,10 @@ let rec select_loop t =
          [t.pipe_write] will interrupt this select *)
       Unix.select read write [] (-1.0)
     in
+    print_endline
+      (Printf.sprintf "Readers: %s\nWriters: %s"
+         (String.concat ~sep:", " (List.map ~f:Obj.magic readers))
+         (String.concat ~sep:", " (List.map ~f:Obj.magic writers)));
     assert (ex = []);
     (* Before we acquire the lock, it's possible that new tasks were added.
        This is fine. *)

I want to be sure I log the right things, as it takes me ~1h for each one of these experiments.

Almost. you need string_of_int after calling Obj.magic.

anmonteiro commented 1 year ago

I just reverted #7418 and the hang is still there, so I'm ruling that out as the cause.

anmonteiro commented 1 year ago
90f1d2428b89a8053beaa152e435a5b0c758880f is the first bad commit
commit 90f1d2428b89a8053beaa152e435a5b0c758880f
Author: Rudi Grinberg <me@rgrinberg.com>
Date:   Tue Jun 13 10:56:28 2023 +0100

    feature: enable async sandboxing/digests (#7947)

    These seem to speed up builds in practice

    Signed-off-by: Rudi Grinberg <me@rgrinberg.com>

 CHANGES.md                          | 3 +++
 src/dune_config/config.ml           | 4 ++--
 test/blackbox-tests/test-cases/dune | 3 +++
 3 files changed, 8 insertions(+), 2 deletions(-)

seems like the culprit is #7947

anmonteiro commented 1 year ago

@jchavarri can you try your repro with 3.9 and setting the following environment variable?

DUNE_CONFIG__BACKGROUND_SANDBOXES=disabled

This seems to be the root cause.

rdavison commented 1 year ago

@jchavarri can you try your repro with 3.9 and setting the following environment variable?

DUNE_CONFIG__BACKGROUND_SANDBOXES=disabled

This seems to be the root cause.

Setting this variable worked for me.

anmonteiro commented 1 year ago

I don’t think this was fixed

emillon commented 1 year ago

Correct. There are mitigations in 3.9.1 but the right fix will come from #8090 probably.

nojb commented 1 year ago

Correct. There are mitigations in 3.9.1 but the right fix will come from #8090 probably.

I'm following from a distance, but is the issue understood? If yes, can you say a few words about it? Thanks!

anmonteiro commented 1 year ago

Still not fixed :)

dkalinichenko-js commented 1 year ago

Wait, why did GitHub close the issue? I just synced the Jane Street opam repository, which is not even under the OCaml organization.

emillon commented 1 year ago

Maybe if you have rights on this repo it's enough to close it, I'm not sure.