leanprover / lean4

Lean 4 programming language and theorem prover
https://lean-lang.org
Apache License 2.0
4.51k stars 395 forks source link

Watchdog<->file worker deadlock on shutdown #5296

Open Kha opened 1 week ago

Kha commented 1 week ago

Observed a few times in the wild: watchdog is trying to shut down and waiting on all commTasks:

#0  0x00007fad46a9dc5e in __futex_abstimed_wait_common () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#1  0x00007fad46aa04c0 in pthread_cond_wait@@GLIBC_2.3.2 () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#2  0x00007fad4b9efccf in std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#3  0x00007fad4b9ccc9b in lean_task_get () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#4  0x00007fad4b9ea65c in lean_io_wait () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#5  0x00007fad4a3f1623 in l_Lean_RBNode_forIn_visit___at_Lean_Server_Watchdog_shutdown___spec__1 () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#6  0x00007fad4a3f1be7 in l_Lean_Server_Watchdog_shutdown () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so

At least one commTask is waiting for the child to exit, which it does after any error in the forwardMessages loop (unclear which error in this case, as std handles should still be open):

#0  0x00007fad46b06577 in wait4 () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#1  0x00007fad4b9ee552 in lean_io_process_child_wait () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#2  0x00007fad4a3b54dd in l___private_Lean_Server_Watchdog_0__Lean_Server_Watchdog_forwardMessages_loop () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so

Meanwhile the file worker itself is trying to exit, waiting on its std handles lock:

#0  0x00007f83afbbae4a in __lll_lock_wait_private () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#1  0x00007f83afbb8525 in _IO_flush_all () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#2  0x00007f83afbb88fa in _IO_cleanup () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#3  0x00007f83afb723f0 in __run_exit_handlers () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#4  0x00007f83afb7242e in exit () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#5  0x00007f83b4bea736 in lean_io_exit () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#6  0x00007f83b3695937 in lean_server_worker_main () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so

Which is held by a thread trying to write to stdout, which presumably is full:

#0  0x00007f83afc2d70f in write () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#1  0x00007f83afbb651d in _IO_file_write@@GLIBC_2.2.5 () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#2  0x00007f83afbb47e4 in new_do_write () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#3  0x00007f83afbb55e9 in __GI__IO_do_write () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#4  0x00007f83afbb5cd8 in __GI__IO_file_sync () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#5  0x00007f83afba9364 in fflush () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#6  0x00007f83b4be866c in lean_io_prim_handle_flush () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#7  0x00007f83b0e2499f in l_IO_FS_Stream_ofHandle___elambda__6___boxed () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#8  0x00007f83b4bd69f0 in lean_apply_1 () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
#9  0x00007f83b366996e in l_Lean_Server_FileWorker_initializeWorker_mkLspOutputChannel___lambda__1 () from /home/jmc/.elan/toolchains/leanprover--lean4---nightly-2024-09-06/lib/lean/libleanshared.so
Kha commented 1 week ago

Possible solution: the watchdog should close the file worker's stdout before invoking Child.wait.