Open zbaylin opened 1 year ago
Update: I didn't realize when I made this issue that version constraints in my local switch forced me to use 1.15.1-4.14
. Upgrading to 1.16.2 seems to fix the issue, but not 100% sure.
Can confirm that this issue is still present on 1.16.2
-- it also looks like CPU usage for the process shoots up to about 100% when the cancellations start to occur.
Thanks for you report @zbaylin
Is your project public ? Is there any way for us to reproduce the issue ?
Unfortunately not, let me see if I can make a repro repo.
@voodoos unfortunately I wasn't able to come up with an easily reproducible example. However, I did maybe get some more information? I attached lldb
to a running "frozen" ocamllsp
process. As soon as I did that, the following breakpoint was triggered (even though I hadn't set any):
(lldb) process attach --pid 38276
Process 38276 stopped
* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
libsystem_platform.dylib`:
-> 0x188124220 <+52>: brk #0x1
0x188124224 <+56>: stp x20, x21, [sp, #-0x10]!
0x188124228 <+60>: adrp x20, 0
0x18812422c <+64>: add x20, x20, #0x95d ; "BUG IN CLIENT OF LIBPLATFORM: os_unfair_lock is corrupt"
Target 0: (ocamllsp) stopped.
Executable module set to "[...]/ocamllsp".
Architecture set to: arm64-apple-macosx-.
Continuing the process immediately hits the same breakpoint. Here's the backtrace from lldb:
* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
* frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
frame #1: 0x000000018811f788 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 332
frame #2: 0x00000001880ee3f0 libsystem_pthread.dylib`pthread_mutex_destroy + 64
frame #3: 0x0000000104585ccc ocamllsp`caml_thread_reinitialize [inlined] st_mutex_destroy(m=0x000060000199c040) at st_posix.h:228:8 [opt]
frame #4: 0x0000000104585cc4 ocamllsp`caml_thread_reinitialize at st_stubs.c:425:7 [opt]
frame #5: 0x00000001880f6f7c libsystem_pthread.dylib`_pthread_atfork_child_handlers + 76
frame #6: 0x0000000187faeb90 libsystem_c.dylib`fork + 112
frame #7: 0x0000000104584a0c ocamllsp`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=<unavailable>, v_use_vfork=<unavailable>, v_setpgid=1) at spawn_stubs.c:439:43 [opt]
frame #8: 0x00000001045bb790 ocamllsp`caml_c_call + 28
frame #9: 0x00000001043da178 ocamllsp`camlSpawn__spawn_inner_997 + 152
frame #10: 0x0000000104133d24 ocamllsp`camlOcaml_lsp_server__run_in_directory_5864 + 500
frame #11: 0x00000001041c6ed0 ocamllsp`camlMerlin_kernel__Pparse__apply_rewriter_786 + 296
frame #12: 0x00000001041c73d0 ocamllsp`camlMerlin_kernel__Pparse__rewrite_957 + 144
frame #13: 0x00000001041c744c ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_str_inner_1399 + 68
frame #14: 0x00000001041c751c ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_1089 + 52
frame #15: 0x00000001041dcd0c ocamllsp`camlMerlin_kernel__Mppx__fun_1203 + 60
frame #16: 0x00000001041dcbd0 ocamllsp`camlMerlin_kernel__Mppx__code_begin + 136
frame #17: 0x00000001041c7fc4 ocamllsp`camlMerlin_kernel__Phase_cache__apply_inner_528 + 852
frame #18: 0x00000001041de448 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1778 + 152
frame #19: 0x00000001043c4578 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
frame #20: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
frame #21: 0x00000001041dcfd4 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
frame #22: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
frame #23: 0x00000001041de4ec ocamllsp`camlMerlin_kernel__Mpipeline__fun_1791 + 60
frame #24: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
frame #25: 0x00000001041dcfd4 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
frame #26: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
frame #27: 0x00000001041a3470 ocamllsp`camlQuery_commands__dispatch_2992 + 7320
frame #28: 0x00000001043bf4cc ocamllsp`camlMerlin_utils__Std__let_ref_3289 + 60
frame #29: 0x00000001043c4578 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
frame #30: 0x0000000104569818 ocamllsp`camlStdlib__Fun__protect_317 + 96
frame #31: 0x00000001041d3480 ocamllsp`camlMerlin_kernel__Mocaml__with_state_438 + 80
frame #32: 0x00000001040fa9b4 ocamllsp`camlOcaml_lsp_server__Document__fun_4120 + 116
frame #33: 0x0000000104513508 ocamllsp`camlStdune__Exn_with_backtrace__try_with_422 + 40
frame #34: 0x00000001043dc6e0 ocamllsp`camlLev_fiber__do_no_raise_1549 + 24
frame #35: 0x00000001043e7820 ocamllsp`camlLev_fiber_util__Worker__run_322 + 72
frame #36: 0x00000001043e8de4 ocamllsp`camlThread__fun_850 + 44
frame #37: 0x00000001045bb800 ocamllsp`caml_start_program + 104
frame #38: 0x00000001045b1770 ocamllsp`caml_callback_exn(closure=<unavailable>, arg=1) at callback.c:111:10 [opt]
frame #39: 0x0000000104585f38 ocamllsp`caml_thread_start(arg=<unavailable>) at st_stubs.c:549:5 [opt]
frame #40: 0x00000001880f3034 libsystem_pthread.dylib`_pthread_start + 136
(Sorry for the spam, but I hope that helps) It's unclear to me whether this is a bug in merlin or ocaml-lsp from the above, but it does seem like a deadlock is happening somewhere
I am wondering whether this could be linked to the OS update. Could you try to reproduce on a freshly built switch ?
I did build a brand new switch after first encountering this issue but sadly it's still present
@voodoos I was able to somewhat fix this (I think) by removing this line: https://github.com/ocaml/ocaml-lsp/blob/672ca1bfbb816ff6b5f07a3f3abdecb66f982d8c/ocaml-lsp-server/src/ocaml_lsp_server.ml#L920
which does square with @gasche's suggestion in the above linked PR that this might be an issue with the spawn library. Is there a reason the merlin version is replaced by a custom implementation? I would imagine spawn
would be quicker than using Sys.command
but have no evidence to back that up.
Interesting, so it might be an issue with the Spawn library, or the way we use it. Still you didn't notice anything wrong before the OS update ?
The hang happens when running ppxes. Could you check that this isn't specific to the ppx your using ? I guess buffers without ppxes shouldn't hang and buffers with any ppxes should ?
Also @rgrinberg may have more hindsight here ?
To summarize my reasoning on the upstream issue https://github.com/ocaml/ocaml/issues/12636 :
fork()
.fork()
anymore, that is, by using another approach than janestreet/spawn to run what is basically a Sys.command
call in https://github.com/ocaml/ocaml-lsp/blob/672ca1bfbb816ff6b5f07a3f3abdecb66f982d8c/ocaml-lsp-server/src/ocaml_lsp_server.ml#L900 (Note: the issue may be fixed in OCaml 5 (rather than OCaml 4.14), because the implementation of threads is different and may interact better with forking in this case.)
Thanks @gasche for the summary! In the linked PR it seems Xavier Leroy (avoiding tagging) has confirmed that this is a bug in the 4.14 runtime
The mutexes that protect I/O channels are being destroyed in the child process after fork() even though they can be locked, which is an error according to the POSIX standard and seems to be turned into a fatal error by the pthreads library.
Having not used posix_spawn
before I'm hesitant to suggest using it would be a temporary fix of the issue here, but from what I have read it does seem possible:
Since glibc 2.24, the posix_spawn() function commences by calling clone(2) with
CLONE_VM
andCLONE_VFORK
flags. 1If
CLONE_VM
is set, the calling process and the child process run in the same memory space. 2
I looked into adding a shim Spawn
module last night that has the same (or at least very close) interface, but uses Unix.create_process_env
under the hood. The only problem I faced is that Spawn.spawn
takes a cwd
argument that calls chdir
before exec
ing. Unix.create_process*
doesn't support this, but maybe just using Unix.chdir
before and after would work.
Coincidentally, looking at the dune 3.11 release notes, it looks like @rgrinberg patched dune's vendored version of spawn
to use posix_spawn
on macOS in ocaml/dune#8090 -- maybe it's possible that same vendored version could be used here/an ocaml-community
fork of spawn
could be made?
Regarding an ocaml-lsp workaround: I think that it would be easier to look at the usage context for spawn
inside ocaml-lsp. The one instance I found (hopefully the only one?) is in fact calling /bin/sh -c 'cmd'
. If we are assuming that bash
is available, it should be easy to turn this into a Sys.command
call that basically does /bin/sh -c (cd dir && cmd)
. (In general doing chdir
in a portable way is tricky, but here the code is already assuming sh
is available.)
(But yes it would be even easier to reuse other people's work to fix similar issues. It seems unfortunate that dune's improvement to the spawn
library are currently only available to dune, rather than other users of the library.)
Unfortunately it looks like there are numerous uses of Spawn.spawn
in the LSP server
That's why I thought it would be a good idea to stub out a compatibility module for Unix.create_process_env
, but when I tried the cd
trick I got another bug that I couldn't seem to root out so I was trying to think of alternatives.
If I understand correctly, Sys.command
blocks the thread it runs under (since it returns the exit code rather than the PID), so it's not a drop-in replacement for the spawn
function. Granted, many invocations (like the one you mentioned) immediately call waitpid
after but there are some that don't (see the ocamlformat.ml
example).
I created a branch (https://github.com/skolemlabs/ocaml-lsp/tree/use-dune-vendored-spawn) that uses the vendored version of spawn
from dune as a temporary fix. Obviously it's not ideal to have two separate forks of the library but in case anyone else comes across this issue in the interim, running
opam pin add ocaml-lsp-server git@github.com:skolemlabs/ocaml-lsp.git#use-dune-vendored-spawn
seems to fix the issue for now.
It's great to have a workaround, thanks @zbaylin ! It might be worth it to release a version of ocaml-lsp with that workaround since the issue probably impact all users that upgrade to Sonoma ?
I am still somewhat surprised that we don't have more report of this issue, I will upgrade to Sonoma and try to reproduce.
Do you have an Intel of M1/2 mac ?
Just finished the upgrade, I wasn't able to reproduce so far.
It might be worth it to release a version of ocaml-lsp with that workaround since the issue probably impact all users that upgrade to Sonoma ?
Sure, that sounds like a good idea to me. I'm not a huge fan of vendoring the exact same forked library twice, so I would suggest again that (even if long term) it would be a good idea to fork spawn
into ocaml-community
or somewhere. That obviously presents some issues with OPAM releases, etc. but it would mean not being beholden to Jane Street's release schedule.
Do you have an Intel of M1/2 mac ?
I have 2021 MacBook Pro with an M1 Pro. Here's a screenshot of About this Mac with the serial number redacted.
I also have a 14-inch 2021. I upgraded yesterday to Sonoma, and have no issue so far with ocaml-lsp 🤔.
I created a branch (https://github.com/skolemlabs/ocaml-lsp/tree/use-dune-vendored-spawn) that uses the vendored version of
spawn
from dune as a temporary fix. Obviously it's not ideal to have two separate forks of the library but in case anyone else comes across this issue in the interim, runningopam pin add ocaml-lsp-server git@github.com:skolemlabs/ocaml-lsp.git#use-dune-vendored-spawn
seems to fix the issue for now.
This fixed it for me as well. Had same issues with LSP, Neovim. Ocaml 5.1. osx Sonoma.
@frejsoya is the project your're working on open source ? Is there a way for us to reproduce ?
@voodoos
No unfortunately not right know, i'll share something if it happens again. I hoped having one more hint at os x + Sonoma would help. Sorry for the "me too" :).
I'm still on @zbaylin changes. I'll see if i can get a stack trace using the current release during the week. That is if i can reproduce :)
@frejsoya, @zbaylin, PR ocaml/ocaml#12646 backports some 5.0 changes to 4.14 that may solve the issue.
Would you try using this branch of OCaml and see if you are still able to reproduce ? The branch is here: https://github.com/xavierleroy/ocaml/tree/thread-atfork
:LspInfo
:Yesterday (after upgrading to macOS 14), the LSP appeared to repeatedly freeze after some small period of time editing. Looking at the LSP logs from Neovim, it appears that every LSP request is responded to with a JSON RPC cancellation (-32800).
I saw #1182, but pinning merlin-lib doesn't seem to fix this issue unfortunately.
The logs above don't seem to provide a ton of information, so let me know if there's somewhere else I can look for more verbose logs.