ocaml / ocaml-lsp

OCaml Language Server Protocol implementation
Other
766 stars 121 forks source link

[nvim] all RPC requests appear to be cancelled #1194

Open zbaylin opened 1 year ago

zbaylin commented 1 year ago

Yesterday (after upgrading to macOS 14), the LSP appeared to repeatedly freeze after some small period of time editing. Looking at the LSP logs from Neovim, it appears that every LSP request is responded to with a JSON RPC cancellation (-32800).

I saw #1182, but pinning merlin-lib doesn't seem to fix this issue unfortunately.

The logs above don't seem to provide a ton of information, so let me know if there's somewhere else I can look for more verbose logs.

zbaylin commented 1 year ago

Update: I didn't realize when I made this issue that version constraints in my local switch forced me to use 1.15.1-4.14. Upgrading to 1.16.2 seems to fix the issue, but not 100% sure.

zbaylin commented 1 year ago

Can confirm that this issue is still present on 1.16.2 -- it also looks like CPU usage for the process shoots up to about 100% when the cancellations start to occur.

voodoos commented 1 year ago

Thanks for you report @zbaylin

Is your project public ? Is there any way for us to reproduce the issue ?

zbaylin commented 1 year ago

Unfortunately not, let me see if I can make a repro repo.

zbaylin commented 1 year ago

@voodoos unfortunately I wasn't able to come up with an easily reproducible example. However, I did maybe get some more information? I attached lldb to a running "frozen" ocamllsp process. As soon as I did that, the following breakpoint was triggered (even though I hadn't set any):

(lldb) process attach --pid 38276
Process 38276 stopped
* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
    frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
libsystem_platform.dylib`:
->  0x188124220 <+52>: brk    #0x1
    0x188124224 <+56>: stp    x20, x21, [sp, #-0x10]!
    0x188124228 <+60>: adrp   x20, 0
    0x18812422c <+64>: add    x20, x20, #0x95d          ; "BUG IN CLIENT OF LIBPLATFORM: os_unfair_lock is corrupt"
Target 0: (ocamllsp) stopped.
Executable module set to "[...]/ocamllsp".
Architecture set to: arm64-apple-macosx-.

Continuing the process immediately hits the same breakpoint. Here's the backtrace from lldb:

* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
  * frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
    frame #1: 0x000000018811f788 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 332
    frame #2: 0x00000001880ee3f0 libsystem_pthread.dylib`pthread_mutex_destroy + 64
    frame #3: 0x0000000104585ccc ocamllsp`caml_thread_reinitialize [inlined] st_mutex_destroy(m=0x000060000199c040) at st_posix.h:228:8 [opt]
    frame #4: 0x0000000104585cc4 ocamllsp`caml_thread_reinitialize at st_stubs.c:425:7 [opt]
    frame #5: 0x00000001880f6f7c libsystem_pthread.dylib`_pthread_atfork_child_handlers + 76
    frame #6: 0x0000000187faeb90 libsystem_c.dylib`fork + 112
    frame #7: 0x0000000104584a0c ocamllsp`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=<unavailable>, v_use_vfork=<unavailable>, v_setpgid=1) at spawn_stubs.c:439:43 [opt]
    frame #8: 0x00000001045bb790 ocamllsp`caml_c_call + 28
    frame #9: 0x00000001043da178 ocamllsp`camlSpawn__spawn_inner_997 + 152
    frame #10: 0x0000000104133d24 ocamllsp`camlOcaml_lsp_server__run_in_directory_5864 + 500
    frame #11: 0x00000001041c6ed0 ocamllsp`camlMerlin_kernel__Pparse__apply_rewriter_786 + 296
    frame #12: 0x00000001041c73d0 ocamllsp`camlMerlin_kernel__Pparse__rewrite_957 + 144
    frame #13: 0x00000001041c744c ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_str_inner_1399 + 68
    frame #14: 0x00000001041c751c ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_1089 + 52
    frame #15: 0x00000001041dcd0c ocamllsp`camlMerlin_kernel__Mppx__fun_1203 + 60
    frame #16: 0x00000001041dcbd0 ocamllsp`camlMerlin_kernel__Mppx__code_begin + 136
    frame #17: 0x00000001041c7fc4 ocamllsp`camlMerlin_kernel__Phase_cache__apply_inner_528 + 852
    frame #18: 0x00000001041de448 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1778 + 152
    frame #19: 0x00000001043c4578 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
    frame #20: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #21: 0x00000001041dcfd4 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
    frame #22: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #23: 0x00000001041de4ec ocamllsp`camlMerlin_kernel__Mpipeline__fun_1791 + 60
    frame #24: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #25: 0x00000001041dcfd4 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
    frame #26: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #27: 0x00000001041a3470 ocamllsp`camlQuery_commands__dispatch_2992 + 7320
    frame #28: 0x00000001043bf4cc ocamllsp`camlMerlin_utils__Std__let_ref_3289 + 60
    frame #29: 0x00000001043c4578 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
    frame #30: 0x0000000104569818 ocamllsp`camlStdlib__Fun__protect_317 + 96
    frame #31: 0x00000001041d3480 ocamllsp`camlMerlin_kernel__Mocaml__with_state_438 + 80
    frame #32: 0x00000001040fa9b4 ocamllsp`camlOcaml_lsp_server__Document__fun_4120 + 116
    frame #33: 0x0000000104513508 ocamllsp`camlStdune__Exn_with_backtrace__try_with_422 + 40
    frame #34: 0x00000001043dc6e0 ocamllsp`camlLev_fiber__do_no_raise_1549 + 24
    frame #35: 0x00000001043e7820 ocamllsp`camlLev_fiber_util__Worker__run_322 + 72
    frame #36: 0x00000001043e8de4 ocamllsp`camlThread__fun_850 + 44
    frame #37: 0x00000001045bb800 ocamllsp`caml_start_program + 104
    frame #38: 0x00000001045b1770 ocamllsp`caml_callback_exn(closure=<unavailable>, arg=1) at callback.c:111:10 [opt]
    frame #39: 0x0000000104585f38 ocamllsp`caml_thread_start(arg=<unavailable>) at st_stubs.c:549:5 [opt]
    frame #40: 0x00000001880f3034 libsystem_pthread.dylib`_pthread_start + 136

(Sorry for the spam, but I hope that helps) It's unclear to me whether this is a bug in merlin or ocaml-lsp from the above, but it does seem like a deadlock is happening somewhere

voodoos commented 1 year ago

I am wondering whether this could be linked to the OS update. Could you try to reproduce on a freshly built switch ?

zbaylin commented 1 year ago

I did build a brand new switch after first encountering this issue but sadly it's still present

zbaylin commented 1 year ago

@voodoos I was able to somewhat fix this (I think) by removing this line: https://github.com/ocaml/ocaml-lsp/blob/672ca1bfbb816ff6b5f07a3f3abdecb66f982d8c/ocaml-lsp-server/src/ocaml_lsp_server.ml#L920

which does square with @gasche's suggestion in the above linked PR that this might be an issue with the spawn library. Is there a reason the merlin version is replaced by a custom implementation? I would imagine spawn would be quicker than using Sys.command but have no evidence to back that up.

voodoos commented 1 year ago

Interesting, so it might be an issue with the Spawn library, or the way we use it. Still you didn't notice anything wrong before the OS update ?

The hang happens when running ppxes. Could you check that this isn't specific to the ppx your using ? I guess buffers without ppxes shouldn't hang and buffers with any ppxes should ?

Also @rgrinberg may have more hindsight here ?

gasche commented 1 year ago

To summarize my reasoning on the upstream issue https://github.com/ocaml/ocaml/issues/12636 :

(Note: the issue may be fixed in OCaml 5 (rather than OCaml 4.14), because the implementation of threads is different and may interact better with forking in this case.)

zbaylin commented 1 year ago

Thanks @gasche for the summary! In the linked PR it seems Xavier Leroy (avoiding tagging) has confirmed that this is a bug in the 4.14 runtime

The mutexes that protect I/O channels are being destroyed in the child process after fork() even though they can be locked, which is an error according to the POSIX standard and seems to be turned into a fatal error by the pthreads library.

Having not used posix_spawn before I'm hesitant to suggest using it would be a temporary fix of the issue here, but from what I have read it does seem possible:

Since glibc 2.24, the posix_spawn() function commences by calling clone(2) with CLONE_VM and CLONE_VFORK flags. 1

If CLONE_VM is set, the calling process and the child process run in the same memory space. 2


I looked into adding a shim Spawn module last night that has the same (or at least very close) interface, but uses Unix.create_process_env under the hood. The only problem I faced is that Spawn.spawn takes a cwd argument that calls chdir before execing. Unix.create_process* doesn't support this, but maybe just using Unix.chdir before and after would work.

zbaylin commented 1 year ago

Coincidentally, looking at the dune 3.11 release notes, it looks like @rgrinberg patched dune's vendored version of spawn to use posix_spawn on macOS in ocaml/dune#8090 -- maybe it's possible that same vendored version could be used here/an ocaml-community fork of spawn could be made?

gasche commented 1 year ago

Regarding an ocaml-lsp workaround: I think that it would be easier to look at the usage context for spawn inside ocaml-lsp. The one instance I found (hopefully the only one?) is in fact calling /bin/sh -c 'cmd'. If we are assuming that bash is available, it should be easy to turn this into a Sys.command call that basically does /bin/sh -c (cd dir && cmd). (In general doing chdir in a portable way is tricky, but here the code is already assuming sh is available.)

gasche commented 1 year ago

(But yes it would be even easier to reuse other people's work to fix similar issues. It seems unfortunate that dune's improvement to the spawn library are currently only available to dune, rather than other users of the library.)

zbaylin commented 1 year ago

Unfortunately it looks like there are numerous uses of Spawn.spawn in the LSP server

That's why I thought it would be a good idea to stub out a compatibility module for Unix.create_process_env, but when I tried the cd trick I got another bug that I couldn't seem to root out so I was trying to think of alternatives.

If I understand correctly, Sys.command blocks the thread it runs under (since it returns the exit code rather than the PID), so it's not a drop-in replacement for the spawn function. Granted, many invocations (like the one you mentioned) immediately call waitpid after but there are some that don't (see the ocamlformat.ml example).

zbaylin commented 1 year ago

I created a branch (https://github.com/skolemlabs/ocaml-lsp/tree/use-dune-vendored-spawn) that uses the vendored version of spawn from dune as a temporary fix. Obviously it's not ideal to have two separate forks of the library but in case anyone else comes across this issue in the interim, running

opam pin add ocaml-lsp-server git@github.com:skolemlabs/ocaml-lsp.git#use-dune-vendored-spawn

seems to fix the issue for now.

voodoos commented 1 year ago

It's great to have a workaround, thanks @zbaylin ! It might be worth it to release a version of ocaml-lsp with that workaround since the issue probably impact all users that upgrade to Sonoma ?

I am still somewhat surprised that we don't have more report of this issue, I will upgrade to Sonoma and try to reproduce.

Do you have an Intel of M1/2 mac ?

voodoos commented 1 year ago

Just finished the upgrade, I wasn't able to reproduce so far.

zbaylin commented 1 year ago

It might be worth it to release a version of ocaml-lsp with that workaround since the issue probably impact all users that upgrade to Sonoma ?

Sure, that sounds like a good idea to me. I'm not a huge fan of vendoring the exact same forked library twice, so I would suggest again that (even if long term) it would be a good idea to fork spawn into ocaml-community or somewhere. That obviously presents some issues with OPAM releases, etc. but it would mean not being beholden to Jane Street's release schedule.

Do you have an Intel of M1/2 mac ?

I have 2021 MacBook Pro with an M1 Pro. Here's a screenshot of About this Mac with the serial number redacted.

Screenshot 2023-10-10 at 10 05 27 AM
voodoos commented 1 year ago

I also have a 14-inch 2021. I upgraded yesterday to Sonoma, and have no issue so far with ocaml-lsp 🤔.

frejsoya commented 1 year ago

I created a branch (https://github.com/skolemlabs/ocaml-lsp/tree/use-dune-vendored-spawn) that uses the vendored version of spawn from dune as a temporary fix. Obviously it's not ideal to have two separate forks of the library but in case anyone else comes across this issue in the interim, running

opam pin add ocaml-lsp-server git@github.com:skolemlabs/ocaml-lsp.git#use-dune-vendored-spawn

seems to fix the issue for now.

This fixed it for me as well. Had same issues with LSP, Neovim. Ocaml 5.1. osx Sonoma.

voodoos commented 1 year ago

@frejsoya is the project your're working on open source ? Is there a way for us to reproduce ?

frejsoya commented 1 year ago

@voodoos

No unfortunately not right know, i'll share something if it happens again. I hoped having one more hint at os x + Sonoma would help. Sorry for the "me too" :).

I'm still on @zbaylin changes. I'll see if i can get a stack trace using the current release during the week. That is if i can reproduce :)

voodoos commented 1 year ago

@frejsoya, @zbaylin, PR ocaml/ocaml#12646 backports some 5.0 changes to 4.14 that may solve the issue.

Would you try using this branch of OCaml and see if you are still able to reproduce ? The branch is here: https://github.com/xavierleroy/ocaml/tree/thread-atfork