mbacarella closed this issue 4 years ago
The "wild speculation" part is probably not a lead. I tried a toy program that attempts to have the same thread acquire the master lock (leave blocking section) twice and it hangs trying to lock the second time.
This may be due to an OS bug. More discussion here: https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime/6542/5
This may actually be a still unpatched bug in glibc.
That's some impressive detective work! Thanks mbac!
Looks like there's plenty of action to be taken related to this, but luckily not within async-related repos. Folks have been notified of the issue internally, so hopefully we can avoid this glibc version.
Thanks for reporting this, and for your persistence in tracking down the ultimate cause.
Thanks!
@seliopou To be clear, the way to avoid it is to not upgrade (or to patch your glibc yourself). The bug was introduced into glibc 2.27 about 4 years ago and persists to this day.
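If it helps with auditing machines, here is a small sketch that reports whether the running glibc is 2.27 or later, using glibc's `gnu_get_libc_version()`. The helper names (`version_at_least`, `running_glibc_is_2_27_or_later`) are made up for illustration, and the check only looks at the version string: it cannot tell whether a distribution has patched its build.

```c
#include <assert.h>
#include <gnu/libc-version.h>  /* glibc-specific header */
#include <stdio.h>

/* Hypothetical helper: compare a "MAJOR.MINOR[...]" version string
   against major.minor. Returns 1 if version >= major.minor, 0 if it is
   lower, and -1 if the string cannot be parsed. */
static int version_at_least(const char *version, int major, int minor) {
  int have_major = 0, have_minor = 0;

  if (sscanf(version, "%d.%d", &have_major, &have_minor) != 2)
    return -1;
  if (have_major != major)
    return have_major > major;
  return have_minor >= minor;
}

/* Returns 1 if the glibc this process is running against reports
   version 2.27 or later, 0 if earlier, -1 if unparsable. */
static int running_glibc_is_2_27_or_later(void) {
  return version_at_least(gnu_get_libc_version(), 2, 27);
}
```

For reference, stock Ubuntu 16.04 ships glibc 2.23 and Ubuntu 18.04 ships glibc 2.27, which lines up with the deadlock appearing only on 18.04.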
I'm running into what seems like a deadlock in Async somewhere.
Begin big copy/paste of what happens when you attach `gdb` to the deadlocked process and run `info threads` and `bt` all of the threads.

This is compiled against jane v0.13.0, OCaml 4.08 flambda.
It's in a long-running process and the deadlock happens pretty unpredictably so I'm a bit slow to iterate the debugging.
The most interesting part is the program is long-running but watches stdin for input using a `Reader.read_line` and when the deadlock happens, the only thread that's not blocked on waiting for the OCaml runtime lock is that `Reader.read_line`. Instead, it's blocked in libc `read`, and hitting Enter actually resumes everything.

Working backwards, that libc `read` appears to be `Bigstring.read`, suggesting it's both blocked in a libc `read` and holding the runtime lock. Though, looking at the code, it's not obvious how it could actually acquire the lock but still block in read. This code seems pretty tight:

As said above, hitting Enter will actually resume all of the threads that are blocked waiting on the OCaml runtime/masterlock waits. Confirmed it by looking at the call stacks again in gdb.
Speculating wildly: You could imagine contriving a situation where the inner while loop exits due to some arcane errno condition, the blocking section is left (acquiring the runtime lock), but the raise somehow doesn't terminate the C function, which instead continues to the next iteration of the outer loop. Then the read is re-entered while holding the runtime lock, and the only way out is to press Enter (assuming the same thread can call leave blocking section twice). Though that would be really contrived.
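To make the loop structure in that speculation concrete, here is a rough model in plain C. It is not the actual `Bigstring.read` source: `model_read`, `model_enter_blocking_section`, `model_leave_blocking_section` and the `model_lock_held` flag are hypothetical stand-ins, and an early `return` plays the role of raising an exception. The thing to look at is that every path out of the function reacquires the "lock" exactly once, and that the error path cannot fall back into the outer loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the runtime lock: 1 while this thread "holds" it. */
static int model_lock_held = 1;

static void model_enter_blocking_section(void) { model_lock_held = 0; } /* release */
static void model_leave_blocking_section(void) { model_lock_held = 1; } /* reacquire */

/* Read at least min_len bytes (at most len) from fd into buf.
   Returns the number of bytes read (possibly short on EOF),
   or -1 with errno set if read() fails. */
static ssize_t model_read(int fd, char *buf, size_t min_len, size_t len) {
  size_t total = 0;
  ssize_t n;

  assert(model_lock_held);  /* caller must hold the lock on entry */
  model_enter_blocking_section();
  do {
    /* Inner loop: retry only when a signal interrupted the call. */
    do {
      n = read(fd, buf + total, len - total);
    } while (n == -1 && errno == EINTR);

    if (n == -1) {
      /* Error path: reacquire, then leave the function for good.
         In a real stub this is where the exception would be raised. */
      model_leave_blocking_section();
      return -1;
    }
    if (n == 0)
      break;  /* EOF: stop with a short read */
    total += (size_t)n;
  } while (total < min_len);  /* outer loop: keep going until min_len is met */

  model_leave_blocking_section();
  return (ssize_t)total;
}
```

For the speculated failure to happen in code shaped like this, the error path would have to reacquire the lock and then continue the outer loop instead of returning, which is the "raise doesn't terminate the function" condition described above.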
I don't mean to overly implicate what must be a very well worn and tested `Bigstring.read`, just seems like the only lead so far.

Anyway, to make this even more strange, these deadlocks don't appear at all on Ubuntu 16.04 but do appear on Ubuntu 18.04.
Thanks for any insight you can offer.