Closed: sporksmith closed this issue 2 years ago.
Identified the package providing the blas library:

```
$ dpkg -S /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
libopenblas0-pthread:amd64: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
```

Downloaded it:

```
$ apt-get source libopenblas0-pthread
```
Found the definition of `blas_thread_init`. It starts threads in the function `blas_thread_server`. This looks like the loop where it's getting stuck: https://github.com/dagss/gotoblas2/blob/97e39982c701c1ce547af1d908672e00499a39bb/driver/others/blas_server.c#L260

The next step is probably to install debug symbols for this package and use gdb to step through this function, to help understand what's going on: https://wiki.ubuntu.com/DebuggingProgramCrash#Debug_Symbol_Packages
In an ubuntu shadow docker container built with `ci/run.sh`, we end up getting liblapack from package `liblapack3`, instead of `libopenblas0-pthread`.
```
# ldd build/src/main/shadow | grep lapack
    liblapack.so.3 => /lib/x86_64-linux-gnu/liblapack.so.3 (0x00007f8514820000)
# realpath /lib/x86_64-linux-gnu/liblapack.so.3
/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
# dpkg -S /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
liblapack3:amd64: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
```
Stopping this avenue here, but it seems likely that tgen also ends up getting built against this alternate liblapack, and that version doesn't use pthread worker threads and so doesn't have this issue.
There are a few implementations of `rpcc`, but the one in `common_x86.h` uses `rdtsc`, which explains the `rdtsc` emulation I see in the shim log of the deadlock. That should be ok, since we catch and emulate `rdtsc`. It might not map to real time in the expected way on a CPU whose clock rate shadow doesn't know how to determine, but afaict the library just uses it as an arbitrary time measure to "wait a bit".
It's probably still worth stepping through with gdb and symbols to verify my understanding and to check what timeout it's using, but ultimately, having `sched_yield` (and maybe every syscall) move time forward by some fixed amount seems like the right general solution.
So the timeout is measured in clock ticks, and is hard-coded to `1<<28`; i.e. on a 1 GHz CPU that's ~0.27s:

```c
#ifndef THREAD_TIMEOUT
#define THREAD_TIMEOUT 28
#endif

static unsigned int thread_timeout = (1U << (THREAD_TIMEOUT));
```
Verified this in gdb:

```
(gdb) print thread_timeout
$11 = 268435456
```
That's an awfully long time to spin. I was thinking something on the order of a microsecond would be a reasonable amount to move time forward on a `sched_yield`, but that would mean roughly 270,000 `sched_yield` calls before the loop timed out and blocked. Yikes.
Whether or not we support spin-waiting by moving the clock forward on syscalls, we probably want to avoid this particular spin in practice. This also raises the issue that transparently supporting spin-waiting in this way would potentially mask these problems, allowing shadow to eventually complete, but slowly, instead of deadlocking. In some ways deadlocking is better, since it makes it clearer there's a problem, and once that's fixed, the sim will run faster.
If we do move time forward on `sched_yield` and/or all syscalls, it probably ought to be optional, and maybe not on by default.
AFAICT the loop is waiting for the work queue to be initialized, and it looks like that only happens in `exec_blas_async`, i.e. when some work is dispatched to the library. So the loop spinning until it times out is probably pretty typical behavior. It wouldn't be as noticeable outside of shadow, since the `sched_yield` in the loop lets other things run if anything else is runnable. Even outside of shadow, though, this seems to unnecessarily burn a fair bit of CPU at startup if nothing else is ready to run.
https://github.com/xianyi/OpenBLAS/issues/2543 and https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded document some of the issues with threads and libopenblas in general, and note that the worker threads can be disabled by setting the environment variable `OPENBLAS_NUM_THREADS=1`.
I've confirmed that doing so in the `tgen` processes gets around the deadlock on my machine.
Whether or not we add the time-skip in `sched_yield`, this is probably a best practice when running under shadow.
I opened https://github.com/shadow/shadow/issues/1792 to track the broader issue of `sched_yield`-based waiting. I plan to document the openblas-specific issue and workarounds to close this issue.
On my local machine, tgen is deadlocking at startup in shadow. It starts several threads, which all appear to be in a loop calling `sched_yield` and `rdtsc`; maybe some kind of spin lock? Here's a stack trace from one of the `sched_yield` calls:

Experimentally, changing `sched_yield` to move time forward eventually lets tgen make progress, but it seems strange that tgen has this spin-wait. It appears to be at the start of some `liblapack` worker thread, but I'm not familiar with `liblapack`, and I don't see where tgen is invoking it.

If I attach gdb to tgen when it starts, and set a breakpoint on `clone`, it looks like the thread is created through some global initializer:

I suppose the next step is to look at the source of liblapack to find the thread's start function, which should contain the wait loop.

Btw, I'm not sure why this is only happening on my machine and not in the CI. I'm running Ubuntu 20.04, same as the tor test in the CI: