
Shadow is a discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems with thousands of network-connected processes in realistic and scalable private network experiments using your laptop, desktop, or server running Linux.
https://shadow.github.io

Deadlock in tgen (liblapack) #1788

Closed: sporksmith closed this issue 2 years ago

sporksmith commented 2 years ago

On my local machine, tgen is deadlocking at startup in shadow. It starts several threads, which all appear to be in a loop calling sched_yield and rdtsc; maybe some kind of spin lock? Here's a stack trace from one of the sched_yield calls:

(gdb) bt
<snip>
#14 0x00007ffff7fb8d88 in sched_yield (a=140737336181520, b=0, c=0, d=0, e=0, f=140737104509472)
    at /home/jnewsome/projects/shadow/dev/src/lib/libc_preload/syscall_wrappers.c:786
#15 0x00007ffff56f0286 in ?? () from /lib/x86_64-linux-gnu/liblapack.so.3
#16 0x00007ffff7458609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007ffff79ac293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Experimentally, changing sched_yield to move time forward eventually lets tgen make progress, but it seems strange that tgen has this spin-wait. It appears to be at the start of some liblapack worker thread, but I'm not familiar with liblapack, and I don't see where tgen invokes it.

If I attach gdb to tgen when it starts, and set a breakpoint on clone, it looks like the thread is created through some global initializer:

(gdb) bt
#0  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:50
#1  0x00007ffff74602ec in create_thread (pd=pd@entry=0x7fffe9218700, attr=attr@entry=0x7fffffffea60, stopped_start=stopped_start@entry=0x7fffffffea5e, stackaddr=stackaddr@entry=0x7fffe91f9b40, thread_ran=thread_ran@entry=0x7fffffffea5f)
    at ../sysdeps/unix/sysv/linux/createthread.c:101
#2  0x00007ffff7461e10 in __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at pthread_create.c:817
#3  0x00007ffff56f959c in blas_thread_init () from /lib/x86_64-linux-gnu/liblapack.so.3
#4  0x00007ffff4fcd083 in gotoblas_init () from /lib/x86_64-linux-gnu/liblapack.so.3
#5  0x00007ffff7fe0b8a in ?? () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7fe0c91 in ?? () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fd013a in ?? () from /lib64/ld-linux-x86-64.so.2
#8  0x0000000000000002 in ?? ()
#9  0x00007fffffffed92 in ?? ()
#10 0x00007fffffffedb3 in ?? ()
#11 0x0000000000000000 in ?? ()

I suppose the next step is to look at the liblapack source to find the thread's start function, which should contain the wait loop.

Btw I'm not sure why this is only happening on my machine and not in the CI. I'm running Ubuntu 20.04, same as the tor test in the CI:

$ lsb_release --all
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:    20.04
Codename:   focal
sporksmith commented 2 years ago

Identified the package providing the blas library:

$ dpkg -S /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
libopenblas0-pthread:amd64: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

Downloaded it:

$ apt-get source libopenblas0-pthread

Found the definition of blas_thread_init. It starts threads in the function blas_thread_server. This looks like the loop where it's getting stuck: https://github.com/dagss/gotoblas2/blob/97e39982c701c1ce547af1d908672e00499a39bb/driver/others/blas_server.c#L260

The next step is probably to install debug symbols for this package and step through this function in gdb to understand what's going on. https://wiki.ubuntu.com/DebuggingProgramCrash#Debug_Symbol_Packages

sporksmith commented 2 years ago

In an ubuntu shadow docker container built with ci/run.sh, we end up getting liblapack from package liblapack3, instead of libopenblas0-pthread.

# ldd build/src/main/shadow | grep lapack
    liblapack.so.3 => /lib/x86_64-linux-gnu/liblapack.so.3 (0x00007f8514820000)
# realpath /lib/x86_64-linux-gnu/liblapack.so.3
/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
# dpkg -S /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
liblapack3:amd64: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

Stopping this avenue here, but it seems likely that tgen also ends up getting built against this alternate liblapack, and that version doesn't use pthread worker threads and so doesn't have this issue.

sporksmith commented 2 years ago

There are a few implementations of rpcc, but the one in common_x86.h uses rdtsc, which explains the rdtsc emulation I see in the shim log of the deadlock. That should be ok, since we catch and emulate rdtsc. It might not map to real time in the expected way on a CPU whose clock rate shadow doesn't know how to determine, but afaict liblapack just uses it as an arbitrary time measure to "wait a bit".

It's probably still worth stepping through this function in gdb with symbols to verify my understanding and check what timeout it's using, but ultimately, having sched_yield (and maybe every syscall) move time forward by some fixed amount seems like the right general solution.

sporksmith commented 2 years ago

So the timeout is measured in clock ticks and is hard-coded to 1<<28, i.e. ~0.27s on a 1 GHz CPU.

#ifndef THREAD_TIMEOUT
#define THREAD_TIMEOUT  28
#endif

static unsigned int thread_timeout = (1U << (THREAD_TIMEOUT));

Verified this in gdb:

(gdb) print thread_timeout
$11 = 268435456

That's an awfully long time to spin. I was thinking something on the order of a microsecond would be a reasonable amount to move time forward on a sched_yield, but that'd mean 270,000 sched_yield calls before the loop timed out and blocked. Yikes.

Whether or not we support spin-waiting by moving the clock forward on syscalls, we probably want to avoid this particular spin in practice. Transparently supporting spin-waiting this way can also mask problems like this one: shadow would eventually complete, but slowly, instead of deadlocking. In some ways deadlocking is better, since it makes it clearer there's a problem, and once fixed, the sim runs faster.

If we do move time forward on sched_yield and/or all syscalls, it probably ought to be optional, and maybe not on by default.

sporksmith commented 2 years ago

AFAICT the loop is waiting for the work queue to be initialized, and it looks like that only happens in exec_blas_async - when some work is dispatched to the library. So the loop spinning until it times out is probably pretty typical behavior. It wouldn't be as noticeable outside of shadow, since the sched_yield in the loop lets other things run if there's anything else to run. Even outside of Shadow though, this seems to unnecessarily burn a fair bit of CPU at startup if there's nothing else ready to run.

sporksmith commented 2 years ago

https://github.com/xianyi/OpenBLAS/issues/2543 and https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded document some of the issues with threads in libopenblas in general, and note that its threading can be disabled by setting the environment variable OPENBLAS_NUM_THREADS=1.

I've confirmed that doing so in the tgen processes gets around the deadlock on my machine.

Whether or not we add the time-skip in sched_yield, this is probably a best practice when running under Shadow.
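For reference, the workaround is just an environment variable set before the process starts (the tgen invocation below is hypothetical; in a shadow experiment the variable goes in the process's environment in the config):

```shell
# With a single thread, blas_thread_init never spawns the workers that
# spin in blas_thread_server, so there is nothing to deadlock.
export OPENBLAS_NUM_THREADS=1
# then launch tgen as usual, e.g. (hypothetical invocation):
# ./tgen client.tgenrc.graphml
```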

sporksmith commented 2 years ago

I opened https://github.com/shadow/shadow/issues/1792 to track the broader issue of sched_yield-based waiting. I plan to document the openblas-specific issue and workarounds to close this issue.