nasa / trick

Trick Simulation Environment. Trick provides a common set of simulation capabilities and utilities to build simulations automatically.
Other
26 stars 15 forks source link

Deadlock occurs between variable server creating threads and main thread shutdown #1656

Closed cameronbracco closed 4 months ago

cameronbracco commented 4 months ago

Overview

We recently upgraded to Trick version 19.7.1 from version 19.4.0 and we began running into a regular failure where the trick sim would fail to shut down properly and hang infinitely. We root caused this to a deadlock between the main thread and the VariableServerListenThread during shutdown. We believe this was introduced as part of the new thread management/shutdown behavior in https://github.com/nasa/trick/pull/1448.

Investigation

After running our trick sim with gdb, we discovered that the main thread was stuck in the shutdown loop while it was attempting to join all the SysThreads. More specifically, the main thread was stuck joining the VariableServerListenThread which is in turn handling a new connection request. At this point in time the main thread holds the lock for the list_mutex(), which controls access to all_sys_threads(). When the VariableServerListenThread creates a VariableServerSessionThread it attempts to add that object into all_sys_threads() , which means it must acquire list_mutex() . This creates a deadlock where the main thread holds the list_mutex() and is waiting to join the VariableServerListenThread, which is in turn blocked on acquiring to lock for list_mutex().

We reached this conclusion by following stack traces. When the process deadlocks and we hit Ctrl + C in gdb, we can see the following stack trace:

Thread 1 "S_main_Linux_8." received signal SIGINT, Interrupt.
0x00007f0f085ce6cd in __pthread_timedjoin_ex () from target:/lib64/libpthread.so.0
(gdb) where
#0  0x00007f0f085ce6cd in __pthread_timedjoin_ex () from target:/lib64/libpthread.so.0
#1  0x0000000003c93447 in Trick::ThreadBase::join_thread (this=0x7a742b0 <trick_vs+176>) at ThreadBase.cpp:320
#2  0x0000000003c916fb in Trick::SysThread::ensureAllShutdown () at SysThread.cpp:65
#3  0x0000000003c34152 in Trick::Executive::shutdown (this=0x7a71f28 <trick_sys+168>) at Executive_shutdown.cpp:81
#4  0x0000000003d28695 in master (nargs=4, args=0x7fffea970f68) at master.cpp:47
#5  0x000000000077d957 in main (nargs=4, args=0x7fffea970f68) at build/S_source.cpp:1385

From here, we can see that the main thread is stuck in ensureAllShutdown() attempting to join threads. Our next line of inquiry was to figure out which thread was being joined. Jumping into frame 2 and printing the thread object shows us that it is the VariableServerListenThread:

(gdb) frame 2
#2  0x0000000003c916fb in Trick::SysThread::ensureAllShutdown () at SysThread.cpp:65
65              thread->join_thread();
(gdb) p *thread
$1 = {<Trick::ThreadBase> = {_vptr.ThreadBase = 0x57ea628 <vtable for Trick::VariableServerListenThread+16>, name = {
      static npos = 18446744073709551615, 
      _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
        _M_p = 0x7a742c8 <trick_vs+200> "VarServListen"}, _M_string_length = 13, {_M_local_buf = "VarServListen\000\000", 
        _M_allocated_capacity = 5509716973651648854}}, pthread_id = 139700228560640, pid = 88, rt_priority = 0, max_cpu = 16, 
    created = true, should_shutdown = true, cancellable = false, shutdown_mutex = {__data = {__lock = 0, __count = 0, 
        __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}, cpus = 0xd740000}, _restart_pause_mutex = {__data = {__lock = 0, 
      __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 39 times>, __align = 0}, _thread_should_pause = false, _thread_wakeup_cv = {__data = {{__wseq = 0, 
        __wseq32 = {__low = 0, __high = 0}}, {__g1_start = 0, __g1_start32 = {__low = 0, __high = 0}}, __g_refs = {0, 0}, 
      __g_size = {0, 0}, __g1_orig_size = 0, __wrefs = 0, __g_signals = {0, 0}}, __size = '\000' <repeats 47 times>, 
    __align = 0}, _thread_has_paused_cv = {__data = {{__wseq = 0, __wseq32 = {__low = 0, __high = 0}}, {__g1_start = 0, 
        __g1_start32 = {__low = 0, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 0, __wrefs = 0, 
      __g_signals = {0, 0}}, __size = '\000' <repeats 47 times>, __align = 0}, _thread_has_paused = false, 
  static shutdown_finished = false}

Now that we know the thread we can list the threads out and switch to the appropriate thread so we can see why that thread is not closing:

(gdb) info thread
  Id   Target Id                       Frame 
* 1    Thread 42.42 "S_main_Linux_8."  0x0000000003c916fb in Trick::SysThread::ensureAllShutdown () at SysThread.cpp:65
  2    Thread 42.88 "VarServListen"    0x00007f0f085d682d in __lll_lock_wait () from target:/lib64/libpthread.so.0
  3    Thread 42.72 "rCo27887####Dtb"  0x00007f0f085d3848 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from target:/lib64/libpthread.so.0
  4    Thread 42.73 "rCo27887####Evt"  0x00007f0f085d3848 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from target:/lib64/libpthread.so.0
  5    Thread 42.74 "rTr27887UDP4ITr"  0x00007f0f06131301 in poll () from target:/lib64/libc.so.6
  6    Thread 42.75 "rCo27887##00Rcv"  0x00007f0f085d6dfa in recvfrom () from target:/lib64/libpthread.so.0
  7    Thread 42.76 "rCo27887##01Rcv"  0x00007f0f0613d4cb in semop () from target:/lib64/libc.so.6
... (more threads)
...
(gdb) thread 2

Now that we've switched to the VarServListen thread we can again look at the stack trace:

(gdb) where
#0  0x00007f0f085d682d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007f0f085cfad9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x0000000003c9147f in Trick::SysThread::SysThread (this=0x7f0e78002b70, in_name=...) at SysThread.cpp:41
#3  0x0000000003ca5c16 in Trick::VariableServerSessionThread::VariableServerSessionThread (this=0x7f0e78002b70, 
    session=0x7f0e78002050) at VariableServerSessionThread.cpp:20
#4  0x0000000003ca5b9d in Trick::VariableServerSessionThread::VariableServerSessionThread (this=0x7f0e78002b70)
    at VariableServerSessionThread.cpp:17
#5  0x0000000003c9b289 in Trick::VariableServerListenThread::thread_body (this=0x7a742b0 <trick_vs+176>)
    at VariableServerListenThread.cpp:163
#6  0x0000000003c9369b in Trick::ThreadBase::thread_helper (context=0x7a742b0 <trick_vs+176>) at ThreadBase.cpp:375
#7  0x00007f0f085cd1ca in start_thread () from target:/lib64/libpthread.so.0
#8  0x00007f0f06046e73 in clone () from target:/lib64/libc.so.6

This stack trace shows us that the VariableServerListenThread has received a new connection and that it is in the process of constructing a VariableServerSessionThread , which in turn creates a SysThread, which is what attempts to acquire list_mutex() and add to all_sys_threads().

I suspect this is all happening due to our setup that regularly attempts connect a new variable server client if the previous connection dropped. In this case, the connection dropped because the sim was being shut down properly, but the attempt at re-connection deadlocked the shutdown process. I am able to reliably reproduce the issue with our sim setup, but I have not yet attempted to reproduce the issue on one of the demo sims. I would recommend that someone looking to reproduce send a large amount of variable server connection requests directly after requesting shutdown. 

drgardner43 commented 4 months ago

I was able to re-create this with the aircraft sim - GUI disabled, short duration, disabled realtime, and a standalone python subprocess that would hammer the variable server. Not 100% repeatable either, its about 1/3.

(gdb) thread apply all bt

Thread 2 (Thread 0x7f05fafbf640 (LWP 681091) "VarServListen"):
#0  futex_wait (private=0, expected=2, futex_word=0x564741086180 <Trick::SysThread::list_mutex()::list_mutex>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x564741086180 <Trick::SysThread::list_mutex()::list_mutex>, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f05fc4a8002 in lll_mutex_lock_optimized (mutex=0x564741086180 <Trick::SysThread::list_mutex()::list_mutex>) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x564741086180 <Trick::SysThread::list_mutex()::list_mutex>) at ./nptl/pthread_mutex_lock.c:93
#4  0x00005647409aa4ae in Trick::SysThread::SysThread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#5  0x00005647409ba98c in Trick::VariableServerSessionThread::VariableServerSessionThread(Trick::VariableServerSession*) ()
#6  0x00005647409ba8fa in Trick::VariableServerSessionThread::VariableServerSessionThread() ()
#7  0x00005647409b8693 in Trick::VariableServerListenThread::thread_body() ()
#8  0x00005647409ace98 in Trick::ThreadBase::thread_helper(void*) ()
#9  0x00007f05fc4a4ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#10 0x00007f05fc536850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f05fc3be740 (LWP 681089) "S_main_Linux_11"):
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=681091, futex_word=0x7f05fafbf910) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=681091, futex_word=0x7f05fafbf910) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f05fafbf910, expected=681091, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#3  0x00007f05fc4a6624 in __pthread_clockjoin_ex (threadid=139663662380608, thread_return=0x0, clockid=0, abstime=0x0, block=<optimized out>) at ./nptl/pthread_join_common.c:105
#4  0x00005647409acbec in Trick::ThreadBase::join_thread() ()
#5  0x00005647409aa771 in Trick::SysThread::ensureAllShutdown() ()
#6  0x0000564740940973 in Trick::Executive::shutdown() ()
#7  0x0000564740a4c8b2 in master(int, char**) ()
#8  0x000056474086157b in main ()
(gdb) 
keithvetter commented 4 months ago

I ran into this as well, here is a report I gave:

I have a koviz variable server client connect to a sim's variable server. If I shutdown the sim from the sim control panel, then exit the sim control panel, the panel closes, but the S_main.exe remains running, and I have to manually kill it. I broke into the S_main.exe process and see:

Main Thread:

0 in __pthread_timedjoin_ex () from /lib64/libpthread.so.0

1 in Trick::ThreadBase::join_thread at ThreadBase.cpp:320

2 in Trick::SysThread::ensureAllShutdown () at SysThread.cpp:65

3 Trick::Executive::shutdown at Executive_shutdown.cpp:81

4 in master at master.cpp:47

Child Thread:

0 in __lll_lock_wait () from /lib64/libpthread.so.0

1 in pthread_mutex_lock () from /lib64/libpthread.so.0

2 in Trick::SysThread::SysThread ("VarServer6") at SysThread.cpp:41

3 in Trick::VariableServerSessionThread::VariableServerSessionThread at VariableServerSessionThread.cpp:20

If I shutdown the koviz vs client before I shutdown the sim, the sim gracefully shuts down.

If one wants to recreate the issue using koviz, let me know. I can get it to hang almost every time. The description of hanging on reconnect makes sense. Koviz tries to reconnect after the sim drops. The theory that the hang occurs due to a reconnect during shutdown sounds correct.

sharmeye commented 4 months ago

We'll begin looking into this one directly, thanks for the thorough reporting.

sharmeye commented 4 months ago

Thank you all for the very thorough and useful bug report. We are working on a fix now and it will be released with the next minor Trick release. At that time, please let us know if you still experience the same issue.