m-lab / ndt

Network Diagnostic Tool

Lingering NDT child processes #70

Closed stephen-soltesz closed 7 years ago

stephen-soltesz commented 7 years ago

Because CentOS 6 lacks a version of ps that supports -o etime, in Dec 2015 we used this script, which reads /proc/<pid>/stat directly, to check for long-running and CPU-hungry NDT children.
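
The linked script is not reproduced here; as a rough approximation of what it checks, here is a standalone C sketch (my own, not the original script) that reads /proc/<pid>/stat (utime = field 14, stime = field 15, starttime = field 22, per proc(5)) and /proc/uptime, assuming the 100Hz tick rate used elsewhere in this issue:

/* Illustrative only: approximates the linked script's check by printing a
 * process's CPU time and elapsed time in clock ticks (assumed 100Hz). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
    FILE *fp = fopen(path, "r");
    if (fp == NULL || fgets(buf, sizeof(buf), fp) == NULL) {
        perror(path);
        return 1;
    }
    fclose(fp);

    /* Field 2 ("comm") may contain spaces, so parse from the last ')'. */
    char *p = strrchr(buf, ')');
    if (p == NULL)
        return 1;
    p += 2;  /* p now points at field 3 ("state"). */

    unsigned long long utime = 0, stime = 0, starttime = 0;
    int field = 3;
    for (char *tok = strtok(p, " "); tok != NULL; tok = strtok(NULL, " "), field++) {
        if (field == 14) utime = strtoull(tok, NULL, 10);      /* user CPU ticks */
        if (field == 15) stime = strtoull(tok, NULL, 10);      /* system CPU ticks */
        if (field == 22) starttime = strtoull(tok, NULL, 10);  /* start time, in ticks after boot */
    }

    double uptime = 0.0;
    fp = fopen("/proc/uptime", "r");
    if (fp == NULL || fscanf(fp, "%lf", &uptime) != 1) {
        perror("/proc/uptime");
        return 1;
    }
    fclose(fp);

    /* elapsed = system uptime in ticks minus process start time in ticks. */
    unsigned long long cpu_ticks = utime + stime;
    unsigned long long elapsed_ticks = (unsigned long long)(uptime * 100.0) - starttime;
    printf("pid=%s cpu_ticks=%llu elapsed_ticks=%llu\n", argv[1], cpu_ticks, elapsed_ticks);
    return 0;
}

The real script then flags processes whose tick counts exceed a threshold; the re-run described below looked for elapsed times greater than a day.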

Recently, @nkinkade re-ran a modified version of this script that checked only for long-running children, i.e. with the && cpu_time > 600 condition removed. We found ~1000 processes across the fleet with an elapsed time greater than a day.

The individual times range from 1 to 44 days (44 being the number of days since the last NDT deployment). The distribution of times appears fairly uniform. 77 sites are affected.

hung_ndtd_children.txt from Oct 6th. The columns are: <server> <cputime_in_100Hz> <elapsed_time_in_100Hz> <PID> ... *junk* ...

Convert elapsed_time_in_100Hz to days by: etime / 100 / (60 * 60 * 24)
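
For example, an etime of 8,640,000 ticks is 8,640,000 / 100 / 86,400 = 1 day, and the ~15-day process examined below (PID 6910) corresponds to roughly 1.3e8 ticks.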

Looking at the first few rows from mlab1.akl01, the ndt children appear to hold sockets in the ESTABLISHED or LISTEN states. For example, from mlab1.akl01 (netstat -ntlp | grep ndtd):

Active Internet connections 
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0      0 0.0.0.0:42950               0.0.0.0:*                   LISTEN      17852/ndtd          
tcp        0      0 0.0.0.0:50504               0.0.0.0:*                   LISTEN      12547/ndtd          
tcp        0      0 0.0.0.0:45395               0.0.0.0:*                   LISTEN      6910/ndtd  
...
tcp        0      0 163.7.129.11:3010           121.98.138.62:61517         ESTABLISHED 6910/ndtd           
tcp        0      0 163.7.129.11:3010           49.181.164.142:32268        ESTABLISHED 17852/ndtd          
tcp        0      0 163.7.129.11:45395          121.98.138.62:61525         FIN_WAIT2   6910/ndtd           
tcp        0      0 163.7.129.11:3010           172.195.207.250:49359       ESTABLISHED 12547/ndtd          

PID 6910 has a process elapsed time of over 15 days. The output from strace & gdb is basically the same for the other two processes (different fds).

According to strace, it is blocked in a read.

# strace -p 6910
Process 6910 attached - interrupt to quit
read(15, 

Unfortunately, our build of NDT does not include debug symbols, so a more specific stack trace is unavailable:

# gdb /home/iupui_ndt/build/sbin/ndtd --pid=6910
(gdb) where
#0  0x0056d424 in __kernel_vsyscall ()
#1  0x00c2f48b in __lll_robust_timedlock_wait () from /lib/libpthread.so.0
#2  0xbfba867c in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
stephen-soltesz commented 7 years ago

To check for other error states, I also looked at mlab1.yyz01 and got a better gdb trace there. Unfortunately, I don't understand why this differs between akl and yyz.

The trace is consistent across the three PIDs checked on yyz.

Missing separate debuginfos, use: debuginfo-install iupui_ndt-4.0.0.1-201608111719.mlab.i686
(gdb) where
#0  0x00917424 in __kernel_vsyscall ()
#1  0x00fc65fb in read () from /lib/libpthread.so.0
#2  0x00665fe8 in ?? () from /usr/lib/libcrypto.so.10
#3  0x00663fde in BIO_read () from /usr/lib/libcrypto.so.10
#4  0x00135e7a in ssl3_read_n () from /usr/lib/libssl.so.10
#5  0x0013664d in ssl3_read_bytes () from /usr/lib/libssl.so.10
#6  0x00131874 in ?? () from /usr/lib/libssl.so.10
#7  0x0014e41f in SSL_read () from /usr/lib/libssl.so.10
#8  0x08059b0d in readn_ssl ()
#9  0x08059c31 in readn_any ()
#10 0x08067c1a in recv_websocket_msg ()
#11 0x08067faf in recv_websocket_ndt_msg ()
#12 0x0805a144 in recv_any_msg ()
#13 0x08064bd9 in test_s2c ()
#14 0x0804c70a in run_test ()
#15 0x0804efd6 in child_process ()
#16 0x0804f413 in spawn_new_child ()
#17 0x0804f667 in NDT_server_main_loop ()
#18 0x08050900 in main ()
stephen-soltesz commented 7 years ago

It appears some processes have multiple threads, each with an independent stack trace.

Here are traces for both threads on a hung process from syd01. Thread 1 looks like the trace above. Thread 2 is new information.

(gdb) info threads
  2 Thread 0xb7742b70 (LWP 3773)  0x00b5d424 in __kernel_vsyscall ()
* 1 Thread 0xb7743730 (LWP 3770)  0x00b5d424 in __kernel_vsyscall ()
(gdb) thread 2
[Switching to thread 2 (Thread 0xb7742b70 (LWP 3773))]#0  0x00b5d424 in __kernel_vsyscall ()
(gdb) bt
#0  0x00b5d424 in __kernel_vsyscall ()
#1  0x005bff81 in select () from /lib/libc.so.6
#2  0x0805bc40 in mysleep ()
#3  0x0805171e in cputimeWorker ()
#4  0x00bc4b39 in start_thread () from /lib/libpthread.so.0
#5  0x005c7c2e in clone () from /lib/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 0xb7743730 (LWP 3770))]#0  0x00b5d424 in __kernel_vsyscall ()
(gdb) bt
#0  0x00b5d424 in __kernel_vsyscall ()
#1  0x00bcb5fb in read () from /lib/libpthread.so.0
#2  0x001ddfe8 in ?? () from /usr/lib/libcrypto.so.10
#3  0x001dbfde in BIO_read () from /usr/lib/libcrypto.so.10
#4  0x00a05e7a in ssl3_read_n () from /usr/lib/libssl.so.10
#5  0x00a0664d in ssl3_read_bytes () from /usr/lib/libssl.so.10
#6  0x00a01874 in ?? () from /usr/lib/libssl.so.10
#7  0x00a1e41f in SSL_read () from /usr/lib/libssl.so.10
#8  0x08059b0d in readn_ssl ()
#9  0x08059c31 in readn_any ()
#10 0x08067c1a in recv_websocket_msg ()
#11 0x08067faf in recv_websocket_ndt_msg ()
#12 0x0805a144 in recv_any_msg ()
#13 0x08064bd9 in test_s2c ()
#14 0x0804c70a in run_test ()
#15 0x0804efd6 in child_process ()
#16 0x0804f413 in spawn_new_child ()
#17 0x0804f667 in NDT_server_main_loop ()
#18 0x08050900 in main ()
stephen-soltesz commented 7 years ago

We believe this bug is caused by at least three factors:

  1. NDT sockets do not have IO timeouts set, so reads can block indefinitely (see the sketch after this list).
  2. "Half-open" TCP connections can remain in the ESTABLISHED state when a client cannot cleanly shut down the TCP connection (e.g., closing a laptop, or entering an elevator and losing signal).
  3. The s2c test resets the SIGALRM handler. The alarm acts as a watchdog: the default SIGALRM handler exits when an alarm fires, but the s2c handler does not exit. This second signal handler appears to be vestigial, so it can safely be removed.
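
As a minimal sketch of what addressing factor 1 might look like (illustrative only; the names conn_fd and READ_TIMEOUT_SEC are hypothetical, and the actual change is in the PR linked later in this issue):

/* Illustrative only: give the connection socket a receive timeout so a
 * blocking read() on a half-open connection eventually fails with
 * EAGAIN/EWOULDBLOCK instead of hanging forever. */
#include <sys/socket.h>
#include <sys/time.h>

#define READ_TIMEOUT_SEC 30  /* hypothetical value, not from the NDT source */

static int set_read_timeout(int conn_fd)
{
    struct timeval tv = { .tv_sec = READ_TIMEOUT_SEC, .tv_usec = 0 };
    return setsockopt(conn_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

With a timeout like this in place, the reads shown blocked in the traces above would return an error after READ_TIMEOUT_SEC seconds, and the child could clean up and exit rather than lingering for weeks.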

In the combined scenario, a read blocks on a half-open connection and an alarm fires while the s2c handler is in place; instead of the process exiting, the read is merely interrupted, is (reasonably) retried, and, in this case, ends up blocking indefinitely.
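
To make that interaction concrete, here is a minimal illustration (my own, not the NDT code) of why a SIGALRM handler that merely returns defeats the watchdog: the blocking read() is at most interrupted, the caller retries on EINTR, and the process blocks again.

/* Illustrative only: with the exiting handler installed, the alarm kills the
 * stuck child; with the no-op handler (like the vestigial s2c handler), the
 * interrupted read() is simply retried and blocks forever. */
#include <errno.h>
#include <signal.h>
#include <unistd.h>

static void watchdog_exit(int sig)  { (void)sig; _exit(1); }  /* default watchdog behavior */
static void s2c_style_noop(int sig) { (void)sig; }            /* returns; nothing exits */

static ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);          /* blocks on a half-open connection */
    } while (n == -1 && errno == EINTR); /* an alarm only interrupts; we retry */
    return n;
}

In the buggy path the no-op handler is installed, the alarm fires, read() is interrupted (or transparently restarted, depending on SA_RESTART), the loop re-enters read(), and the child never exits.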

pboothe commented 7 years ago

I agree with the above scenario, but it needs an additional condition: the use of SSL. Because the s2c test solely writes, for the child process to get stuck in a read we also need to trigger an SSL protocol event during the writing process.
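
One generic way an SSL-level read can arise during a logically write-only phase is that SSL_write() reports SSL_ERROR_WANT_READ (e.g., renegotiation), after which the library must read from the peer before the write can make progress; whether that is the exact trigger here is my assumption, and the sketch below only shows the standard OpenSSL pattern, not the NDT code.

/* Illustrative only: even a writer must be prepared for SSL_ERROR_WANT_READ,
 * because the TLS layer may need to read from the peer before the write can
 * complete. On a half-open connection, that underlying read never returns. */
#include <openssl/ssl.h>

static int write_all(SSL *ssl, const char *buf, int len)
{
    while (len > 0) {
        int n = SSL_write(ssl, buf, len);
        if (n > 0) {
            buf += n;
            len -= n;
            continue;
        }
        switch (SSL_get_error(ssl, n)) {
        case SSL_ERROR_WANT_READ:   /* TLS layer needs data from the peer */
        case SSL_ERROR_WANT_WRITE:  /* retry the same SSL_write() call */
            continue;
        default:
            return -1;
        }
    }
    return 0;
}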

pboothe commented 7 years ago

Fixing either condition 1 or condition 3 should fix the problem, but we are fixing both.

pboothe commented 7 years ago

https://github.com/m-lab/ndt/pull/71

stephen-soltesz commented 7 years ago

Verified.