Closed: stephen-soltesz closed this issue 7 years ago.
To check whether there are other error states, I also looked at mlab1.yyz01 and collected a better gdb trace. Unfortunately, I don't understand why this differs between akl and yyz. The trace is consistent across the three PIDs checked on yyz.
```
Missing separate debuginfos, use: debuginfo-install iupui_ndt-4.0.0.1-201608111719.mlab.i686
(gdb) where
#0 0x00917424 in __kernel_vsyscall ()
#1 0x00fc65fb in read () from /lib/libpthread.so.0
#2 0x00665fe8 in ?? () from /usr/lib/libcrypto.so.10
#3 0x00663fde in BIO_read () from /usr/lib/libcrypto.so.10
#4 0x00135e7a in ssl3_read_n () from /usr/lib/libssl.so.10
#5 0x0013664d in ssl3_read_bytes () from /usr/lib/libssl.so.10
#6 0x00131874 in ?? () from /usr/lib/libssl.so.10
#7 0x0014e41f in SSL_read () from /usr/lib/libssl.so.10
#8 0x08059b0d in readn_ssl ()
#9 0x08059c31 in readn_any ()
#10 0x08067c1a in recv_websocket_msg ()
#11 0x08067faf in recv_websocket_ndt_msg ()
#12 0x0805a144 in recv_any_msg ()
#13 0x08064bd9 in test_s2c ()
#14 0x0804c70a in run_test ()
#15 0x0804efd6 in child_process ()
#16 0x0804f413 in spawn_new_child ()
#17 0x0804f667 in NDT_server_main_loop ()
#18 0x08050900 in main ()
```
It appears some processes have multiple threads, each with an independent stack trace.
Here are traces for both threads on a hung process from syd01. Thread 1 looks like the trace above. Thread 2 is new information.
```
(gdb) info threads
  2 Thread 0xb7742b70 (LWP 3773) 0x00b5d424 in __kernel_vsyscall ()
* 1 Thread 0xb7743730 (LWP 3770) 0x00b5d424 in __kernel_vsyscall ()
(gdb) thread 2
[Switching to thread 2 (Thread 0xb7742b70 (LWP 3773))]#0 0x00b5d424 in __kernel_vsyscall ()
(gdb) bt
#0 0x00b5d424 in __kernel_vsyscall ()
#1 0x005bff81 in select () from /lib/libc.so.6
#2 0x0805bc40 in mysleep ()
#3 0x0805171e in cputimeWorker ()
#4 0x00bc4b39 in start_thread () from /lib/libpthread.so.0
#5 0x005c7c2e in clone () from /lib/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 0xb7743730 (LWP 3770))]#0 0x00b5d424 in __kernel_vsyscall ()
(gdb) bt
#0 0x00b5d424 in __kernel_vsyscall ()
#1 0x00bcb5fb in read () from /lib/libpthread.so.0
#2 0x001ddfe8 in ?? () from /usr/lib/libcrypto.so.10
#3 0x001dbfde in BIO_read () from /usr/lib/libcrypto.so.10
#4 0x00a05e7a in ssl3_read_n () from /usr/lib/libssl.so.10
#5 0x00a0664d in ssl3_read_bytes () from /usr/lib/libssl.so.10
#6 0x00a01874 in ?? () from /usr/lib/libssl.so.10
#7 0x00a1e41f in SSL_read () from /usr/lib/libssl.so.10
#8 0x08059b0d in readn_ssl ()
#9 0x08059c31 in readn_any ()
#10 0x08067c1a in recv_websocket_msg ()
#11 0x08067faf in recv_websocket_ndt_msg ()
#12 0x0805a144 in recv_any_msg ()
#13 0x08064bd9 in test_s2c ()
#14 0x0804c70a in run_test ()
#15 0x0804efd6 in child_process ()
#16 0x0804f413 in spawn_new_child ()
#17 0x0804f667 in NDT_server_main_loop ()
#18 0x08050900 in main ()
```
We believe this bug is caused by at least three factors:

1. a `read()` blocks on a half-open connection;
2. an alarm fires while the s2c handler is in place;
3. instead of exiting, the read is merely interrupted; it is (reasonably) retried, but in this case the retry ends up blocking indefinitely.
I agree with the scenario above, but it requires an additional condition: the use of SSL. Because the s2c test only writes, for the child process to get stuck in a read state we must also trigger an SSL protocol event during the writing process.
Fixing either condition 1 or condition 3 should fix the problem, but we are fixing both.
Verified.
In lieu of a version of `ps` in CentOS 6 that supports `-o etime`, in Dec 2015 we used this script, which reads from `/proc/<pid>/stat` directly, to check for long-running and CPU-hungry NDT children.

Recently, @nkinkade re-ran a modified version of this script that only checked for long-running children, i.e. removing `&& cpu_time > 600`. We found ~1000 processes across the fleet with elapsed time greater than a day. The individual times range from 1 to 44 days (the number of days since the last NDT deployment), and the distribution of times appears fairly uniform. There are 77 affected sites.
hung_ndtd_children.txt from Oct 6th. The columns are:

```
<server> <cputime_in_100Hz> <elapsed_time_in_100Hz> <PID> ... *junk* ...
```

Convert `elapsed_time_in_100Hz` to days with:

```
etime / 100 / (60 * 60 * 24)
```
Looking at the first few rows from mlab1.akl01, the ndt children appear to hold sockets in states `ESTABLISHED` or `LISTEN`. For example, from mlab1.akl01 (`netstat -ntlp | grep ndtd`), PID 6910 has a process elapsed time of over 15 days. The output from strace & gdb is basically the same for the other two processes (different fds).

According to `strace`, it is blocked in a `read`. Unfortunately, our build of NDT does not include debug symbols, so a more specific stack trace is unavailable: