unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi
Other
3.46k stars 691 forks source link

uWSGI listen queue full: advice on figuring out what is happening? #2444

Closed dstndstn closed 2 years ago

dstndstn commented 2 years ago

Hi,

I'm running an Ubuntu 20.04 on an OpenStack VM, Intel broadwell CPU. Apache 2.4. Uwsgi 2.0.20.

uwsgi run (in a script that is run by systemd, as a normal user):

uwsgi -s :3030 --wsgi-file wsgi.py --touch-reload wsgi.py \
      --processes 8 --reload-on-rss 768 -d /data/nova/uwsgi.log \
      --limit-post 500000000 --stats 127.0.0.1:1717 \
      --log-format "[pid: %(pid)|worker: %(wid)|req: -/-] %(addr) [%(ctime)] %(method) %(uri) => generated %(rsize) bytes in %(msecs) msecs (%(proto) %(status)) %(headers) headers in %(hsize) bytes (%(switches) switches on core %(core))"

After a day or two on my ~busy site (few requests/second), I am getting:

Tue May 31 13:51:26 2022 - *** uWSGI listen queue of socket ":3030" (fd: 4) full !!! (100/100) ***
Tue May 31 13:51:27 2022 - *** uWSGI listen queue of socket ":3030" (fd: 4) full !!! (100/100) ***
Tue May 31 13:51:28 2022 - *** uWSGI listen queue of socket ":3030" (fd: 4) full !!! (101/100) ***
Tue May 31 13:51:29 2022 - *** uWSGI listen queue of socket ":3030" (fd: 4) full !!! (101/100) ***

repeated. In top, I don't see any uWSGI processes using CPU, and I think they're in state "S" in ps. It doesn't look like it's busy doing anything, it looks like it's just stalled.

Previously, I was getting dmesg entries about CPU soft lock-ups -- flaky hardware underlying the VM? -- but in today's edition of this problem I don't think I see that.

Looking in the Apache logs, things seem normal leading up to this time, and then I start seeing a bunch of requests that end with http status 500 after 300 seconds.

I'm feeling lost about what to look at to figure out what is going on. I'm still not 100% sure it's not hardware-related. One odd thing I'm seeing is that the process ids are large, like the current uwsgi process is pid 2618919. Maybe that's just life on a working server, but it seemed unusual.

Any hints on things to look into, stats to collect, logging to do, things to investigate next time this happens, would be very much appreciated!

Thank you, dustin

dstndstn commented 2 years ago

Sorry, forgot to say: upon restarting uwsgi (service nova-uwsgi restart), things work fine again.

And with --show-config, the log file starts with:

*** Starting uWSGI 2.0.20 (64bit) on [Tue May 31 14:05:48 2022] ***
compiled with version: 9.3.0 on 18 January 2022 15:56:23
os: Linux-5.4.0-113-generic #127-Ubuntu SMP Wed May 18 14:30:56 UTC 2022
nodename: nova
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 16
current working directory: /home/nova/astrometry/net
detected binary path: /usr/local/bin/uwsgi
your processes number limit is 482928
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :3030 fd 4
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08)  [GCC 9.4.0]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x5557afda4680
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 656136 bytes (640 KB) for 8 cores
*** Operational MODE: preforking ***
Path: /home/nova/astrometry
Logging to /data/nova/nova.log
WSGI app 0 (mountpoint='') ready in 17 seconds on interpreter 0x5557afda4680 pid: 2597129 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 2597129)
spawned uWSGI worker 1 (pid: 2597149, cores: 1)
spawned uWSGI worker 2 (pid: 2597150, cores: 1)
spawned uWSGI worker 3 (pid: 2597151, cores: 1)
spawned uWSGI worker 4 (pid: 2597152, cores: 1)
spawned uWSGI worker 5 (pid: 2597153, cores: 1)
spawned uWSGI worker 6 (pid: 2597154, cores: 1)
spawned uWSGI worker 7 (pid: 2597155, cores: 1)
spawned uWSGI worker 8 (pid: 2597156, cores: 1)
*** Stats server enabled on 127.0.0.1:1717 fd: 24 ***
[pid: 2597152|app: 0|req: 1/2] 99.250.167.114 () {76 vars in 1374 bytes} [Tue May 31 14:06:05 2022] GET / => generated 14924 bytes in 1398 msecs (HTTP/1.1 200) 7 headers in 223 bytes (1 switches on core 0
)

and I guess I should mention that this is a Python/Django site.

Thanks!

dstndstn commented 2 years ago

When this happened again today, I tried attaching gdb to the uwsgi processes, and it's showing here:

#0  __libc_read (nbytes=5, buf=0x56257206f723, fd=10) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=10, buf=0x56257206f723, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007fc851a683d9 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007fc851a6367e in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007fc851a624d4 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007fc851a62aa7 in BIO_read () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007fc851cb4b91 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007fc851cb8e1e in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007fc851cb66d0 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007fc851ce9a31 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007fc851cdf625 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007fc851ccad98 in SSL_do_handshake () from /lib/x86_64-linux-gnu/libssl.so.1.1
#12 0x00007fc851d3b04c in ?? () from /usr/lib/python3.8/lib-dynload/_ssl.cpython-38-x86_64-linux-gnu.so
#13 0x00007fc852abc25b in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#14 0x00007fc85288fd6d in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#15 0x00007fc852891018 in _PyEval_EvalFrameDefault () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#16 0x00007fc8529e5e3b in _PyEval_EvalCodeWithName () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#17 0x00007fc852ac3114 in _PyFunction_Vectorcall () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
...
#207 0x00007fc852ac5407 in _PyObject_FastCallDict () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#208 0x00007fc852ac551d in _PyObject_Call_Prepend () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#209 0x00007fc852a5fbe6 in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#210 0x00007fc852ac3984 in PyObject_Call () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#211 0x000056256dfa2304 in python_call ()
#212 0x000056256dfa4705 in uwsgi_request_wsgi ()
#213 0x000056256df3ebbd in wsgi_req_recv ()
#214 0x000056256df89d74 in simple_loop_run ()
#215 0x000056256df89b54 in simple_loop ()
#216 0x000056256df8e5f4 in uwsgi_ignition ()
#217 0x000056256df92d16 in uwsgi_worker_run ()
#218 0x000056256df93294 in uwsgi_run ()
#219 0x000056256df3e224 in main ()

Doesn't that seem weird? It's doing an SSL handshake??

Looking at lsof for that process, I see

COMMAND     PID USER   FD      TYPE             DEVICE   SIZE/OFF      NODE NAME
uwsgi   2631165 nova    0r      CHR                1,3        0t0         6 /dev/null
uwsgi   2631165 nova    1u      REG             252,32   86586098 270095362 /data/nova/uwsgi.log
uwsgi   2631165 nova    2u      REG             252,32   86586098 270095362 /data/nova/uwsgi.log
uwsgi   2631165 nova    3u     unix 0xffff91454be73400        0t0  25639965 type=DGRAM
uwsgi   2631165 nova    4u     IPv4           25630232        0t0       TCP *:3030 (LISTEN)
uwsgi   2631165 nova    5u  a_inode               0,14          0     10511 [eventpoll]
uwsgi   2631165 nova    6u     sock                0,9        0t0  28514896 protocol: TCP
uwsgi   2631165 nova    7w      REG             252,32 1334720725 270073863 /data/nova/nova.log
uwsgi   2631165 nova    8u     IPv4           28514897        0t0       TCP localhost:59832->localhost:postgresql (ESTABLISHED)
uwsgi   2631165 nova    9u      REG             252,32          0 258163796 /data/tmp/tmp7d2xj9aj
uwsgi   2631165 nova   10u     IPv4           28514901        0t0       TCP nova.arbutus:39176->101.70.154.119:https (ESTABLISHED)
uwsgi   2631165 nova   20u     unix 0xffff9145a7125800        0t0  25630298 type=STREAM
uwsgi   2631165 nova   22u     unix 0xffff9145a7125c00        0t0  25630303 type=STREAM

and fd 10 is an outgoing connection from my server (nova) on https...?

dstndstn commented 2 years ago

Time to set up a cron job to restart uwsgi every day...

dstndstn commented 2 years ago

(Meanwhile, the main process is here:

(gdb) where
#0  0x00007fc85274846e in epoll_wait (epfd=23, events=0x7ffc348661bc, maxevents=1, timeout=1000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000056256df80f19 in event_queue_wait ()
#2  0x000056256df53116 in master_loop ()
#3  0x000056256df930ea in uwsgi_run ()
#4  0x000056256df3e224 in main ()

and its lsof shows

COMMAND     PID USER   FD      TYPE             DEVICE   SIZE/OFF      NODE NAME
uwsgi   2631137 nova    0r      CHR                1,3      0t0         6 /dev/null
uwsgi   2631137 nova    1u      REG             252,32 86639558 270095362 /data/nova/uwsgi.log
uwsgi   2631137 nova    2u      REG             252,32 86639558 270095362 /data/nova/uwsgi.log
uwsgi   2631137 nova    3u     unix 0xffff91454be73400      0t0  25639965 type=DGRAM
uwsgi   2631137 nova    4u     IPv4           25630232      0t0       TCP *:3030 (LISTEN)
uwsgi   2631137 nova    5u     unix 0xffff9145a7123c00      0t0  25630283 type=STREAM
uwsgi   2631137 nova    6u     unix 0xffff9145a7126400      0t0  25630284 type=STREAM
uwsgi   2631137 nova    7u     unix 0xffff9145a7127000      0t0  25630285 type=STREAM
uwsgi   2631137 nova    8u     unix 0xffff9145a7121c00      0t0  25630286 type=STREAM
uwsgi   2631137 nova    9u     unix 0xffff9145a7126000      0t0  25630287 type=STREAM
uwsgi   2631137 nova   10u     unix 0xffff9145a7121400      0t0  25630288 type=STREAM
uwsgi   2631137 nova   11u     unix 0xffff9145a7123400      0t0  25630289 type=STREAM
uwsgi   2631137 nova   12u     unix 0xffff9145a7124c00      0t0  25630290 type=STREAM
uwsgi   2631137 nova   13u     unix 0xffff9145a7122c00      0t0  25630291 type=STREAM
uwsgi   2631137 nova   14u     unix 0xffff9145a7121800      0t0  25630292 type=STREAM
uwsgi   2631137 nova   15u     unix 0xffff9145a7120000      0t0  25630293 type=STREAM
uwsgi   2631137 nova   16u     unix 0xffff9145a7122400      0t0  25630294 type=STREAM
uwsgi   2631137 nova   17u     unix 0xffff9145a7126800      0t0  25630295 type=STREAM
uwsgi   2631137 nova   18u     unix 0xffff9145a7124400      0t0  25630296 type=STREAM
uwsgi   2631137 nova   19u     unix 0xffff9145a7123800      0t0  25630297 type=STREAM
uwsgi   2631137 nova   20u     unix 0xffff9145a7125800      0t0  25630298 type=STREAM
uwsgi   2631137 nova   21u     unix 0xffff9145a7124000      0t0  25630302 type=STREAM
uwsgi   2631137 nova   22u     unix 0xffff9145a7125c00      0t0  25630303 type=STREAM
uwsgi   2631137 nova   23u  a_inode               0,14        0     10511 [eventpoll]
uwsgi   2631137 nova   24u     IPv4           25609936      0t0       TCP localhost:1717 (LISTEN)
dstndstn commented 2 years ago

Apologies! This was me DOS'ing myself by, inadvertently, having my code urlopen() a URL passed in by the client. Turns out that's not a good idea.