unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi

pyuwsgi + enable-threads + python3.12: workers do not recover after a reload #2659

Closed: asottile-sentry closed this 1 week ago

asottile-sentry commented 1 month ago

reproduction:

python3.12 -m venv venv
. venv/bin/activate
pip install pyuwsgi
bash t.sh
# wsgi.py
def application(environ, start_response):
    print('.', end='', flush=True)

    status = '200 OK'
    output = b'!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]
# t.sh
UWSGI_MASTER=true \
UWSGI_BINARY_PATH=$PWD/venv/bin/python3 \
UWSGI_VIRTUALENV=$PWD/venv \
UWSGI_WORKERS=2 \
UWSGI_ENABLE_THREADS=true \
UWSGI_MAX_REQUESTS=5 \
UWSGI_HTTP_SOCKET=localhost:9001 \
UWSGI_WSGI_FILE=wsgi.py \
UWSGI_DISABLE_LOGGING=1 \
    venv/bin/uwsgi
# some other tab / window
yes | head -300 | xargs --replace curl localhost:9001/health-check/; echo
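The curl loop can also be driven from a small Python script that counts how many requests succeed before the server stops answering -- a sketch for convenience; only the URL and port come from the reproduction above:

```python
# probe.py -- hit the health-check endpoint repeatedly and count
# successful responses; the count stalls once the workers hang.
import urllib.request


def probe(url, n=300, timeout=2):
    """Return how many of n sequential requests succeeded."""
    ok = 0
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
            ok += 1
        except OSError:  # connection refused / reset / timed out
            break
    return ok


if __name__ == "__main__":
    print(probe("http://localhost:9001/health-check/"))
```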

output:

$ bash t.sh 
*** Starting uWSGI 2.0.26 (64bit) on [Mon Aug 12 16:57:55 2024] ***
compiled with version: Apple LLVM 15.0.0 (clang-1500.3.9.4) on 12 August 2024 17:24:49
os: Darwin-23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000
nodename: FJJ4YYCWYX.attlocal.net
machine: arm64
clock source: unix
pcre jit disabled
detected number of CPU cores: 10
current working directory: /private/tmp/y
detected binary path: /tmp/y/venv/bin/python3
your processes number limit is 5333
your memory page size is 16384 bytes
detected max file descriptor number: 2560
lock engine: OSX spinlocks
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address localhost:9001 fd 3
Python version: 3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]
--- Python VM already initialized ---
Python main interpreter initialized at 0x10154c4f0
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 218688 bytes (213 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x10154c4f0 pid: 63841 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 63841)
spawned uWSGI worker 1 (pid: 63842, cores: 1)
spawned uWSGI worker 2 (pid: 63843, cores: 1)
................................................................................................................................................................................................................................................................................................................................................................................................................................................The work of process 63843 is done. Seeya!
....The work of process 63842 is done. Seeya!
worker 2 killed successfully (pid: 63843)
Respawned uWSGI worker 2 (new pid: 64286)
worker 1 killed successfully (pid: 63842)
Respawned uWSGI worker 1 (new pid: 64287)

(and then no additional output) -- the uwsgi processes also become effectively unkillable (not responding to ^C / ^\, SIGTERM, etc.)

the other terminal output:

$ yes | head -300 | xargs --replace curl localhost:9001/health-check/; echo
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ yes | head -300 | xargs --replace curl localhost:9001/health-check/; echo
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

this does not seem to be an issue with python 3.11 or without pyuwsgi

asottile-sentry commented 1 month ago

cc @methane -- since you seem to have been poking around in --enable-threads and termination recently, you may have an idea what's going on

methane commented 1 month ago

Try #2654

asottile commented 1 month ago

tried that patch out, but unfortunately it doesn't seem to fix it -- still getting hangs

asottile commented 1 month ago

if it's helpful -- an strace of the child processes while stuck shows:

FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=257230947}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=262893986}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=268841818}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=274623638}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

the parent process:

wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0
wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0
wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0
wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0

will see if I can figure out more

asottile commented 1 month ago

ah this is definitely more useful info -- here's the stack of the stuck children:

#0  __futex_abstimed_wait_common64 (private=-1862589240, cancel=true, 
    abstime=0x7ffc90fb2510, op=137, expected=0, 
    futex_word=0xc68c30 <_PyRuntime+93552>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=-1862589240, 
    abstime=0x7ffc90fb2510, clockid=1913253632, expected=0, 
    futex_word=0xc68c30 <_PyRuntime+93552>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (
    futex_word=futex_word@entry=0xc68c30 <_PyRuntime+93552>, 
    expected=expected@entry=0, clockid=clockid@entry=1, 
    abstime=abstime@entry=0x7ffc90fb2510, private=private@entry=0)
    at ./nptl/futex-internal.c:139
#3  0x000078b1bc893e9b in __pthread_cond_wait_common (abstime=0x7ffc90fb2510, 
    clockid=1, mutex=0xc68c38 <_PyRuntime+93560>, 
    cond=0xc68c08 <_PyRuntime+93512>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_timedwait64 (cond=cond@entry=0xc68c08 <_PyRuntime+93512>, 
    mutex=mutex@entry=0xc68c38 <_PyRuntime+93560>, 
    abstime=abstime@entry=0x7ffc90fb2510) at ./nptl/pthread_cond_wait.c:652
#5  0x0000000000654a1e in PyCOND_TIMEDWAIT (us=<optimized out>, 
    mut=0xc68c38 <_PyRuntime+93560>, cond=0xc68c08 <_PyRuntime+93512>)
    at ../Python/condvar.h:73
#6  take_gil (tstate=tstate@entry=0xcc6108 <_PyRuntime+475720>)
    at ../Python/ceval_gil.c:376
#7  0x0000000000654e80 in PyEval_RestoreThread (
    tstate=0xcc6108 <_PyRuntime+475720>) at ../Python/ceval_gil.c:708
#8  0x000078b1bbe6046c in uwsgi_python_master_fixup (step=<optimized out>)
    at plugins/python/python_plugin.c:1320
#9  0x000078b1bbe14866 in uwsgi_respawn_worker (wid=wid@entry=2)
    at core/master_utils.c:757
#10 0x000078b1bbe12a32 in master_loop (argv=0x2489670, environ=<optimized out>)
    at core/master.c:1084
#11 0x000078b1bbe54290 in uwsgi_run () at core/uwsgi.c:3305
#12 0x000078b1bbe5f0b7 in pyuwsgi_run (self=<optimized out>, 
    args=args@entry=(), kwds=kwds@entry=0x0) at plugins/pyuwsgi/pyuwsgi.c:159
#13 0x0000000000534e56 in cfunction_call (
    func=func@entry=<built-in method run of module object at remote 0x78b1bbf15b50>, args=args@entry=(), kwargs=kwargs@entry=0x0)
    at ../Objects/methodobject.c:537
#14 0x00000000004ce432 in _PyObject_MakeTpCall (
    tstate=tstate@entry=0xcc6108 <_PyRuntime+475720>, 
    callable=callable@entry=<built-in method run of module object at remote 0x78b1bbf15b50>, args=args@entry=0x78b1bcc66088, nargs=<optimized out>, 

(edited with debug line numbers against 3.12.4 + 2.0.26 + #2654)
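For context on that stack: take_gil blocking forever after a fork is the classic fork-vs-threads hazard -- a process forks while some thread owns a lock, and on the other side of the fork there is no longer any thread that will ever release it. A purely illustrative sketch of the same pattern (an ordinary threading.Lock standing in for the GIL; this is an analogy for the failure mode, not the actual uWSGI code path):

```python
# Illustrative only: a thread owns a lock when we fork; the child
# inherits the locked lock but not the owning thread, so the lock
# can never be released in the child (the GIL case is analogous).
import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    with lock:
        time.sleep(3)  # keep the lock held across the fork

t = threading.Thread(target=hold_lock)
t.start()
time.sleep(0.2)  # make sure hold_lock() owns the lock before forking

pid = os.fork()  # note: 3.12 warns about forking a threaded process
if pid == 0:
    # in the child, an untimed acquire() would hang forever, much like
    # take_gil in the traceback above; use a timeout to observe it
    os._exit(0 if lock.acquire(timeout=1) else 1)

_, status = os.waitpid(pid, 0)
child_got_lock = os.waitstatus_to_exitcode(status) == 0
print("child acquired inherited lock:", child_got_lock)
t.join()
```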

methane commented 1 month ago

Try lazy-apps.

https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#preforking-vs-lazy-apps-vs-lazy
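For this reproduction, lazy-apps can be enabled in the same environment-variable style as t.sh (the variable name follows uWSGI's standard mapping of UWSGI_* env vars onto CLI options; sketch, untested here):

```shell
# in t.sh, additionally export this before launching uwsgi: each worker
# then loads the WSGI app after fork instead of inheriting interpreter
# state from the master (equivalent to --lazy-apps)
export UWSGI_LAZY_APPS=true
```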

asottile-sentry commented 1 month ago

still an issue. I'm going to try and bisect cpython today and see if I can't find what changed and where the bug lies

asottile commented 1 month ago

alrighty -- here's the result of bisection!

92d8bfffbf377e91d8b92666525cb8700bb1d5e8 is the first bad commit

https://github.com/python/cpython/commit/92d8bfffbf377e91d8b92666525cb8700bb1d5e8
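For anyone repeating this, a cpython bisection can be scripted roughly as follows -- a sketch only: the repro.sh driver and the exact good/bad endpoints are assumptions, not from the report:

```shell
# bisect cpython between a known-good 3.11 and a failing 3.12
# (tags and the driver script are illustrative)
git clone https://github.com/python/cpython && cd cpython
git bisect start v3.12.0 v3.11.0   # bad first, then good
git bisect run ../repro.sh         # exits non-zero when the workers hang
git bisect reset
```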

asottile commented 1 month ago

created https://github.com/python/cpython/pull/123079 with a potential fix

asottile commented 2 weeks ago

btw I have a fix for this in #2660 -- if you're using pyuwsgi from pypi this fix is cherry-picked and available in 2.0.27a1
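Since 2.0.27a1 is a prerelease, the version has to be pinned explicitly -- pip skips alpha versions by default:

```shell
# install the prerelease containing the cherry-picked fix;
# plain `pip install pyuwsgi` would not pick up an alpha
pip install 'pyuwsgi==2.0.27a1'
```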

xrmx commented 1 week ago

Fixed in #2670