judgeaxl opened this issue 6 years ago
I am observing the same behavior: 1 worker, 10 threads, master on. When `max-worker-lifetime` triggers, the worker is terminated but not immediately reloaded (as it would be if it were notified with SIGHUP or SIGTERM). As a result, the first request after such a termination gets an HTTP 500 response, with nothing interesting in either the Apache or the uWSGI log (I am using mod_proxy_uwsgi). Then the worker respawns and runs fine.
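For reference, this is roughly the configuration I'm describing; the socket, module, and lifetime values below are placeholders rather than my real settings:

```ini
[uwsgi]
; roughly the setup described above: master enabled, 1 worker, 10 threads
master = true
processes = 1
threads = 10
; recycle the worker after this many seconds (illustrative value)
max-worker-lifetime = 3600
; served to Apache through mod_proxy_uwsgi over the uwsgi protocol
socket = 127.0.0.1:3031
; placeholder application entry point
module = myapp.wsgi:application
```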
This may be related to #1221. I am also seeing this issue, and if you're not seeing a `Respawned uWSGI worker` message in the logs, it looks like there is likely a deadlock in the Python cleanup code, which may also be related to #1969 (although that one looks like it produces a crash rather than a deadlock).
I just enabled `--max-worker-lifetime` and it hung my dev server. I tested `--max-requests` and it does not exhibit the same behavior, because it respects the worker mercy timeout; the lifetime code should probably be moved/updated to use `uwsgi_curse()`, as @judgeaxl suggests.
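To spell out what I mean by "respects the worker mercy timeout", here is a rough, self-contained C sketch of the cursing idea as I understand it; the struct and field names are my own simplification, not uwsgi's actual internals:

```c
/* Sketch only: a simplified model of "cursing" a worker with a reload-mercy
 * deadline, as described in this thread. Not uwsgi source code. */
#include <signal.h>
#include <sys/types.h>
#include <time.h>

struct worker {
    pid_t  pid;
    time_t cursed_at;    /* when a reload was requested */
    time_t no_mercy_at;  /* hard deadline after which the worker is killed */
};

/* Ask the worker to reload, and record a deadline (worker-reload-mercy). */
static void curse_worker(struct worker *w, int sig, int reload_mercy) {
    w->cursed_at = time(NULL);
    w->no_mercy_at = w->cursed_at + reload_mercy;
    if (sig)
        kill(w->pid, sig);        /* polite request, e.g. SIGHUP */
}

/* Run periodically by the master: if the worker ignored the polite signal
 * (for example because it deadlocked in Py_FinalizeEx), kill it outright
 * instead of letting it hang forever. */
static void enforce_mercy(struct worker *w) {
    if (w->cursed_at && time(NULL) >= w->no_mercy_at)
        kill(w->pid, SIGKILL);
}
```

The point is that `--max-requests` goes through a path like this, while the lifetime check apparently just signals the worker once and never follows up.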
If your issue is indeed a deadlock (or at least what I'm assuming is a deadlock), you can use the `--skip-atexit-teardown` option to bypass Python finalization (see https://github.com/unbit/uwsgi/pull/1392). I'll probably try to debug what is causing the deadlock; to reach this conclusion I sent `SIGUSR2` to the worker process and received the following trace:
```
uwsgi(uwsgi_backtrace+0x2a) [0x55d6faf5023a]
uwsgi(what_i_am_doing+0x17) [0x55d6faf50367]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f1825afb840]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10894) [0x7f18269e8894]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10988) [0x7f18269e8988]
/usr/local/lib/libpython3.9.so.1.0(PyThread_acquire_lock_timed+0x4d) [0x7f1825e6f2fd]
/usr/local/lib/libpython3.9.so.1.0(+0x203e9f) [0x7f1825e8ae9f]
/usr/local/lib/libpython3.9.so.1.0(+0x14f37a) [0x7f1825dd637a]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(+0x1bdfe9) [0x7f1825e44fe9]
/usr/local/lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0x19c) [0x7f1825dd146c]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0x102) [0x7f1825dd13d2]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0x102) [0x7f1825dd13d2]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(+0x14a668) [0x7f1825dd1668]
/usr/local/lib/libpython3.9.so.1.0(PyVectorcall_Call+0x5c) [0x7f1825dd242c]
/usr/local/lib/libpython3.9.so.1.0(+0x269040) [0x7f1825ef0040]
/usr/local/lib/libpython3.9.so.1.0(+0xe9a61) [0x7f1825d70a61]
/usr/local/lib/libpython3.9.so.1.0(Py_FinalizeEx+0x3b) [0x7f1825ed62fb]
uwsgi(uwsgi_plugins_atexit+0x71) [0x55d6faf4e5f1]
/lib/x86_64-linux-gnu/libc.so.6(+0x39d8c) [0x7f1825afdd8c]
/lib/x86_64-linux-gnu/libc.so.6(+0x39eba) [0x7f1825afdeba]
uwsgi(+0x400ef) [0x55d6faf060ef]
uwsgi(end_me+0x25) [0x55d6faf4d295]
uwsgi(uwsgi_ignition+0x127) [0x55d6faf507d7]
uwsgi(uwsgi_worker_run+0x25e) [0x55d6faf54ffe]
uwsgi(uwsgi_run+0x434) [0x55d6faf55554]
uwsgi(+0x3cf4e) [0x55d6faf02f4e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f1825ae809b]
uwsgi(_start+0x2a) [0x55d6faf02f7a]
```
Noticing the call to `uwsgi_plugins_atexit`, I found the `--skip-atexit-teardown` option, and with it the worker does indeed restart properly now (and some "NO MERCY" warnings during restarts have gone away as well). I'm not sure whether I need the teardown code, but I figured I'd at least report that `max-worker-lifetime` isn't respecting `worker-reload-mercy`; it looks like this issue may be exactly that. @judgeaxl, do you mind updating the title/description to mention `worker-reload-mercy`?
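For anyone who just wants the stopgap from this thread in one place, here is a sketch of the relevant options; the numbers are arbitrary examples, and note that `skip-atexit-teardown` skips Python finalization at exit, so avoid it if you depend on atexit handlers:

```ini
[uwsgi]
master = true
; recycle workers periodically (example value)
max-worker-lifetime = 43200
; how long a recycled worker gets to shut down before being killed
; (the option the lifetime check reportedly does not honor)
worker-reload-mercy = 60
; workaround from this thread: skip the Python teardown that can deadlock
skip-atexit-teardown = true
```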
@mthu I think your issue is possibly unrelated, since @judgeaxl is describing a process that is marked as "should restart" but never checked on again. That is why I suggested this is related to #1221: it also describes the behavior I am seeing, where the system stops responding to requests and ends up causing gateway timeouts.
Just to add a data point: I think I hit this deadlock yesterday in production. My uWSGI hung after trying to recycle workers that had reached `max-worker-lifetime`.
I had changed my production setup a week earlier; it performed ~50 recycles/day until uWSGI hung with log lines like `worker 6 lifetime reached, it was running for 43201 second(s)`. It completely hung one machine, and on the other I think only part of the workers locked up.
I don't have many details about it; I just removed the config.
I'm running on Ubuntu 20.04 with Python 3.9.12 (recently upgraded from 3.9.11) and uWSGI 2.0.20.
We have the same issue with uWSGI 2.0.20 and Python 3.9.18. This is a really critical bug, because there is no way to make sure that workers respawn correctly in a production environment.
We also experienced the same issue with uWSGI 2.0.23 and Python 3.10.2.
We have experienced this over the last three days. Does anyone have any suggestions?
In case this helps other people: I have fixed my issue; the problem was with my configuration and Sentry version. See https://github.com/getsentry/sentry-python/issues/2699#issuecomment-1944336675
FYI @Bertrand67 @iurisilvio
Using `max-worker-lifetime` doesn't seem to work. It will correctly log that the lifetime has ended, but it doesn't actually restart the workers, which means it'll never try again. Both the PID and `last_spawned` remain from the initial launch and are never updated.

The code in `master_checks.c` sends the `SIGWINCH` signal when the lifetime has passed, but that signal is only listened to by a couple of empty functions in `emperor.c` and `master_utils.c`. It's also listened to in `pty.c`, but there it actually seems to look at the terminal size and has nothing to do with worker management.

I'm not very well read up on POSIX signals, so maybe I'm not following this properly, but looking at the other limit checks, they all seem to either `kill()` the processes or send SIGHUP via the `uwsgi_curse()` function.

I'm running in master mode, with 8 processes, 1 thread, gevent > 1, and no other limiters configured.
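To make the `uwsgi_curse()` suggestion concrete, here is a hedged sketch of what the lifetime check could look like if it cursed the worker instead of sending a bare `SIGWINCH`. The helper names and struct fields are approximations of uwsgi's internals from my reading, not verbatim source, so treat this as an illustration rather than a patch:

```c
/* Sketch of the proposed change, following the pattern of uwsgi core code.
 * Field and helper names are approximate and may not match the real source. */
#include <uwsgi.h>

extern struct uwsgi_server uwsgi;

static void check_worker_lifetime_sketch(int i, time_t now) {
    if (!uwsgi.max_worker_lifetime)
        return;
    if ((uint64_t)(now - uwsgi.workers[i].last_spawn) < uwsgi.max_worker_lifetime)
        return;

    uwsgi_log("worker %d lifetime reached, it was running for %llu second(s)\n",
              i, (unsigned long long) (now - uwsgi.workers[i].last_spawn));

    /* Instead of kill(pid, SIGWINCH): record a reload deadline so the master
     * will force-respawn the worker even if it deadlocks while shutting down,
     * honoring worker-reload-mercy like the other limit checks do. */
    uwsgi_curse(i, SIGHUP);
}
```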