unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi

max-worker-lifetime isn't recycling workers #1760

Open judgeaxl opened 6 years ago

judgeaxl commented 6 years ago

Using max-worker-lifetime doesn't seem to work.

It will correctly log that the lifetime has ended, but it doesn't actually restart the workers, which means it'll never try again. Both the PID and last_spawned remain from the initial launch and are never updated.

The code in master_checks.c sends the SIGWINCH signal when the lifetime has passed, but the signal is only listened to by a couple of empty functions in emperor.c and master_utils.c. It's also listened to in pty.c, but there it actually looks at the terminal size and has nothing to do with worker management.

I'm not very well read up on POSIX signals, so maybe I'm not following this properly, but looking at the other limit checks, they all seem to either kill() the processes, or use SIGHUP via the uwsgi_curse() method.

I'm running in master mode with 8 processes, 1 thread, gevent set to a value greater than 1, and no other limiters configured.
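
Roughly, that corresponds to something like the following ini (the gevent and lifetime values here are illustrative assumptions, not my exact settings):

[uwsgi]
; master mode, 8 processes, 1 thread each
master = true
processes = 8
threads = 1
; gevent async cores per worker (assumed value, anything greater than 1)
gevent = 100
; recycle each worker after this many seconds (assumed value)
max-worker-lifetime = 3600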

mthu commented 4 years ago

I am observing the same behavior. 1 worker, 10 threads, master is on.

When max-worker-lifetime triggers, the worker is terminated but not immediately reloaded (as it would be if it were notified with SIGHUP or SIGTERM). As a result, I get an HTTP 500 response to the first request after such a termination (with nothing interesting in either the Apache or the uWSGI log; I am using mod_proxy_uwsgi). Then the worker respawns and runs fine.
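
For reference, my setup corresponds to something like this (the socket path and lifetime value are illustrative assumptions):

[uwsgi]
; single worker with 10 threads, master enabled
master = true
processes = 1
threads = 10
; uwsgi socket that Apache's mod_proxy_uwsgi connects to (assumed path)
socket = /run/uwsgi/app.sock
; recycle the worker after this many seconds (assumed value)
max-worker-lifetime = 86400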

terencehonles commented 3 years ago

This may be related to #1221. I am also seeing this issue, and if you're not seeing a "Respawned uWSGI worker" message in the logs, there is likely a deadlock in the Python cleanup code, which may also be related to #1969 (although that one looks like it produces a crash rather than a deadlock).

I just enabled --max-worker-lifetime and it hung my dev server. I tested --max-requests and it does not exhibit the same behavior, because it respects the worker mercy timeout; the lifetime code should probably be moved/updated to use uwsgi_curse, as @judgeaxl suggests.
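
For comparison, a minimal sketch of recycling via --max-requests plus a reload mercy window, the combination that did not hang in my testing (the option values are illustrative assumptions):

[uwsgi]
master = true
processes = 4
; recycle a worker after it has served this many requests (assumed value)
max-requests = 5000
; seconds a recycled worker gets to shut down before being killed (assumed value)
worker-reload-mercy = 60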

If your issue is indeed a deadlock (or at least what I'm assuming is a deadlock), you can use the option --skip-atexit-teardown to bypass Python finalization (see https://github.com/unbit/uwsgi/pull/1392). I'll probably try to debug what is causing the deadlock, but to reach this conclusion I sent SIGUSR2 to the worker process and received the following trace:

uwsgi(uwsgi_backtrace+0x2a) [0x55d6faf5023a]
uwsgi(what_i_am_doing+0x17) [0x55d6faf50367]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f1825afb840]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10894) [0x7f18269e8894]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10988) [0x7f18269e8988]
/usr/local/lib/libpython3.9.so.1.0(PyThread_acquire_lock_timed+0x4d) [0x7f1825e6f2fd]
/usr/local/lib/libpython3.9.so.1.0(+0x203e9f) [0x7f1825e8ae9f]
/usr/local/lib/libpython3.9.so.1.0(+0x14f37a) [0x7f1825dd637a]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(+0x1bdfe9) [0x7f1825e44fe9]
/usr/local/lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0x19c) [0x7f1825dd146c]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0x102) [0x7f1825dd13d2]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0x102) [0x7f1825dd13d2]
/usr/local/lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x790) [0x7f1825e46350]
/usr/local/lib/libpython3.9.so.1.0(+0x14a668) [0x7f1825dd1668]
/usr/local/lib/libpython3.9.so.1.0(PyVectorcall_Call+0x5c) [0x7f1825dd242c]
/usr/local/lib/libpython3.9.so.1.0(+0x269040) [0x7f1825ef0040]
/usr/local/lib/libpython3.9.so.1.0(+0xe9a61) [0x7f1825d70a61]
/usr/local/lib/libpython3.9.so.1.0(Py_FinalizeEx+0x3b) [0x7f1825ed62fb]
uwsgi(uwsgi_plugins_atexit+0x71) [0x55d6faf4e5f1]
/lib/x86_64-linux-gnu/libc.so.6(+0x39d8c) [0x7f1825afdd8c]
/lib/x86_64-linux-gnu/libc.so.6(+0x39eba) [0x7f1825afdeba]
uwsgi(+0x400ef) [0x55d6faf060ef]
uwsgi(end_me+0x25) [0x55d6faf4d295]
uwsgi(uwsgi_ignition+0x127) [0x55d6faf507d7]
uwsgi(uwsgi_worker_run+0x25e) [0x55d6faf54ffe]
uwsgi(uwsgi_run+0x434) [0x55d6faf55554]
uwsgi(+0x3cf4e) [0x55d6faf02f4e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f1825ae809b]
uwsgi(_start+0x2a) [0x55d6faf02f7a]

Noticing the call to uwsgi_plugins_atexit, I found the option --skip-atexit-teardown, and with it the worker does indeed restart properly now (and some "NO MERCY" warnings during restarts have gone away). I'm not sure whether I need the teardown code, but I figured I'd at least report that max-worker-lifetime isn't respecting worker-reload-mercy, which looks like it may be what this issue is about. @judgeaxl, do you mind updating the title/description to mention worker-reload-mercy?
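
For anyone hitting the same thing, a sketch of the workaround configuration described above (the numeric values are illustrative assumptions):

[uwsgi]
master = true
processes = 4
; recycle each worker after this many seconds (assumed value)
max-worker-lifetime = 3600
; seconds a recycled worker gets to shut down cleanly (assumed value)
worker-reload-mercy = 60
; skip Python finalization at worker exit to avoid the teardown deadlock
skip-atexit-teardown = true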

@mthu I think your issue is possibly unrelated, since @judgeaxl is describing a process that is marked as "should restart" but is never checked on again. That is why I suggested this is related to #1221: it also describes the behavior I am seeing, where the system stops responding to requests and ends up producing gateway timeouts.

iurisilvio commented 2 years ago

Just to add a data point: I think I reached this deadlock yesterday in production. My uwsgi hung after trying to recycle workers that had reached max-worker-lifetime.

I had changed my production setup a week earlier; it did ~50 recycles per day until uwsgi hung with log lines like "worker 6 lifetime reached, it was running for 43201 second(s)". One machine hung completely, and on the other I think only part of the workers locked up.

I don't have many details about it; I just removed the config.

I'm running it on Ubuntu 20.04 with Python 3.9.12 (recently upgraded from 3.9.11) and uwsgi 2.0.20.

Bertrand67 commented 9 months ago

We have the same issue with uwsgi 2.0.20 and Python 3.9.18. This is a really critical bug, because there is no way to make sure that workers respawn correctly in a production environment.

dineshtrivedi commented 8 months ago

We also experienced the same issue with uwsgi 2.0.23 and Python 3.10.2

We have experienced this over the last three days. Does anyone have any suggestions?

dineshtrivedi commented 8 months ago

In case this helps other people: I have fixed my issue; the problem was with my configuration and Sentry version. See https://github.com/getsentry/sentry-python/issues/2699#issuecomment-1944336675

FYI @Bertrand67 @iurisilvio