unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi

uWSGI workers stop responding SIGHUP #777

Closed dmugtasimov closed 8 years ago

dmugtasimov commented 9 years ago

Please, see description here: http://stackoverflow.com/questions/27017558/why-uwsgi-workers-stop-responding-sighup

P.S. I am not sure if it is a bug, but I decided to file it to bring it to your attention.

unbit commented 9 years ago

One of the jobs of the master is getting rid of badly behaving instances. There is a time range for which workers must be alive before being recycled, another one to protect from fork bombing, and there is a global state in the master that tracks requested-by-unix-signal operations. This is why at some point the respawn is slowed down. You can experience the same thing by setting max-requests to a low value and blasting the instance with requests. At some point the respawn will be blocked automatically.

In addition to this, accessing uwsgi.workers in this way (from a thread in the master) is highly racy (workers are independent of the master, and the pid value you access could be meaningless). This is probably not what is happening to you, but as a general rule the master should never run high-level code, as it can potentially make the whole infrastructure really fragile.
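
For reference, a minimal sketch of the pattern being discussed, assuming the reproduction looks roughly like the code in the linked Stack Overflow question: a timer thread running in the master that walks uwsgi.workers() and SIGHUPs each pid. The interval and names are placeholders, not the reporter's actual code.

```python
import os
import signal
import threading

import uwsgi  # only available inside a uWSGI process


def recycle_workers():
    # uwsgi.workers() returns a tuple of dicts (id, pid, status, ...);
    # reading it from a master thread is the racy access warned about above.
    for worker in uwsgi.workers():
        os.kill(worker['pid'], signal.SIGHUP)


def start_periodic_recycle(interval=5.0):
    # Re-arm a threading.Timer to emulate a simple periodic timer.
    def tick():
        recycle_workers()
        threading.Timer(interval, tick).start()
    threading.Timer(interval, tick).start()
```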

dmugtasimov commented 9 years ago

"This is the reason why at some point the respawn is slowed down." If I set PeriodicTimer to run (send SIGHUP to workers) every 2 seconds, then I start getting: "worker respawning too fast !!! i have to sleep a bit (2 seconds)..."

That is probably what you are talking about. So I set the PeriodicTimer to run every 5 seconds and no longer get "worker respawning too fast !!! i have to sleep a bit (2 seconds)...", but I still get non-responding workers.

Getting a non-existing pid from uwsgi.workers is not a problem. I could check that the pid exists and that it is a child of the master process, but that is not the issue here, so I omit these checks for simplicity. Moreover, sending a signal to a non-existing process would just break the PeriodicTimer thread, and I did not experience that.
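
A hedged illustration of the existence check mentioned above (not the reporter's actual code): signal 0 with os.kill() only probes whether the pid is still there, and ESRCH means it is gone.

```python
import errno
import os
import signal


def sighup_if_alive(pid):
    # Probe for existence first; signal 0 delivers nothing but still
    # raises OSError(ESRCH) if the process no longer exists.
    try:
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False  # worker already gone, nothing to do
        raise
    os.kill(pid, signal.SIGHUP)  # the pid could still disappear in between
    return True
```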

For what I am doing (delivering cached data to workers by forking the master process) I need to run my code in the master process. I asked a question about that here: http://stackoverflow.com/questions/26614024/call-python-code-from-uwsgi-master-cycle Now I call a C function from Python and that is not a problem; I consider it solved. But the hanging workers seem strange.

Do you have any comments on why some workers also receive the signal generated by Ctrl+C?

unbit commented 9 years ago

I fear I would need to see the code; very probably there is memory corruption or some disallowed path. Are you sure you need a thread to manage your cache? Can't the master_cycle be run synchronously? This would simplify the pattern a lot, and debugging problems and corner cases would be easier. Remember that you also have the uwsgi signal framework, which does not suffer from races and allows you to "broadcast" a condition/signal (and it can be run from the master_cycle hook).
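
As a sketch of the uwsgi signal framework mentioned here (the signal number 17 and the handler are arbitrary choices, not anything prescribed by uWSGI):

```python
import uwsgi


def on_cache_refresh(signum):
    # Runs in every worker that receives uwsgi signal 17; the body is a
    # placeholder for whatever per-worker reaction is needed.
    print("uwsgi signal %d handled by worker %d" % (signum, uwsgi.worker_id()))


# Route uwsgi signal 17 to all workers. register_signal() must run in a
# module that the workers import (e.g. the application module).
uwsgi.register_signal(17, "workers", on_cache_refresh)

# From the master (for instance inside a master_cycle hook) the condition
# can then be broadcast race-free with:
#     uwsgi.signal(17)
```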

dmugtasimov commented 9 years ago

Another option is to use the Emperor, but I consider it a fallback if I do not succeed with my solution. I need a separate thread to renew the cache periodically and initiate worker respawns. What I gave here is simplified code to reproduce the problem. In my full-featured solution I synchronize my thread with the master_cycle thread using pthread_mutex_lock() to prevent forking while signals are being sent to the workers, but this makes no difference to the reproduction, so I removed that part from the snippet.

You suggest sending the signals from master_cycle and seeing whether the workers stop hanging or continue to hang. That is an interesting idea for localizing the problem; I will try it, thank you.

Do you also suggest using uwsgi.signal() instead of os.kill()? Can I send SIGHUP without registering it first? Even if that works, it may still not fit my needs, since I need to send SIGHUP to each particular worker rather than making them all die at once, so that the performance impact is distributed over a period of time.
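
A possible way to spread the reloads out, as described here, is simply to sleep between the per-worker SIGHUPs; this sketch still reads uwsgi.workers() directly, so the race caveat from the earlier comment applies, and the delay value is an arbitrary placeholder.

```python
import os
import signal
import time

import uwsgi


def recycle_workers_gradually(delay=10.0):
    # SIGHUP one worker at a time so the respawns (and the resulting
    # performance impact) are distributed over a period of time.
    for worker in uwsgi.workers():
        os.kill(worker['pid'], signal.SIGHUP)
        time.sleep(delay)
```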

unbit commented 9 years ago

Have you tried spawning the thread in the master (using --shared-import) and checking whether the problem is still there? This would help in understanding whether there is some memory corruption.
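
A minimal sketch of what such a module might look like, assuming it is loaded with --shared-import so that it is imported in the master as well; the module name, interval, and body are hypothetical.

```python
# master_cache.py -- hypothetical module for --shared-import
import threading
import time


def refresh_cache_forever(interval=60.0):
    while True:
        # Placeholder for the real cache-refresh / worker-recycle logic.
        time.sleep(interval)


# Started at import time, so the thread runs in whichever process imports
# this module (the master included, when loaded via --shared-import).
_t = threading.Thread(target=refresh_cache_forever)
_t.daemon = True
_t.start()
```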

xrmx commented 8 years ago

1 year without follow-up, closing this.