unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi

cheaper_busyness conflicts with gevent? #880

Open dmk23 opened 9 years ago

dmk23 commented 9 years ago

I have cheaper_busyness configured together with gevent, starting with 4 workers.

I ran a few request groups over a few hours (each "request group" hitting several web resources under uWSGI at once); about half of them spawned new workers:

[busyness] 30s average busyness is at 62%, will spawn 1 new worker(s)
spawned uWSGI worker 9 (pid: 6690, cores: 100)
python tracebacker for worker 9 available on /opt/tc-live/logs/tracebacks/traffic.sock9

Now, the disturbing thing is that despite hours of complete absence of requests, the workers never get collected. I suspect busyness gets confused by async activity in gevent, mistaking gevent's management of greenlets for actual request activity. I should mention that system-wide CPU usage shows the uWSGI processes as nearly idle.

Here is my relevant config -

http-socket = 0.0.0.0:$(PORT)
stats = 0.0.0.0:$(STATUS_PORT)
listen = 1000

master = True
vacuum = True
die-on-term = True
no-orphans = True
single-interpreter = True
strict = True

processes = 20
gevent = 100
enable-threads = True
max-fd = 100000
thunder-lock = True

cheaper-algo = busyness
cheaper = 4
cheaper-initial = 4
cheaper-step = 1
cheaper-overload = 30
cheaper-busyness-max = 55

http-timeout = 30
buffer-size = 8192

enable-metrics = True
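
To double-check whether the extra workers are really idle, I have been polling the stats server with a small script along these lines (just a sketch; it assumes the STATUS_PORT above resolves to 9191 and relies on the per-worker "cores"/"in_request" fields I see in the stats JSON from my build):

# sketch: dump which workers/cores the stats server currently reports as
# handling a request (host/port are placeholders for the stats socket above)
import json
import socket

def read_stats(host="127.0.0.1", port=9191):
    s = socket.create_connection((host, port))
    chunks = []
    try:
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        s.close()
    return json.loads(b"".join(chunks).decode("utf-8"))

stats = read_stats()
for worker in stats.get("workers", []):
    busy = sum(1 for core in worker.get("cores", []) if core.get("in_request"))
    print("worker %s status=%s busy_cores=%d requests=%d"
          % (worker["id"], worker["status"], busy, worker["requests"]))

The idea is just to see whether uWSGI itself considers the cores idle, independently of what the busyness accounting concludes.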

P.S. The issue might be related to requests that spawn extra threads/greenlets to complete their work. I noticed that if such requests are not involved, busyness does not even try to spawn new workers. Perhaps it cannot properly account for user-spawned greenlets?

I should note that all our greenlets are expected to complete, though I am not sure of the right way to validate this 100% in the Python runtime... Would the tracebacker be suitable for that?
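
For what it is worth, the closest I have come to validating this is to keep references to the greenlets a request spawns and join them with a timeout before the response goes out, logging anything still alive. A minimal sketch of the idea (background_task and the timeout are made up for illustration; this is not our production code):

# sketch: keep references to request-spawned greenlets so leftovers can be
# detected; background_task is a made-up placeholder for the real work
import logging
import gevent

log = logging.getLogger(__name__)

def background_task(environ):
    pass  # placeholder for whatever the request kicks off

def handle_request(environ, start_response):
    tasks = [gevent.spawn(background_task, environ) for _ in range(2)]

    # ... build the actual response here ...

    # give our own greenlets a bounded time to finish, then report leftovers
    gevent.joinall(tasks, timeout=5)
    leftover = [g for g in tasks if not g.ready()]
    if leftover:
        log.warning("%d greenlet(s) still running after the request", len(leftover))

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done"]

It does not prove anything about greenlets spawned deeper inside libraries, but at least our own ones cannot leak silently.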

P.P.S. When new workers are spawned by busyness, the clients waiting on them (via Apache mod_proxy) sometimes lose their requests:

[Wed Apr 08 13:48:14 2015] [error] [client xx.xx.xx.xx] (70007)The timeout specified has expired: proxy: error reading status line from remote server ZZZ
[Wed Apr 08 13:48:14 2015] [error] [client xx.xx.xx.xx] proxy: Error reading from remote server returned by /myuri
unbit commented 9 years ago

Looking at the busyness code, I think it does not take workers' async cores into account, only a global value for each process. I am not even sure the plugin has ever been tested in non-process modes.

Are you sure the simpler cheaper algos are not enough?

dmk23 commented 9 years ago

We picked busyness because it seemed to provide the most sensible and flexible controls for managing resources. Now that I have had a chance to observe its behavior in an async environment, I think it is worth pointing out several issues / open questions:

  1. The problem might not be limited to the busyness algo. Any other cheaper algo has to make some determination of when the load is high enough to spawn new workers or stop existing ones.
  2. One could argue that the cheaper system is not really needed for well-behaved async setups. However, when too many async tasks run concurrently per process, they might not always yield gracefully enough, so having extra processes would let the OS allocate CPU time more fairly.
  3. It seems an async environment might warrant some special cheaper config settings (for busyness and for other algos). Maybe something that maps the number of active async threads, or the number of threads weighted by CPU usage, or something else, onto the "standard" process utilization metric (a rough sketch of what I have in mind follows this list). I am not familiar enough with the internals to suggest the most feasible way to implement this...
  4. Loss of connections while spawning new processes still seems like a separate and unpleasant problem. I am not sure what else I can do to debug or document it, other than reporting what I see in the logs. Could this have something to do with how the listen queue is affected by cheaper?
  5. I am still wondering what the right solution might be before/without any of these possible fixes being implemented. It seems to me the backlog method might be the least affected by async challenges, but I am concerned it could result in too many async tasks clogging each process, since in an async environment they might be accepted from the listen queue far too readily.
  6. Whether or not these notes result in any new features / settings, there ought to be some documentation on warnings and best practices, probably in both the cheaper and async/gevent sections.
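
To make point 3 concrete, the kind of metric I have in mind could be computed externally from the stats JSON roughly like this (the function name is made up, read_stats() refers to the helper sketched earlier in this thread, and this is not how the busyness plugin currently works):

# sketch of the metric from point 3: treat each worker as utilized in
# proportion to how many of its async cores are inside a request
def async_busyness(stats):
    ratios = []
    for worker in stats.get("workers", []):
        cores = worker.get("cores", [])
        if not cores:
            continue
        in_request = sum(1 for core in cores if core.get("in_request"))
        ratios.append(float(in_request) / len(cores))
    # average utilization across workers, as a percentage comparable to
    # the existing cheaper-busyness-max threshold
    return 100.0 * sum(ratios) / len(ratios) if ratios else 0.0

# usage (with the read_stats() helper sketched earlier in this thread):
#   print("async busyness: %.1f%%" % async_busyness(read_stats()))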

Any further thoughts appreciated...