Open nickwilliams-eventbrite opened 5 years ago
Agree with your analysis, and i think you can drop the >
as busy_workers is calculated looping on uwsgi.numproc which as far i can see is set on startup and never touched. I think the >
would (maybe?) make sense only if compared with the number of busy workers + the ":zerg" ones (uwsgi.emperor_broodlord_count
) from vassal_sos
from https://uwsgi-docs.readthedocs.io/en/latest/Broodlord.html
@xrmx - in core/metrics.c
the overloaded
metric is registered as UWSGI_METRIC_COUNTER
- I would say incrementing overloaded
is the right thing. Busy/idle workers in contrast are registered as UWSGI_METRIC_GAUGE
.
Changing the overloaded
metric may actually break existing setups - lets not?
@nickwilliams-eventbrite - just apply a derivative to your series - graphite/carbon, influxdb, prometheus ... all will allow you to do that in some way.
@awelzel: Lots of uWSGI's metrics are counters, but overloaded
is the only one that increases to infinity. All of uWSGI's other metrics go up and down, up and down. Also, overloaded
resets to zero when uWSGI restarts, so even if we wanted it to increase to infinity forever, it couldn't. overloaded
is the lone wolf among uWSGI metrics and is clearly broken.
(FWIW: We use Datadog/Grafana, and no combination of functions, etc. has been able to make the graph of overloaded
look like it's supposed to look.)
@nickwilliams-eventbrite
overloaded is the lone wolf among uWSGI metrics and is clearly broken.
I disagree - requests
(number of requests processed), failed_requests
, respawn_count
, routed_signals
and even running_time
all behave like this.
(I do agree the name overloaded
might have been an unfortunate choice.)
When talking about "counter metrics" - they actually never go down (unless it's reset to 0).
Regarding Datadog, you should get what you want by using diff()
- not tested though...
diff(<METRIC_NAME>{*})
https://docs.datadoghq.com/graphing/functions/rate/#delta-value
@awelzel yeah, we cannot break current setups.
Probably some documentation will help though.
BTW The diff()
works fine for me :)
diff
does not work for me. It results in a wildly fluctuating graph from 250 to -250 even on an idle server, and we only have 16 servers deployed, so this number should never be less than zero or more than 16.
Indeed, I was misreading some of these metrics, which are gauges, not counters. If we're concerned about breaking existing setups and also with being consistent across our metrics, then I propose a new gauge metric, core.is_overloaded
or core.overloaded.gauge
. It will always be either 0 or 1. Being a gauge, it will be consistent with the core.busy_workers
and core.idle_workers
metrics, which are also gauges.
@nickwilliams-eventbrite i guess you have to take into account the negative diff after a restart with something like max(0, diff(overloaded))
. It looks correct to be bigger than num servers as the value will be incremented each time the instance is overloaded, not just on the first one. The fact that overload is a counter permits to show not only that your instances are overloaded but also since when they are.
@nickwilliams-eventbrite Assuming you're using the uwsgi-dogstatsd plugin, since Datadog doesn't have great support for monotonic counters I recommend setting dogstatsd-all-gauges = true
so the counter gets reported as a gauge. Then in the Datadog UI you can use the monotonic_diff()
function with rollup(max)
to graph the incremental counts.
If the number of busy workers is greater than (not possible, right?) or equal to (possible) the number of configured workers (the server is maxed out), uWSGI increments the
core.overloaded
metric:https://github.com/unbit/uwsgi/blob/3149df02ed443131c54ea6afb29fcbb0ed4d1139/core/master.c#L50-L51
The problem is that this counter is getting incremented every time the server is overloaded. Contrast that to the number of idle and busy workers, which simply get set to the number of busy or idle workers:
https://github.com/unbit/uwsgi/blob/3149df02ed443131c54ea6afb29fcbb0ed4d1139/core/master.c#L63-L64
The result of this, in our graphs, is that the number of busy workers goes up as load increases and goes down as load decreases (correct), but the
core.overloaded
metric only ever goes up, never down (incorrect). Every time a server is overloaded, the value increases. It never decreases. Even with load discontinued and all workers idle (120 idle workers, 0 busy workers), our "overloaded" counter shows 4,551 (there are only 120 total configured workers across 15 servers, so this should be a maximum of 15). Only by restarting all the uWSGI servers does the "overloaded" counter return to 0.I believe that the increment should actually be
= 1
, and that theif
block should have anelse
block that sets the metric= 0
, like so: