unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi

core.overloaded metric is broken, just climbs instead of resetting #2036

Open · nickwilliams-eventbrite opened this issue 5 years ago

nickwilliams-eventbrite commented 5 years ago

If the number of busy workers is greater than (not possible, right?) or equal to (possible) the number of configured workers (the server is maxed out), uWSGI increments the core.overloaded metric:

https://github.com/unbit/uwsgi/blob/3149df02ed443131c54ea6afb29fcbb0ed4d1139/core/master.c#L50-L51

The problem is that this counter is incremented on every check that finds the server overloaded, so it only ever climbs. Contrast that with the idle and busy worker metrics, which are simply set to the current number of busy or idle workers:

https://github.com/unbit/uwsgi/blob/3149df02ed443131c54ea6afb29fcbb0ed4d1139/core/master.c#L63-L64
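
To make the contrast concrete, here is a rough paraphrase of the two linked snippets (not a verbatim copy of core/master.c, and the exact ushared field names are my assumption):

if (busy_workers >= (uint64_t) uwsgi.numproc) {
    ushared->overloaded++;                /* counter: climbs on every check while overloaded */
}
/* ... */
ushared->busy_workers = busy_workers;     /* gauge: simply set to the current value */
ushared->idle_workers = idle_workers;     /* gauge: simply set to the current value */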

The result of this, in our graphs, is that the number of busy workers goes up as load increases and goes down as load decreases (correct), but the core.overloaded metric only ever goes up, never down (incorrect). Every time a server is overloaded, the value increases. It never decreases. Even with load discontinued and all workers idle (120 idle workers, 0 busy workers), our "overloaded" counter shows 4,551 (there are only 120 total configured workers across 15 servers, so this should be a maximum of 15). Only by restarting all the uWSGI servers does the "overloaded" counter return to 0.

I believe the increment should actually be an assignment to 1, and that the if block should have an else branch that sets the metric back to 0, like so:

if (busy_workers >= (uint64_t) uwsgi.numproc) {
    /* every configured worker is busy: flag the server as overloaded */
    ushared->overloaded = 1;
    ... do other things ...
} else {
    /* at least one worker is free again: clear the flag */
    ushared->overloaded = 0;
}
xrmx commented 5 years ago

Agree with your analysis, and I think you can drop the > since busy_workers is calculated by looping over uwsgi.numproc, which as far as I can see is set at startup and never touched. I think the > would (maybe?) make sense only if it were compared against the number of busy workers plus the ":zerg" ones (uwsgi.emperor_broodlord_count) from vassal_sos, see https://uwsgi-docs.readthedocs.io/en/latest/Broodlord.html

awelzel commented 5 years ago

@xrmx - in core/metrics.c the overloaded metric is registered as UWSGI_METRIC_COUNTER, so I would say incrementing overloaded is the right thing. Busy/idle workers, in contrast, are registered as UWSGI_METRIC_GAUGE.

Changing the overloaded metric may actually break existing setups - let's not?

@nickwilliams-eventbrite - just apply a derivative to your series - graphite/carbon, influxdb, prometheus ... all will allow you to do that in some way.
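
For example, in Graphite that would be something along the lines of the following (untested; the exact metric path depends on how the stats are exported, so treat it as a placeholder):

nonNegativeDerivative(uwsgi.core.overloaded)

and in Prometheus, roughly:

increase(uwsgi_core_overloaded[5m])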

nickwilliams-eventbrite commented 5 years ago

@awelzel: Lots of uWSGI's metrics are counters, but overloaded is the only one that increases to infinity. All of uWSGI's other metrics go up and down, up and down. Also, overloaded resets to zero when uWSGI restarts, so even if we wanted it to increase to infinity forever, it couldn't. overloaded is the lone wolf among uWSGI metrics and is clearly broken.

(FWIW: We use Datadog/Grafana, and no combination of functions, etc. has been able to make the graph of overloaded look like it's supposed to look.)

awelzel commented 5 years ago

@nickwilliams-eventbrite

overloaded is the lone wolf among uWSGI metrics and is clearly broken.

I disagree - requests (number of requests processed), failed_requests, respawn_count, routed_signals and even running_time all behave like this.

(I do agree the name overloaded might have been an unfortunate choice.)

When we talk about "counter metrics", they never go down by design (unless they are reset to 0).

Regarding Datadog, you should get what you want by using diff() - not tested though...

diff(<METRIC_NAME>{*})

https://docs.datadoghq.com/graphing/functions/rate/#delta-value

xrmx commented 5 years ago

@awelzel yeah, we cannot break current setups.

Probably some documentation would help, though.

BTW, diff() works fine for me :)

nickwilliams-eventbrite commented 5 years ago

diff does not work for me. It results in a wildly fluctuating graph from 250 to -250 even on an idle server, and we only have 16 servers deployed, so this number should never be less than zero or more than 16.

Indeed, I was misreading some of these metrics, which are gauges, not counters. If we're concerned about breaking existing setups and also with being consistent across our metrics, then I propose a new gauge metric, core.is_overloaded or core.overloaded.gauge. It will always be either 0 or 1. Being a gauge, it will be consistent with the core.busy_workers and core.idle_workers metrics, which are also gauges.
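
A minimal sketch of how such a gauge could sit next to the existing counter in the master loop, so current setups keep working (the gauge field name is hypothetical, and the registration in core/metrics.c as UWSGI_METRIC_GAUGE is omitted):

if (busy_workers >= (uint64_t) uwsgi.numproc) {
    ushared->overloaded++;           /* existing counter, left untouched */
    ushared->is_overloaded = 1;      /* hypothetical new gauge: currently overloaded */
} else {
    ushared->is_overloaded = 0;      /* hypothetical new gauge: not overloaded */
}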

xrmx commented 5 years ago

@nickwilliams-eventbrite I guess you have to take the negative diff after a restart into account with something like max(0, diff(overloaded)). It is correct for the value to be bigger than the number of servers, since it is incremented every time the instance is found overloaded, not just the first time. The fact that overloaded is a counter lets you see not only that your instances are overloaded but also for how long they have been.
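
In Datadog's function syntax that would be roughly the following (untested; clamp_min() plays the role of max(0, ...) and the metric name is a placeholder):

clamp_min(diff(uwsgi.core.overloaded{*}), 0)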

ngaya-ll commented 4 years ago

@nickwilliams-eventbrite Assuming you're using the uwsgi-dogstatsd plugin, since Datadog doesn't have great support for monotonic counters I recommend setting dogstatsd-all-gauges = true so the counter gets reported as a gauge. Then in the Datadog UI you can use the monotonic_diff() function with rollup(max) to graph the incremental counts.
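
For reference, a minimal uWSGI config sketch along those lines (the statsd address and metric prefix are example values, and this assumes the uwsgi-dogstatsd plugin is installed):

[uwsgi]
plugin = dogstatsd
; push stats to the local Datadog agent's dogstatsd port
stats-push = dogstatsd:127.0.0.1:8125,myapp
; report every metric as a gauge so Datadog sees absolute values
dogstatsd-all-gauges = true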