prometheus / statsd_exporter

StatsD to Prometheus metrics exporter
Apache License 2.0

Statsd-exporter counter metric type not reset #562

Closed popescuag closed 2 months ago

popescuag commented 4 months ago

I'm using statsd-exporter to translate the statsd metrics already instrumented in a legacy system to Prometheus. One of the main use cases for the counter metric type is to determine the number of concurrent requests for a given API method.

The problem is that the counter metric I've set is never reset in statsd-exporter. If I use a generic statsd daemon, I can see the value of the counter increase and decrease as the simulated requests come and go; eventually, when all requests are finished, the counter is set back to 0. If I use statsd-exporter in the same test, the counter keeps increasing and is never reset (see the attached screenshot).

Additionally, no p90/p99 quantile statistics are collected for the counter metric; from what I've seen, those are only collected for timing metrics.

Can someone please tell me whether this can be solved by some statsd-exporter setting I've missed, or whether this is indeed how the service is expected to behave?

[Screenshot attached: 2024-06-17 at 10:11:00]
matthiasr commented 2 months ago

This is deliberate – it is part of the conversion from the statsd/graphite metrics model to the Prometheus metrics model.

Graphite works with fixed time windows; it stores that "in the time window from 13:42:10 to 13:42:20, there were 12 events". This means that in the query language you are already working with event rates, and it has functions to deal with converting them to "per second" rates. OpenTelemetry calls this "delta temporality".

By contrast, Prometheus works on a cumulative counter model. It stores that "in the time leading up to 13:42:10, there were 6420521 events; in the time leading up to 13:42:20, there were 6420533 events; …". OpenTelemetry calls this "cumulative temporality". In PromQL, you use the rate (or increase) functions to turn this back into a per-second (or per-time-window) rate. In your case, rate(hb_tspent_count[5m]) is what becomes zero when events stop happening.
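For illustration, the queries would look something like the following; the metric name hb_tspent_count is taken from your screenshot, and the 5m window is just an example:

```promql
# Per-second event rate over the last 5 minutes.
# This drops back to 0 once events stop arriving.
rate(hb_tspent_count[5m])

# Absolute number of events in the last 5 minutes, which is the closest
# equivalent to what a Graphite-style delta counter reports per window.
increase(hb_tspent_count[5m])
```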

Note that in neither case do you get the number of concurrent requests – you get the number of requests in a time window, but you don't know how many of those were actually concurrent. In the example you give, the average time spent appears to be 0.00018 in whatever unit you're reporting in (it should be seconds, but that seems implausibly low). In PromQL, assuming you are reporting in seconds, you can calculate the actual average concurrency with rate(hb_tspent_sum[5m]) (think of this as "time spent responding to requests per second of wall-clock time", or an application of Little's Law).
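As a sketch, assuming your timer is exported as a summary with the series hb_tspent_sum and hb_tspent_count (the names from your screenshot) and the timings are in seconds:

```promql
# Average concurrency via Little's Law (L = λ · W):
# arrival rate in requests/second (λ) times average seconds per request (W).
rate(hb_tspent_count[5m]) * (rate(hb_tspent_sum[5m]) / rate(hb_tspent_count[5m]))

# Algebraically this simplifies to: seconds spent handling requests
# per second of wall-clock time.
rate(hb_tspent_sum[5m])
```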