slok / sloth

🦥 Easy and simple Prometheus SLO (service level objectives) generator
https://sloth.dev
Apache License 2.0

How is `sum_over_time() / count_over_time()` different than `avg_over_time()`? #354

Open mmazur opened 2 years ago

mmazur commented 2 years ago

The (default) optimized `slo:sli_error:ratio_rate30d` uses the expression `sum_over_time() / count_over_time()`. This follows 9cd31771, which changed it from `avg_over_time()`.

I'm very confused about what the difference is. The definition of an arithmetic average (mean) is `sum() / count()`, so unless there's something unusual in Prometheus's implementation of these functions, I would expect the two expressions to be equivalent.
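To make the claimed equivalence concrete, here is a minimal Python sketch of the arithmetic only. It does not model Prometheus's actual implementation; the sample values and the helper names `sum_over_count` and `avg` are made up for illustration.

```python
import math
from statistics import fmean

# Hypothetical 5m error-ratio samples, standing in for the points
# that a range selector would return over the lookback window.
ratios = [0.0, 0.1, 0.02, 0.5, 0.0]

def sum_over_count(samples):
    # Analogue of sum_over_time() / count_over_time().
    return sum(samples) / len(samples)

def avg(samples):
    # Analogue of avg_over_time(): the arithmetic mean.
    return fmean(samples)

# The two agree up to floating-point rounding.
print(math.isclose(sum_over_count(ratios), avg(ratios)))
```

Both helpers average the same already-divided ratios, so neither one recovers the per-window request counts.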

Prometheus's best practices on recording rules do mention:

> When aggregating up ratios, aggregate up the numerator and denominator separately and then divide. Do not take the average of a ratio or average of an average as that is not statistically valid.

But sloth does not preserve either the numerator or the denominator, so doing that is not possible.

ThomWright commented 8 months ago

Agreed. This seems to be just a different way of averaging ratios, as far as I can tell.

The missing information is the number of requests in each 5m period. Without that, a 5-minute period with 1 error in 10 requests (10% error rate) will be treated equally to a 5-minute period with 1,000 errors in 10,000 requests (also a 10% error rate). But the 1,000 errors should contribute significantly more to the overall 30 day error rate than the 1 error.
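A small Python sketch of that weighting problem, assuming each 5m window is represented as an `(errors, total_requests)` pair. The `same_rate` windows are the ones from the comment above; the `mixed` windows and the function names are hypothetical, added only to show a case where the two calculations diverge.

```python
# Each tuple is (errors, total_requests) for one 5m window.

def mean_of_ratios(windows):
    # What averaging stored ratios does: every window counts equally,
    # regardless of how many requests it saw.
    ratios = [errors / total for errors, total in windows]
    return sum(ratios) / len(ratios)

def ratio_of_sums(windows):
    # Aggregate numerator and denominator separately, then divide,
    # so each window is weighted by its request volume.
    total_errors = sum(errors for errors, _ in windows)
    total_requests = sum(total for _, total in windows)
    return total_errors / total_requests

# Both windows have a 10% error rate, so the two methods agree here.
same_rate = [(1, 10), (1000, 10000)]

# Hypothetical: a quiet window at 50% errors, a busy window at 1%.
mixed = [(5, 10), (100, 10000)]

print(mean_of_ratios(same_rate), ratio_of_sums(same_rate))
print(mean_of_ratios(mixed), ratio_of_sums(mixed))
```

For `mixed`, the mean of ratios comes out at about 25.5%, while the ratio of sums is about 1.05%, because the quiet window is given the same weight as the busy one.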