Open mmazur opened 2 years ago
Agreed. This seems to be just a different way of averaging ratios, as far as I can tell.
The missing information is the number of requests in each 5m period. Without that, a 5-minute period with 1 error in 10 requests (10% error rate) is treated the same as a 5-minute period with 1,000 errors in 10,000 requests (also a 10% error rate). But the 1,000 errors should contribute significantly more to the overall 30-day error rate than the 1 error.
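To make the weighting point concrete, here is a minimal sketch in plain Python (not PromQL, and not sloth's code). The two windows and their request counts are made-up numbers chosen so that the error rates differ; the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical (errors, requests) pairs for two 5m windows:
# a quiet window at 10% errors and a busy window at 1% errors.
windows = [(1, 10), (100, 10_000)]


def mean_of_ratios(windows):
    """Average the per-window error ratios, ignoring request volume."""
    ratios = [Fraction(errors, requests) for errors, requests in windows]
    return sum(ratios) / len(ratios)


def ratio_of_sums(windows):
    """Divide total errors by total requests, so busy windows weigh more."""
    total_errors = sum(errors for errors, _ in windows)
    total_requests = sum(requests for _, requests in windows)
    return Fraction(total_errors, total_requests)


unweighted = mean_of_ratios(windows)  # 11/200, i.e. 5.5%
weighted = ratio_of_sums(windows)     # 101/10010, i.e. about 1.01%

print(f"mean of ratios: {float(unweighted):.4%}")
print(f"ratio of sums:  {float(weighted):.4%}")
```

The quiet window pulls the unweighted mean up to 5.5%, even though only about 1% of all requests failed. Getting the second number requires keeping errors and requests as separate series, which is exactly the information a stored ratio has already lost.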
The (default) optimized `slo:sli_error:ratio_rate30d` uses an expression of `sum_over_time() / count_over_time()`. This is following 9cd31771, which changed it from `avg_over_time()`.
I'm very confused on what the difference is. The definition of an arithmetic average (mean) is `sum() / count()`, so unless there's something unusual in prom's implementation of these functions, I would expect the two expressions to be equivalent.
Prom's best practices on recording rules does mention:
But sloth does not preserve either the numerator or denominator, therefore doing that is not possible.
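As a sanity check on the arithmetic claim only, the sketch below compares `sum / count` with a library mean over the same samples. It uses exact rationals in plain Python and says nothing about how Prometheus implements `avg_over_time()` or about floating-point behaviour; the sample values are hypothetical.

```python
from fractions import Fraction
from statistics import mean

# Hypothetical 5m error ratios already stored as a single series.
samples = [Fraction(1, 10), Fraction(1, 100), Fraction(1, 20)]

# sum_over_time() / count_over_time(), expressed as plain arithmetic.
via_sum_count = sum(samples) / len(samples)

# avg_over_time(), expressed as an arithmetic mean.
via_mean = mean(samples)

print(via_sum_count, via_mean)
```

Over the same set of stored ratio samples, both forms give the same unweighted mean, so neither recovers the per-window request counts.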