Open cazeaux opened 2 years ago
Hi @cazeaux
Thanks for the issue.
Indeed, I decided to break the best practices of Prometheus recording rules in favor of UX. One of the main problems that I saw in all Prometheus SLO-related implementations was this, so... I decided to break the rule and have an easier-to-use and discovery.
Why it's inconsistent? well, mainly because at the beginning Sloth only had the 30d window support and I thought that would be enough in favor of simplicity, however, after the huge success of Sloth and the different Sloth user use cases, I decided to add support for more windows. And here we are! so there isn't a good reason for it, just historical reasons (and now kind of tech debt).
So, you are right, we should make something about it, but I'm postponing this because this will break all Sloth installations...
The way I wanted to fix this is exactly the one that you propose first (the most simple), so we are aligned in this sense :smile:
Hello
To avoid breaking existing installations, can we add this as an option (like it was done for the optimized rules) ?
Just want to offer you a partial solution that might help. Metric names are actually treated as labels in prometheus. So you can do an expression like
{__name__=~"slo:sli_error:ratio_rate.*",sloth_service="my-service", sloth_slo="my-slo"}
to pick up all the windows for a given slo.
Hello
Sloth has made the choice to add the time window in the metric name, like
slo:sli_error:ratio_rate2h
. But when I use sloth to manage a large amount of SLI, with different SLO and time windows, it appears that having the time window complexifies the management of the metrics.Problem statement
With a large number of SLI/SLO with different time window, we arrive to a situation where some metrices are not applicable to all SLI, and it is impossible to learn from the metrics which can be used or not.
To simplify, consider the case
It will generate 3 metrics:
I have no way to know that SLI1 has only 2h and 4h by looking at the metrics only. And I have no way to request label values to know the windows available for a given SLI.
Proposition
I believe that the choice has been done based on prometheus rules best practices, but I think that there is a kind of inconsistence.
According to the practice, a metrice should be named
level:metric:operations
, where themetric
is the name of the metric on which the operation is made. So it makes sense to have the time window in the metric name when it is clearly linked to a given metric, when for another metric we could have other time windows. But for sloth, themetric
is alwayssli_error
so we lose part of the practice rationale, as we are mixing a generic part (the metric) and a specific part (the time window).So I can propose two solutions:
The most simple: simply remove the time window to keep consistent metric names. Because the label
sloth_window
is already set, we have nothing to add. Furthermore, thanks the the generic name it becomes possible to learn the available windows withlabel_values(slo:sli_error:ratio_rate{sloth_service="my-service", sloth_slo="my-slo"}, sloth_window)
request.Maybe more complex, but more compliant to the prometheus pracices: we replace
sli_error
by a name pointing on the SLI itself. Thus by construction the time window in the metric name remain specific to the SLI. As name, we can use the sloth_service + sloth_slo. For example:slo:my_service:my_slo:ratio_rate2h
This solution still requires to add a way to discover the available windows.What is your opinion on that ?