Time window in metric name

cazeaux commented 2 years ago

Hello

Sloth has made the choice to add the time window in the metric name, like slo:sli_error:ratio_rate2h. But when I use sloth to manage a large amount of SLI, with different SLO and time windows, it appears that having the time window complexifies the management of the metrics.

Problem statement

With a large number of SLI/SLO with different time window, we arrive to a situation where some metrices are not applicable to all SLI, and it is impossible to learn from the metrics which can be used or not.

To simplify, consider the case

SLI1 with SLO windows 2h + 4h
SLI2 with SLO windows 4h + 1d

It will generate 3 metrics:

slo:sli_error:ratio_rate2h
slo:sli_error:ratio_rate4h
slo:sli_error:ratio_rate1d

I have no way to know that SLI1 has only 2h and 4h by looking at the metrics only. And I have no way to request label values to know the windows available for a given SLI.

Proposition

I believe that the choice has been done based on prometheus rules best practices, but I think that there is a kind of inconsistence.

According to the practice, a metrice should be named level:metric:operations, where the metric is the name of the metric on which the operation is made. So it makes sense to have the time window in the metric name when it is clearly linked to a given metric, when for another metric we could have other time windows. But for sloth, the metric is always sli_error so we lose part of the practice rationale, as we are mixing a generic part (the metric) and a specific part (the time window).

So I can propose two solutions:

The most simple: simply remove the time window to keep consistent metric names. Because the label sloth_window is already set, we have nothing to add. Furthermore, thanks the the generic name it becomes possible to learn the available windows with label_values(slo:sli_error:ratio_rate{sloth_service="my-service", sloth_slo="my-slo"}, sloth_window) request.

Maybe more complex, but more compliant to the prometheus pracices: we replace sli_error by a name pointing on the SLI itself. Thus by construction the time window in the metric name remain specific to the SLI. As name, we can use the sloth_service + sloth_slo. For example: slo:my_service:my_slo:ratio_rate2h This solution still requires to add a way to discover the available windows.

What is your opinion on that ?

slok commented 2 years ago

Hi @cazeaux

Thanks for the issue.

Indeed, I decided to break the best practices of Prometheus recording rules in favor of UX. One of the main problems that I saw in all Prometheus SLO-related implementations was this, so... I decided to break the rule and have an easier-to-use and discovery.

Why it's inconsistent? well, mainly because at the beginning Sloth only had the 30d window support and I thought that would be enough in favor of simplicity, however, after the huge success of Sloth and the different Sloth user use cases, I decided to add support for more windows. And here we are! so there isn't a good reason for it, just historical reasons (and now kind of tech debt).

So, you are right, we should make something about it, but I'm postponing this because this will break all Sloth installations...

The way I wanted to fix this is exactly the one that you propose first (the most simple), so we are aligned in this sense :smile:

cazeaux commented 2 years ago

Hello

To avoid breaking existing installations, can we add this as an option (like it was done for the optimized rules) ?

tokheim commented 9 months ago

Just want to offer you a partial solution that might help. Metric names are actually treated as labels in prometheus. So you can do an expression like

{__name__=~"slo:sli_error:ratio_rate.*",sloth_service="my-service", sloth_slo="my-slo"}

to pick up all the windows for a given slo.

slok / sloth

Time window in metric name #326

Problem statement

Proposition