slok / sloth

🦥 Easy and simple Prometheus SLO (service level objectives) generator
https://sloth.dev
Apache License 2.0

Expressions should produce continuous data during low or zero traffic #440

Open clux opened 1 year ago

clux commented 1 year ago

The generated SLIs currently do not produce smooth graphs in Grafana or Prometheus when traffic is low or data is missing, but they could easily do so with a couple of minor additions.

The two cases:

When there have been no recent errors, the numerator in error_query / total_query is often absent because users have not initialised their error metrics to zero. This can be handled by adding or on() vector(0) to the numerator (or across the whole fraction); however, this fix does not work when there is also no traffic.

If there is no traffic, the denominator in that query is zero (at least if the metrics are properly initialised). The division then yields NaN in Prometheus, which shows up as a gap (i.e. effectively missing data), and in Grafana it is even worse because zero division actually yields something pretty buggy ( https://github.com/grafana/grafana/issues/59349 ). At any rate, restricting the denominator explicitly to non-zero values lets us default the undefined and the missing parts alike, and gives us a smooth default in both Prometheus and Grafana:

(error_query / (total_query > 0)) or on() vector(0)

I.e. it should be a fairly easy thing to add to sloth. We avoid dividing by zero and return an absent metric instead (when total_query returns zero), so the fallback kicks in. This catches both the case where either metric is uninitialised and the case where the expression is zero over zero.
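To make the three cases concrete, here is a rough Python model of the two expressions. It is only a sketch under simplifying assumptions: each query is reduced to a single value, `None` stands in for an absent series, and the function names `naive_ratio` and `guarded_ratio` are made up for illustration.

```python
import math
from typing import Optional


def naive_ratio(errors: Optional[float], total: Optional[float]) -> Optional[float]:
    """Models error_query / total_query. None means the series is absent."""
    if errors is None or total is None:
        # A binary operator yields no sample when either side is missing.
        return None
    if total == 0:
        # Float division as in PromQL: 0/0 is NaN, x/0 is +/-Inf.
        return math.nan if errors == 0 else math.copysign(math.inf, errors)
    return errors / total


def guarded_ratio(errors: Optional[float], total: Optional[float]) -> float:
    """Models (error_query / (total_query > 0)) or on() vector(0)."""
    # The "> 0" filter drops a zero denominator, turning it into an absent series.
    filtered_total = total if total is not None and total > 0 else None
    if errors is None or filtered_total is None:
        ratio = None
    else:
        ratio = errors / filtered_total
    # "or on() vector(0)" supplies 0 only when the left-hand side is empty.
    return 0.0 if ratio is None else ratio


if __name__ == "__main__":
    cases = {
        "uninitialised error metric": (None, 10.0),
        "zero traffic": (0.0, 0.0),
        "normal traffic": (5.0, 100.0),
    }
    for name, (errors, total) in cases.items():
        print(name, naive_ratio(errors, total), guarded_ratio(errors, total))
```

In this model the naive expression gives an absent value or NaN for the first two cases, while the guarded one returns 0 for both and leaves the normal case unchanged.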

WDYT? Would you be open to a change like this?

clux commented 1 year ago

I have updated the issue a bit to clarify that this is not just a Grafana display issue (though it is worse in Grafana), but about producing a continuous SLI output even when traffic is low or zero.

zhdanovartur commented 1 year ago

Hi, @clux. I had the same question and solved it with a raw query. Perhaps this can also help you:

- name: "requests-availability"
  objective: 95
  sli:
    raw:
      errorRatioQuery: |
        (
          (sum(rate(istio_requests_total{reporter="source", destination_service="app", response_code=~"5.."}[{{.window}}])))
          /
          (sum(rate(istio_requests_total{reporter="source", destination_service="app"}[{{.window}}])) > 0)
        ) OR on() vector(0)
  alerting:
    name: high_error_rate
    labels:
      category: "availability"

clux commented 1 year ago

Ah, good to see it is possible. I was hoping that this type of thing could perhaps be defaulted within sloth though, so that not every user would have to discover this.
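As a rough illustration of what such a default could look like, the wrapping is a plain string transformation over the two queries. The sketch below is in Python for brevity and uses a hypothetical helper `guarded_ratio_expr`; it is not sloth's actual code or API (sloth itself is written in Go), and the metric names in the usage line are invented.

```python
def guarded_ratio_expr(error_query: str, total_query: str) -> str:
    """Build an error-ratio expression that stays defined with low or zero traffic.

    Hypothetical helper: parenthesises both queries, filters the denominator
    to non-zero values, and falls back to 0 when the ratio is absent.
    """
    error_query = error_query.strip()
    total_query = total_query.strip()
    return f"(({error_query}) / (({total_query}) > 0)) or on() vector(0)"


if __name__ == "__main__":
    # Illustrative metric names only.
    print(
        guarded_ratio_expr(
            "sum(rate(errors_total[5m]))",
            "sum(rate(requests_total[5m]))",
        )
    )
```

Wrapping each query in its own parentheses keeps operator precedence from changing the meaning when a user-supplied query contains binary operators of its own.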