metalmatze / slo-libsonnet

Generate Prometheus alerting & recording rules and Grafana dashboards for your SLOs.
https://promtools.dev
Apache License 2.0

Wrong alert rules defined for latency #43

Open wei-lee opened 3 years ago

wei-lee commented 3 years ago

At the moment, the generated alert rule for latency looks something like this:

latencytarget:http_request_duration_seconds:rate1h{job="prometheus",latency="0.10000000000000001"} > (14.4*1.000000)

The corresponding recording rule for this alert is something like this:

1 - (
  sum(rate(http_request_duration_seconds_bucket{job="prometheus",le="0.10000000000000001",code!~"5.."}[1h]))
  /
  sum(rate(http_request_duration_seconds_count{job="prometheus"}[1h]))
)

This expression is 1 minus a ratio, so the value of the recording rule can never be greater than 1, while the alert threshold is 14.4 * 1.0 = 14.4. That means the alert will never fire.

If my understanding is correct, we should either multiply the recording rule by 100, or change 1 to 0.01 in the alerts.
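
To make the arithmetic concrete, here is a minimal Python sketch. It is not code from slo-libsonnet or promtools.dev: the function names are made up for illustration, the 14.4 factor and the 1.000000 budget are copied from the alert expression quoted above, and the request counts are invented sample values.

```python
# Hypothetical helper names; only the 14.4 factor and the budget values
# (1.0 and 0.01) come from the issue text.
BURN_RATE_FACTOR = 14.4


def alert_threshold(latency_budget: float) -> float:
    """Right-hand side of the alert: factor times the templated budget."""
    return BURN_RATE_FACTOR * latency_budget


def slow_ratio(fast_requests: float, total_requests: float) -> float:
    """Mirror of the recording rule: 1 - fast / total, a value in [0, 1]."""
    return 1 - fast_requests / total_requests


def can_fire(latency_budget: float) -> bool:
    """The ratio tops out at 1, so the alert is reachable only if the threshold is below 1."""
    return alert_threshold(latency_budget) < 1.0


# Worst case: none of the (invented) 50 requests finished within the latency target.
worst_case = slow_ratio(fast_requests=0.0, total_requests=50.0)

print("worst-case ratio:", worst_case)
print("budget 1.0  -> threshold", alert_threshold(1.0), "reachable:", can_fire(1.0))
print("budget 0.01 -> threshold", alert_threshold(0.01), "reachable:", can_fire(0.01))
```

Under these assumptions, a budget of 1.0 puts the threshold above the largest value the ratio can take, whereas a budget of 0.01 brings it back into range. Multiplying the recording rule by 100 instead would have the same effect on the comparison.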

brancz commented 3 years ago

I think you might have misconfigured the latencyBudget, as the value in the alerting rule is already templated.

rporres commented 3 years ago

Then maybe the problem is in the code that generates the rules at https://promtools.dev/alerts/latency, since that is what @wei-lee used to generate the alerts. @metalmatze should know better 😄

metalmatze commented 3 years ago

Yes, that's probably a problem with promtools.dev itself. If you want to take a look and find the problem, the code is here: https://github.com/metalmatze/promtools.dev/blob/master/main.go Otherwise I can fix it as part of another update I have already been working on anyway :)

rporres commented 3 years ago

The problem does indeed look related to promtools.dev and the latencyBudget, as pointed out by @brancz. There's a proposed fix in https://github.com/metalmatze/promtools.dev/pull/13

rporres commented 3 years ago

This can be safely closed.

tbuchier commented 3 years ago

The issue is still present on https://promtools.dev/alerts/latency