prometheus / docs

Prometheus documentation: content and static site generator
https://prometheus.io
Apache License 2.0

Needs example of cluster wide inhibition #1353

Open jpds opened 5 years ago

jpds commented 5 years ago

The documentation here states that inhibition can be used to suppress the same alert coming from an entire cluster:

However, none of the examples I can find online show how this can be done easily in either of these places:

beorn7 commented 5 years ago

Yes, that's a nice one.

The inhibition rule is actually quite easy:

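inhibit_rules:
# Any firing ClusterIsDown alert acts as the source.
- source_match:
    alertname: 'ClusterIsDown'
  # No target matchers are set, so every alert with the same 'cluster' value is a target.
  equal: ['cluster']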

It works just fine with our self-inhibition prevention. However, it contradicts the recommendation given in https://prometheus.io/docs/alerting/configuration/#inhibit_rule : “However, we recommend to choose target and source matchers in a way that alerts never match both sides.” Is there a better way to write the inhibition rule? @stuartnelson3 @brian-brazil

It's kind of weird that the first example use case we list can only be solved by not following a recommendation we give later.

brian-brazil commented 5 years ago

I'd usually have some severity label on both sides, as often you'd want a class of alerts to do this rather than one particular alertname.
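A minimal sketch of what this suggestion might look like, assuming alerts carry a `severity` label with the hypothetical values `critical` and `warning` and a shared `cluster` label. Because the source and target matchers select different severity values, no alert can match both sides, which is in line with the recommendation quoted above:

```yaml
# Alertmanager configuration fragment (sketch; label values are assumptions).
inhibit_rules:
# Source side: any critical alert.
- source_match:
    severity: 'critical'
  # Target side: warning alerts only, so source and target never overlap.
  target_match:
    severity: 'warning'
  # Inhibit only when both alerts carry the same 'cluster' label value.
  equal: ['cluster']
```

Newer Alertmanager releases also accept `source_matchers` and `target_matchers`; the older `source_match` and `target_match` keys are used here to stay consistent with the snippet above.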

jpds commented 5 years ago

@beorn7 Thanks for the snippet!

In our particular use case, we have "clusters" of devices in the field (with multiple exporters running), connected to the Internet by various, sometimes unreliable, means, and I don't think severity labels fit what we're trying to prevent.

For example, when a 4G connection or router fails for a group of devices in a particular area, we do not want to be flooded with notifications for every device and alert. A single "something fell off the Internet" notification is all we want.

Would a good example for the documentation be an alert on blackbox_exporter's ICMP probe_success == 0 metric, combined with an inhibition rule?
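One way such an example could look is sketched below. Apart from `probe_success`, the names are assumptions: the job name `blackbox_icmp`, the alert name `ClusterUnreachable`, the group name, the `for` duration, and the presence of a `cluster` label on both the probe target and the device alerts are all hypothetical.

```yaml
# Prometheus rule file (sketch; group, alert, and job names are hypothetical).
groups:
- name: connectivity
  rules:
  - alert: ClusterUnreachable
    # probe_success is 0 when the blackbox_exporter probe fails.
    expr: probe_success{job="blackbox_icmp"} == 0
    for: 5m
    annotations:
      summary: 'Cluster {{ $labels.cluster }} fell off the Internet'
---
# Alertmanager configuration fragment.
inhibit_rules:
# A firing ClusterUnreachable alert inhibits every other alert
# that carries the same 'cluster' label value.
- source_match:
    alertname: 'ClusterUnreachable'
  equal: ['cluster']
```

The inhibition only takes effect if the probed target and the devices behind it share the same `cluster` label value, for example through target labels in the scrape configuration.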

jpds commented 5 years ago

I have a feeling that I should label my devices as router or end-device and add an inhibition rule so that, if the router is down, we don't alert on the end devices...
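A rough sketch of that idea, assuming a hypothetical `role` label with the values `router` and `end-device`, a hypothetical `RouterDown` alert, and a `cluster` label shared by a router and the devices behind it:

```yaml
# Alertmanager configuration fragment (sketch; label and alert names are assumptions).
inhibit_rules:
# Source side: the router of a cluster is reported down.
- source_match:
    alertname: 'RouterDown'
    role: 'router'
  # Target side: alerts from end devices only, so the router alert never matches both sides.
  target_match:
    role: 'end-device'
  # Apply only within the same cluster.
  equal: ['cluster']
```

With disjoint `role` values on the two sides, the rule does not need to rely on self-inhibition prevention.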