
Inhibit Rules should be able to consider different labels in their equal statement #2254

Open debugloop opened 4 years ago

debugloop commented 4 years ago

While the title basically says it all, I will try to back this up with a concrete example. Imagine a setup where, in addition to all routers being targets for some type of blackbox_exporter-style metrics, an additional source of data is used to generate JSON target lists for file_sd.

In my example, this additional source could be a backup of the routers' configuration, which gives us a definitive list of the link addresses configured on any interface of any router, as well as any configured metadata for each interface. It is trivial to build a JSON file containing the targets (all locally configured link addresses) with the relevant labels.
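For illustration only, using the label values from the metrics shown below, such a generated file_sd file might look like this:

[
  {
    "targets": ["10.0.0.1"],
    "labels": {"hostname": "r2", "interface": "Te0/7/0/12", "remote": "r1"}
  },
  {
    "targets": ["10.0.0.2"],
    "labels": {"hostname": "r1", "interface": "Te0/2/0/1", "remote": "r2"}
  }
]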

These metrics could end up looking like this (10.0.0.0/8 are link addresses, 192.168.0.0/16 are loopbacks):

# job: ping-router-loopback
probe_success{instance="192.168.0.1", hostname="r1"} 0
probe_success{instance="192.168.0.2", hostname="r2"} 1

# job: ping-router-interface
probe_success{instance="10.0.0.1", hostname="r2", interface="Te0/7/0/12", remote="r1"} 0
probe_success{instance="10.0.0.2", hostname="r1", interface="Te0/2/0/1", remote="r2"} 0

Alerting in the most obvious way would create alerts named after the jobs, for instance a RouterDown alert with the expression probe_success{job="ping-router-loopback"} == 0.
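A minimal sketch of both alerting rules (the InterfaceDown expression is assumed here, mirroring the RouterDown one):

groups:
- name: router-alerts
  rules:
  - alert: RouterDown
    expr: probe_success{job="ping-router-loopback"} == 0
  - alert: InterfaceDown
    expr: probe_success{job="ping-router-interface"} == 0

With these alerts in place, I would obviously want the following inhibition rule: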

- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
  equal:
  - hostname

This would inhibit the alert informing me that an interface is down on a router which is already alerting as down itself. I would, however, like to go one step further with an inhibit rule such as one of the following, since it comes as no surprise that any interface adjacent to the downed router will go down in turn, even though it is on another router, in another region, or wherever.

# option A
- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
  equal:
  - source_label: hostname
    target_label: remote

# option B, which would make the original `equal` kind of unnecessary
# by using `hostname: $hostname` for instance
- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
    remote: "$hostname"

While the proposed syntax variants are just a general idea, I feel that this should be possible in some way that does not involve hacking around with the underlying alert expressions.

brian-brazil commented 4 years ago

If you wish to do something intricate like this, why not adjust your labels using alert_relabel_configs?

debugloop commented 4 years ago

Yes, we thought about something along these lines:

- source_labels: [hostname]
  regex: (.+)  # only act when the label is present
  target_label: inhibit_marker
- source_labels: [remote]
  regex: (.+)  # on alerts that carry a remote label, this overrides the above
  target_label: inhibit_marker
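The inhibit rule would then simply compare the helper label on both sides (a sketch, using the inhibit_marker name from above):

- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
  equal:
  - inhibit_marker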

However, this leaves people wondering what the meaning of this label on the alert is. I cannot map one label onto the other, as both labels are needed, hence the need for a helper label such as inhibit_marker (or whatever, suggestions welcome). Maybe I am overlooking a relabeling possibility, though.

It is also somewhat error-prone, as I don't think I can limit the action to a specific type of alert (maybe by using a chain of labelmap rules and __-prefixed labels? I haven't checked yet; see the sketch below).
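One way that might limit the rewrite to specific alerts, sketched here untested, is to include alertname in the source labels so the regex only matches the intended alert:

- source_labels: [alertname, hostname]
  regex: RouterDown;(.+)
  target_label: inhibit_marker
  replacement: $1
- source_labels: [alertname, remote]
  regex: InterfaceDown;(.+)
  target_label: inhibit_marker
  replacement: $1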

I feel that checking equality of two different labels is a very natural feature for inhibit rules, independent of the intricacy of my example. At the very least, the relabeling alternative should be noted in the docs next to the equal statement.

brian-brazil commented 4 years ago

Label names are meant to have one specific meaning; if you find yourself trying to match label name A with label name B by any means, be that PromQL or inhibition, that implies something may not be quite right with your label taxonomy. Inhibition is also meant more for very blunt prevention of alerts, such as when an entire datacenter network has gone kaput. Trying to use it for something this fine-grained hints to me that there's an attempt at cause-based rather than symptom-based alerting.

debugloop commented 4 years ago

Both labels have one specific meaning. On a RouterDown alert, hostname identifies the router (since instance labels are used for the IPv4 and IPv6 addresses). On an InterfaceDown alert, hostname does the same, but remote additionally serves as a marker of dependency. While it is true that having the option might invite cause-based alerts, my example is strictly symptom-based, I think.

It really is not that intricate either; it is basically the same as the very blunt prevention you've described. If an entire datacenter network goes kaput, the links leading there will go down and will thus trigger alerts I intend to manage using the Alertmanager.

In my opinion, having Prometheus sort this out via the relabel_configs described above just creates an additional place to configure alert management, outside of the Alertmanager. It also seems somewhat hacky to me to create labels just to match on them, when there could be a mechanism that achieves the same result without crutches while being succinct, to the point, and in the software you'd expect it in.

brian-brazil commented 4 years ago

A router or interface going down is a cause, not a symptom. A symptom would be users no longer being able to get to the website behind the router.

debugloop commented 4 years ago

We are a service provider; keeping customer interfaces online is the only symptom.

The cause could be:

brian-brazil commented 4 years ago

As explained above, what you want is already possible with existing features. If you try to make cause-based alerting work, you have to expect it to take extra work.

debugloop commented 4 years ago

I understand your reasoning in not wanting to add unnecessary features and can accept you saying so, although I find the existing solutions cumbersome for the reasons outlined above.

At the same time, I'm not sure you considered the actual points I was making, because you keep trying to erode the validity of my use case instead of addressing said points. Could you explain what you think a symptom is in a service provider setting, if not the availability of a customer interface? And why do you think suppressing potentially tens to hundreds of InterfaceDown (and thus CustomerDown) alerts when their upstream router is down is too intricate for an inhibition rule?

krzee commented 2 years ago

I would like this too. Here's my use case: I have a global full mesh, with custom metrics tracking the last keepalive across the mesh, where instance is the node reporting the metric and peer is the node on the other side of the connection. I have an alert for when the same peer has more than 5 instances alerting on it. When that fires, I want to inhibit the individual alerts whose instance matches the inhibited peer. Right now I can inhibit the single alerts on the peer, but I would also like to inhibit any alert whose instance is already alerting as the peer in the >5 alert. That is, the instance label would have to equal peer for this inhibition, as sketched below.
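In the option A syntax proposed above, this might look as follows (the alert names are made up for the sketch):

- source_match:
    alertname: MeshPeerDown        # the ">5 instances per peer" aggregate alert
  target_match:
    alertname: MeshKeepaliveStale  # the per-connection alert
  equal:
  - source_label: peer
    target_label: instance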

genofire commented 1 year ago

No solution for this problem? ;(

I have alerts for all three of the following and would like to see only the one for r1:

probe_success{instance="10.0.0.1", hostname="r1"}
probe_success{instance="10.0.1.1", hostname="r2", remote="r1"}
probe_success{instance="10.0.2.1", hostname="r3", remote="r2"}

fatpat commented 4 months ago

:+1:

We have the same kind of use case; it would be far easier than the workaround we currently have to use.

I'd like to be able to compare different labels, but also to compare part of a label extracted with a regex.

Something like this (the syntax is just an idea):

   - source_matchers:
       - alertname="ServerDown"
       - host=~"(.+)"
     target_matchers:
       - alertname="kafka_stream_not_consuming"
       - kafka_stream=~"my_stream_with_a_name_including_(.+)_a_host_identifier"
     equality:
       - source: host
         target: kafka_stream_$1