debugloop opened 4 years ago
If you wish to do something intricate like this, why not adjust your labels using alert_relabel_configs?
Yes, we thought about something along these lines:
```yaml
- source_labels: [hostname]
  target_label: inhibit_marker
- source_labels: [remote]
  target_label: inhibit_marker
```
However, this leaves people wondering what the meaning of this label on the alert is. I cannot map one label to the other, as both labels are needed, hence the need for a helper label such as `inhibit_marker` (or something; suggestions welcome). Maybe I am overlooking a possibility of relabeling, though.
It is also somewhat error prone, as I don't think I can limit the action to a specific type of alert (maybe by using a chain of `labelmap` rules and `__`-prefixed labels? I haven't checked yet).
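Restricting the relabeling to one alert type does look possible by matching on `alertname` together with the label to copy. A sketch, assuming the `RouterDown`/`InterfaceDown` alert names used later in this thread, placed under `alerting` in `prometheus.yml`:

```yaml
alerting:
  alert_relabel_configs:
    # copy `hostname` to `inhibit_marker`, but only on RouterDown alerts;
    # source labels are joined with the default ";" separator before matching
    - source_labels: [alertname, hostname]
      regex: RouterDown;(.+)
      target_label: inhibit_marker
      replacement: $1
    # copy `remote` to `inhibit_marker`, but only on InterfaceDown alerts
    - source_labels: [alertname, remote]
      regex: InterfaceDown;(.+)
      target_label: inhibit_marker
      replacement: $1
```

An inhibit rule could then use `equal: ['inhibit_marker']`, at the cost of the extra marker label described above.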
I feel that checking the equality of two different labels is a very natural feature for inhibit rules, independent of the intricacy of my example. At the very least, the relabeling alternative should be noted in the docs beside the `equal` statement.
Label names are meant to have one specific meaning; if you find yourself trying to match label name A with label name B via any means, be that PromQL or inhibition, that implies that something may not be quite right with your label taxonomy. Inhibition is also meant more for very blunt prevention of alerts, such as when an entire datacenter network has gone kaput. Trying to use it for something this fine-grained would hint to me that there's an attempt at cause-based rather than symptom-based alerting.
Both labels have one specific meaning. On a RouterDown alert, the `hostname` identifies the router (as instance labels are used for IPv4 and IPv6). On an InterfaceDown alert, the `hostname` does the same, but additionally the `remote` serves as a marker of dependency. While it is true that having the option might invite cause-based alerts, my example is strictly symptom-based, I think.
It really is not so intricate either; it is basically the same as the very blunt prevention you've described. If an entire datacenter network goes kaput, the links leading there will go down and will thus trigger alerts I intend to manage using the Alertmanager.
In my opinion, having Prometheus sort this out using the `relabel_configs` described above just creates an additional place of configuration for alert management, outside the Alertmanager. In addition, it seems somewhat hacky to create labels just to match on, when there could be a mechanism that easily achieves the same result without crutches while being succinct, to the point, and in the software you'd expect it in.
A router or interface going down is a cause, not a symptom. A symptom would be users no longer being able to get to the website behind the router.
We are a service provider; keeping customer interfaces online is the only symptom.
The cause could be:
As explained above, what you want is already possible with existing features; if you try to make cause-based alerting work, you have to expect it to take extra work.
I understand your reasoning in not wanting to add unnecessary features and can accept you saying so, although I find these existing solutions cumbersome for the reasons outlined above.
At the same time, I'm not sure you considered the actual points I was making, because you keep trying to erode the validity of my use case instead of addressing said points. Could you explain to me what you think a symptom is in a service-provider setting, if not the availability of a customer interface? Why do you think suppressing potentially tens to hundreds of InterfaceDown (and thus CustomerDown) alerts when their upstream router is down is too intricate an inhibition rule?
I would like this too. Here's my use case: I have a global full mesh, with custom metrics of the last keepalive across the mesh; `instance` is the node reporting the metric and `peer` is the node on the other side of the connection. I have an alert for when the same `peer` has more than 5 instances alerting on it. When that fires, I want to inhibit the individual alerts whose `instance` matches the inhibited `peer`. I can already inhibit the single alerts on the `peer`, but I would also like to inhibit any alert whose `instance` is already alerting as the `peer` of the >5 alert: the alert's `instance` label would have to equal the source's `peer` for this inhibition.
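As a rough sketch of that setup (the metric name `mesh_keepalive_timestamp_seconds`, the thresholds, and the alert names are hypothetical stand-ins for the custom metrics described above):

```yaml
groups:
  - name: mesh-keepalives
    rules:
      # one alert per mesh link; `instance` is the reporting node,
      # `peer` the node on the other side of the connection
      - alert: MeshLinkDown
        expr: time() - mesh_keepalive_timestamp_seconds > 60
      # fires per peer once more than 5 instances report it down
      - alert: MeshPeerDown
        expr: count by (peer) (time() - mesh_keepalive_timestamp_seconds > 60) > 5
```

The missing piece is then an inhibit rule where the source alert's `peer` is compared against the target alert's `instance`, which is exactly the cross-label equality this issue asks for.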
No solution for this problem? ;(
I have alerts for these and would like to see only r1:

```
probe_success{instance="10.0.0.1", hostname="r1"}
probe_success{instance="10.0.1.1", hostname="r2", remote="r1"}
probe_success{instance="10.0.2.1", hostname="r3", remote="r2"}
```
:+1:
We have the same kind of use case; it would be far easier than the workaround we have to use.
I'd like to be able to compare different labels, but also part of a label using a regex.
Something like this (rough idea of the syntax):

```yaml
- source_matchers:
    - alertname="ServerDown"
    - host=~"(.+)"
  target_matchers:
    - alertname="kafka_stream_not_consuming"
    - kafka_stream=~"my_stream_with_a_name_including_(.+)_a_host_identifier"
  equality:
    - source: host
      target: kafka_stream_$1
```
While the title basically says it all, I will try to back this up with a concrete example. Imagine a setup where, in addition to all routers being a target for some type of `blackbox_exporter`-style metrics, an additional source of data is used to generate JSON target lists for `file_sd`. In my example, this additional source could be a router's configuration backup, which gives us a definitive list of link addresses configured on any interface of any router, as well as any configured metadata for each interface. It is trivial to build a JSON file containing the targets (all link addresses configured locally) with the relevant labels:
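A minimal sketch of such a file, assuming the `hostname`/`remote` label scheme used throughout this thread (addresses and pairings are illustrative):

```json
[
  {
    "targets": ["10.0.0.1"],
    "labels": { "hostname": "r1", "remote": "r2" }
  },
  {
    "targets": ["10.0.1.1"],
    "labels": { "hostname": "r2", "remote": "r1" }
  }
]
```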
These metrics could end up looking like this (`10.0.0.0/8` are link addresses, `192.168.0.0/16` are loopbacks):
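Roughly like the samples below; the `ping-router-linknet` job name is an assumption, while `ping-router-loopback` comes from the alert expression that follows:

```
probe_success{job="ping-router-loopback", instance="192.168.0.1", hostname="r1"} 1
probe_success{job="ping-router-linknet", instance="10.0.0.1", hostname="r1", remote="r2"} 1
probe_success{job="ping-router-linknet", instance="10.0.1.1", hostname="r2", remote="r1"} 1
```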
Alerting in the most obvious way would create alerts similar to the job names, for instance a `RouterDown` alert with the expression `probe_success{job="ping-router-loopback"} == 0`.
I would obviously want an inhibition rule like the first sketch below. This would inhibit the alert informing me that an interface is down on a router which is already being alerted as down itself. I would, however, like to go one step further with an inhibit rule such as the second sketch below, as it comes as no surprise that any interface adjacent to the downed router will go down in turn, even though it is on another router/in another region/whatever.
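A sketch of the first rule, using current Alertmanager matcher syntax and the alert/label names from this thread:

```yaml
inhibit_rules:
  - source_matchers:
      - alertname = "RouterDown"
    target_matchers:
      - alertname = "InterfaceDown"
    # both alerts carry the router's own name in `hostname`
    equal: ["hostname"]
```

And the second, desired rule; the `equal_labels` key is invented syntax (one of the "proposed syntax variants"), comparing the source alert's `hostname` against the target alert's `remote`:

```yaml
  - source_matchers:
      - alertname = "RouterDown"
    target_matchers:
      - alertname = "InterfaceDown"
    # hypothetical: inhibit when source `hostname` equals target `remote`
    equal_labels:
      - source: hostname
        target: remote
```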
While the proposed syntax variants are just a general idea, I feel that this should be possible in some way which does not involve hacking around with the underlying alert expressions.