prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0
6.52k stars 2.13k forks source link

How to fix non unique GroupKeys? #3817

Open grobinson-grafana opened 3 months ago

grobinson-grafana commented 3 months ago

Background

GroupKey is a string that is derived from the matchers in the route (including any parent routes) and the labels in the group_by of the matching route.

There are a number of cases where GroupKey is not unique and its possible for two (or more) different groups to have the same GroupKey. The following YAML shows a configuration containing two routes that create groups with the same GroupKey:

receivers:
  - name: test1
  - name: test2
route:
  receiver: test1
  routes:
    - receiver: test1
      matchers:
        - foo=bar
      continue: true
    - receiver: test2
      matchers:
        - foo=bar
      mute_time_intervals:
        - name: weekends
      continue: true

The reason a user might have such a configuration is to mute notifications on the weekends, but still send webhooks to an issue tracker.

Problems

This creates a number of problems:

  1. In the case the receiver is also the same, then groups from these routes also share notification logs.
  2. If we want to use the GroupKey to mark groups (e.g. mute status) (#3513) then active groups can be incorrectly marked as muted.

Solutions

I've been thinking on the issue of the GroupKey being non-unique for a little while, and if its possible to make the GroupKey both stable and unique without adding new fields to the configuration file.

What do I mean by stable? I mean that the GroupKey should not depend on the position of the route relative to its siblings. This is the reason why we cannot use the RouteID when calculating the GroupKey, because as soon as the route is moved higher or lower in the configuration file the group's notification log is invalidated causing repeat notifications.

One option we have is to use a non cryptographic hash function (for example fnv) to calculate a hash using extra metadata from the route that is not included in the GroupKey at present. For example:

  1. Receiver name
  2. Mute time intervals
  3. Active time intervals
  4. Child routes

This would change how GroupKey is calculated from being a top-down path of routes from parent to child (i.e. route0/route1/route...N/groupLabels) to a bottom-up Merkle tree where the parents are derived from the hash of their children.

This means that when either the receiver, matchers, group by labels or active or mute time intervals are changed in a child route, the notification logs for the parent upwards are invalidated. This might be undesirable. The other issue I see with this solution is observability. Without adequate tools, it will be very hard to see which group is flushed in the logs just from its hash.

So perhaps it's better to just add an (optional) name to routes? If the name is absent, then Alertmanager uses the current mechanism (which can be non unique in some cases). If a unique GroupKey is required, then a name can be set on the route to discriminate between them.

grobinson-grafana commented 3 months ago

Another option is to add the receiver name to the RouteKey. This would be the least invasive change and fixes all but one case of non unique GroupKeys where the receiver name, matchers and group by are the same. In this case, the aggregation groups also share nflogs, even if some of the routes have active or mute time intervals.

Here is an example of how this could look for the following configuration file:

receivers:
  - name: test1
  - name: test2
route:
  receiver: test1
  routes:
    - receiver: test1
      matchers:
        - foo=bar
      continue: true
    - receiver: test2
      matchers:
        - foo=bar
      mute_time_intervals:
        - name: weekends
      continue: true

Without the change:

{}/{foo="bar"}:{}
{}/{foo="bar"}:{}

With the change:

(recv="test1",matchers={})/(recv=test1,matchers={foo="bar"}):{}
(recv="test1",matchers={})/(recv=test2,matchers={foo="bar"}):{}

This could also be shortened to something like:

(test1,{})/(test1,{foo="bar"}):{}
(test1,{})/(test2,{foo="bar"}):{}

~If we choose this option, we would also want to change the nflog interface to remove r *pb.Receiver as the receiver name is now included in gkey:~ We still need this as other metadata is required from *pb.Receiver.

func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []uint64, expiry time.Duration) error {
zecke commented 3 months ago

Any chance the example needs to have a "continue" to match the second route?

grobinson-grafana commented 3 months ago

Yes it does :) I missed that in the example!

zecke commented 3 months ago

Yes it does :) I missed that in the example!

What if we don't attempt to fix this without requiring extra configuration? What if a future version of AM refuses to accept a config where both GroupKey and Receiver are identical for two siblings? What is the users intention in creating such config?

We could offer a user a way to differentiate these two routes (extraGroupKey: newRoute)

grobinson-grafana commented 3 months ago

What if we don't attempt to fix this without requiring extra configuration? What if a future version of AM refuses to accept a config where both GroupKey and Receiver are identical for two siblings?

This is something that I've considered too. I like it a lot because it means we don't need to add extra configuration, but I'm also concerned about breaking configurations.

We could offer a user a way to differentiate these two routes

If we choose this option I propose adding a "name" field to the route, similar to what we have for receivers.

KA-ROM commented 1 month ago

Can we add a deduplication key instead? The default Pagerduty receiver has it. Something similar seems easier to create than keeping track of the intricacies of how hashed values from other fields track over time/user changes on configuration.

grobinson-grafana commented 1 month ago

Can we add a deduplication key instead? The default Pagerduty receiver has it. Something similar seems easier to create than keeping track of the intricacies of how hashed values from other fields track over time/user changes on configuration.

Yeah that's one of the options (named routes).