Open grobinson-grafana opened 3 months ago
Another option is to add the receiver name to the RouteKey
. This would be the least invasive change and fixes all but one case of non unique GroupKeys where the receiver name, matchers and group by are the same. In this case, the aggregation groups also share nflogs, even if some of the routes have active or mute time intervals.
Here is an example of how this could look for the following configuration file:
receivers:
- name: test1
- name: test2
route:
receiver: test1
routes:
- receiver: test1
matchers:
- foo=bar
continue: true
- receiver: test2
matchers:
- foo=bar
mute_time_intervals:
- name: weekends
continue: true
Without the change:
{}/{foo="bar"}:{}
{}/{foo="bar"}:{}
With the change:
(recv="test1",matchers={})/(recv=test1,matchers={foo="bar"}):{}
(recv="test1",matchers={})/(recv=test2,matchers={foo="bar"}):{}
This could also be shortened to something like:
(test1,{})/(test1,{foo="bar"}):{}
(test1,{})/(test2,{foo="bar"}):{}
~If we choose this option, we would also want to change the nflog interface to remove r *pb.Receiver
as the receiver name is now included in gkey
:~ We still need this as other metadata is required from *pb.Receiver
.
func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []uint64, expiry time.Duration) error {
Any chance the example needs to have a "continue" to match the second route?
Yes it does :) I missed that in the example!
Yes it does :) I missed that in the example!
What if we don't attempt to fix this without requiring extra configuration? What if a future version of AM refuses to accept a config where both GroupKey and Receiver are identical for two siblings? What is the users intention in creating such config?
We could offer a user a way to differentiate these two routes (extraGroupKey: newRoute
)
What if we don't attempt to fix this without requiring extra configuration? What if a future version of AM refuses to accept a config where both GroupKey and Receiver are identical for two siblings?
This is something that I've considered too. I like it a lot because it means we don't need to add extra configuration, but I'm also concerned about breaking configurations.
We could offer a user a way to differentiate these two routes
If we choose this option I propose adding a "name" field to the route, similar to what we have for receivers.
Can we add a deduplication key instead? The default Pagerduty receiver has it. Something similar seems easier to create than keeping track of the intricacies of how hashed values from other fields track over time/user changes on configuration.
Can we add a deduplication key instead? The default Pagerduty receiver has it. Something similar seems easier to create than keeping track of the intricacies of how hashed values from other fields track over time/user changes on configuration.
Yeah that's one of the options (named routes).
Background
GroupKey is a string that is derived from the matchers in the route (including any parent routes) and the labels in the
group_by
of the matching route.There are a number of cases where GroupKey is not unique and its possible for two (or more) different groups to have the same GroupKey. The following YAML shows a configuration containing two routes that create groups with the same GroupKey:
The reason a user might have such a configuration is to mute notifications on the weekends, but still send webhooks to an issue tracker.
Problems
This creates a number of problems:
Solutions
I've been thinking on the issue of the GroupKey being non-unique for a little while, and if its possible to make the GroupKey both stable and unique without adding new fields to the configuration file.
What do I mean by stable? I mean that the GroupKey should not depend on the position of the route relative to its siblings. This is the reason why we cannot use the RouteID when calculating the GroupKey, because as soon as the route is moved higher or lower in the configuration file the group's notification log is invalidated causing repeat notifications.
One option we have is to use a non cryptographic hash function (for example fnv) to calculate a hash using extra metadata from the route that is not included in the GroupKey at present. For example:
This would change how GroupKey is calculated from being a top-down path of routes from parent to child (i.e.
route0/route1/route...N/groupLabels
) to a bottom-up Merkle tree where the parents are derived from the hash of their children.This means that when either the receiver, matchers, group by labels or active or mute time intervals are changed in a child route, the notification logs for the parent upwards are invalidated. This might be undesirable. The other issue I see with this solution is observability. Without adequate tools, it will be very hard to see which group is flushed in the logs just from its hash.
So perhaps it's better to just add an (optional) name to routes? If the name is absent, then Alertmanager uses the current mechanism (which can be non unique in some cases). If a unique GroupKey is required, then a name can be set on the route to discriminate between them.