Open ShreyanshMehta opened 1 year ago
It looks like this is the same issue as #3407
I would say it's closer to #2492 and #3037, since the issue you reference is specifically about delivering incomplete notifications. That said, yeah, fundamentally all of these stem from the Alertmanager data model not playing well with config changes.
Yeah. I agree with @artli.
@RichardWarburton I can work on this issue. Now I am wondering what the right solution would be. I've been thinking along these lines: when the `/-/reload` endpoint is called, Alertmanager should send notifications based on the old configuration for all existing alerts in its cache and then empty the cache. We should not get notifications for alerts that we have previously received.
Would it be possible to share debug logs for this? I was not able to reproduce it in Alertmanager. I have the following route with `group_wait`, `group_interval` and `repeat_interval`:
```yaml
route:
  receiver: email
  group_wait: 15s
  group_interval: 1m
  repeat_interval: 5m
```
```shell
curl -H "Content-Type: application/json" http://127.0.0.1:9093/api/v2/alerts -d '[{"labels":{"foo":"bar"}}]'
```
```
ts=2023-07-05T14:29:47.287Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002189959s
ts=2023-07-05T14:29:59.496Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:30:14.497Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-07-05T14:30:14.645Z caller=notify.go:752 level=debug component=dispatcher receiver=email integration=email[0] msg="Notify success" attempts=1
```
```shell
curl -H "Content-Type: application/json" -XPOST http://127.0.0.1:9093/-/reload
```
The notification is not resent because `repeat_interval` (5m) has not elapsed:
```
ts=2023-07-05T14:30:27.988Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=config.yml
ts=2023-07-05T14:30:27.988Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=config.yml
ts=2023-07-05T14:30:27.990Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:30:27.990Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-07-05T14:31:27.989Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
```
Following up on my previous message: I think I understand the issue.
The aggregation group needs to change (i.e. a new alert is added to an existing group) between the last notification and the configuration being reloaded. When the group is flushed, a new notification will be sent because of the check at https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L574.
```
ts=2023-07-05T14:37:58.337Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002223083s
ts=2023-07-05T14:37:58.628Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:38:13.628Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-07-05T14:38:13.766Z caller=notify.go:752 level=debug component=dispatcher receiver=email integration=email[0] msg="Notify success" attempts=1
ts=2023-07-05T14:38:15.402Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[6ad8c19][active]
ts=2023-07-05T14:38:20.339Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=config.yml
ts=2023-07-05T14:38:20.339Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=config.yml
ts=2023-07-05T14:38:20.341Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:38:20.341Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[6ad8c19][active]
ts=2023-07-05T14:38:20.341Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[[3fff2c2][active] [6ad8c19][active]]"
ts=2023-07-05T14:38:20.472Z caller=notify.go:752 level=debug component=dispatcher receiver=email integration=email[0] msg="Notify success" attempts=1
```
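To make that concrete, here's a rough Go sketch of the dedup idea behind that check (the names `hashFiring` and `needsUpdate` are illustrative, not the actual Alertmanager code): the notification log records a hash of the firing alerts per group, and a flush resends whenever the current firing set differs from the recorded one, even inside `repeat_interval`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// hashFiring stands in for the hash of the currently firing alerts
// that the notification log stores for a group/receiver pair.
func hashFiring(fingerprints []string) string {
	sorted := append([]string(nil), fingerprints...)
	sort.Strings(sorted)
	return strings.Join(sorted, "|")
}

// needsUpdate mirrors the idea behind the check in notify.go:
// resend when the firing set differs from what was last logged,
// regardless of whether repeat_interval has elapsed.
func needsUpdate(loggedHash string, firing []string) bool {
	return loggedHash != hashFiring(firing)
}

func main() {
	logged := hashFiring([]string{"3fff2c2"})
	// Same alerts as last time: suppressed until repeat_interval.
	fmt.Println(needsUpdate(logged, []string{"3fff2c2"})) // false
	// A new alert (6ad8c19) joined the group: hash differs, notify again.
	fmt.Println(needsUpdate(logged, []string{"3fff2c2", "6ad8c19"})) // true
}
```

That matches the logs above: the second flush fires only once `6ad8c19` joins the group.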
@grobinson-grafana To reproduce the issue, you should use the `group_by` field in your route and send some alerts. Then immediately change that `group_by` condition (add or remove labels) and call the `/-/reload` endpoint. You will receive the same alerts at your receiver again.
Thanks, I was also able to reproduce the issue here!
Yeah, this one's probably hard to fix, because after you change the core config of your group, Alertmanager pretty much starts treating it as a completely different group. This makes some sense: if you change `group_by` or `matchers` in some of your groups, the way your existing alerts are grouped can change drastically, so it's very hard to think of a general way to correlate the groups before the change with the groups after the change.
As an illustration, you can see that the notification log (which is used for deduplicating notifications) keys notifications by `groupKey` + `receiverKey`: https://github.com/prometheus/alertmanager/blob/487db1383b8cc5c2867c77f110431605bb8ce247/nflog/nflog.go#L442C3-L442C3. And `groupKey` depends on the labels of this particular instance of the group, which obviously depend on what you've configured it to group by (https://github.com/prometheus/alertmanager/blob/487db1383b8cc5c2867c77f110431605bb8ce247/dispatch/dispatch.go#L358), and also on `routeKey`, which depends on the matchers specified in the config (https://github.com/prometheus/alertmanager/blob/487db1383b8cc5c2867c77f110431605bb8ce247/dispatch/route.go#L171C6-L171C6). I don't think there's an easy way around this except accepting that a change in group configuration essentially creates a new group.
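A toy Go version of that key composition (function names and the exact string format are assumptions for illustration, not the real dispatch code) shows why a reload invalidates the old nflog entries:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// routeKey sketches the part of the key derived from the route's matchers.
func routeKey(matchers []string) string {
	sorted := append([]string(nil), matchers...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// groupKey sketches the full dedup key: routeKey plus the group labels
// selected by group_by. Changing either part yields a brand-new key, so
// the old nflog entry no longer suppresses anything.
func groupKey(matchers []string, groupLabels map[string]string) string {
	pairs := make([]string, 0, len(groupLabels))
	for k, v := range groupLabels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return routeKey(matchers) + ":{" + strings.Join(pairs, ",") + "}"
}

func main() {
	// Same alert, but group_by changed from foo to team after a reload:
	before := groupKey([]string{"severity=page"}, map[string]string{"foo": "bar"})
	after := groupKey([]string{"severity=page"}, map[string]string{"team": "a"})
	fmt.Println(before == after) // false: the nflog entry under `before` no longer applies
}
```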
> When the `/-/reload` endpoint is called, Alertmanager should send notifications based on the old configurations for all existing alerts in its cache and empty its cache
IIUC, this would mean that after the config is reloaded, Alertmanager will essentially forget about any alerts it has seen before, which sounds pretty undesirable. For example, if alerts are generally being pushed into Alertmanager every minute with GroupWait=30s, if a config reload makes Alertmanager forget all alerts it knows about, after 30s it might start sending incorrect notifications based on an incomplete understanding of the world.
Agreed @artli.
Alertmanager should additionally keep track of the alerts that have already been notified and take that into account when delivering notifications after a reload. I feel this could be an ideal fix for this issue. We could maintain a cache of type `map[string]struct{}` where the key is the fingerprint of the alert. Just before delivering any alert, we can do a lookup in the cache and send the notification accordingly.
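A minimal sketch of that suggestion (hypothetical code, not anything that exists in Alertmanager today):

```go
package main

import "fmt"

// notifiedCache is the suggested map[string]struct{} keyed by alert
// fingerprint: alerts we have already delivered at least once.
type notifiedCache map[string]struct{}

// shouldNotify reports whether the alert has not been delivered yet,
// and records it as delivered.
func (c notifiedCache) shouldNotify(fingerprint string) bool {
	if _, seen := c[fingerprint]; seen {
		return false
	}
	c[fingerprint] = struct{}{}
	return true
}

func main() {
	cache := notifiedCache{}
	fmt.Println(cache.shouldNotify("3fff2c2")) // true: first delivery goes through
	// ... config reloaded, same alert re-dispatched ...
	fmt.Println(cache.shouldNotify("3fff2c2")) // false: suppressed after the reload
}
```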
Unfortunately, I don't think that would work (or I don't understand the suggestion fully), because sometimes the routing tree is composed in such a way that a single alert needs to be delivered to multiple receivers through multiple aggregation groups. This can be the case if you use `continue: true` in your routing tree. Moreover, when the config is reloaded, it's totally possible that the new config routes an existing alert to a completely new aggregation group with different receivers and a different grouping, so we do actually need to send a new notification for it even if we've already sent one before.
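To illustrate the fan-out problem with a cache keyed only by fingerprint (hypothetical sketch, names are made up):

```go
package main

import "fmt"

// fingerprintCache dedups purely on alert fingerprint, ignoring which
// receiver the delivery is for. That is exactly its flaw.
type fingerprintCache map[string]struct{}

// deliver reports whether the notification would be sent. The receiver
// argument is deliberately ignored, mirroring the flawed scheme.
func deliver(cache fingerprintCache, fingerprint, receiver string) bool {
	if _, seen := cache[fingerprint]; seen {
		return false // wrongly suppressed: dedup never looks at the receiver
	}
	cache[fingerprint] = struct{}{}
	return true
}

func main() {
	cache := fingerprintCache{}
	// The same alert matches two routes because of continue: true.
	fmt.Println(deliver(cache, "3fff2c2", "email"))     // true: delivered
	fmt.Println(deliver(cache, "3fff2c2", "pagerduty")) // false: delivery lost
}
```

This is why the real notification log keys on `groupKey` + `receiverKey` rather than on the alert alone.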
I installed Prometheus with the default email config credentials, then changed the credentials and reloaded Alertmanager. My emails are being sent with both the old and the new credentials. My route intervals are:

```yaml
route:
  receiver: email-receiver
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
```

I need emails to be sent only to the updated email IDs. Can you please check and let us know why this happens?
How to recreate the scenario:
Send some alerts, then change the grouping configuration (`group_by` or `matchers`) and call the `/-/reload` endpoint. Now, you will receive the notifications for the same alerts again.
NOTE: These behaviours are only observed when the same alerts exist in Alertmanager's cache and the configuration is reloaded.
Expected Behavior:
We should not get notifications for alerts that we have previously received.
Solutions:
1. When the `/-/reload` endpoint is called, Alertmanager should send notifications based on the old configuration for all existing alerts in its cache and then empty the cache.
2. Introduce a `/-/clear` endpoint that explicitly empties the alert cache.