Open ShreyanshMehta opened 1 year ago
It looks like this is the same issue as #3407
I would say it's closer to #2492 and #3037, since the issue you reference is specifically about delivering incomplete notifications. That said, yeah, fundamentally all of these stem from the Alertmanager data model not playing well with config changes.
Yeah. I agree with @artli.
@RichardWarburton I can work on this issue. Now I am wondering what the right solution would be. I've been thinking along these lines: when the `/-/reload` endpoint is called, Alertmanager should send notifications based on the old configuration for all existing alerts in its cache and then empty the cache. We should not get notifications for alerts that we have previously received.
Would it be possible to share debug logs for this? I was not able to reproduce it in Alertmanager. I have the following route with `group_wait`, `group_interval` and `repeat_interval`:
```yaml
route:
  receiver: email
  group_wait: 15s
  group_interval: 1m
  repeat_interval: 5m
```
```shell
curl -H "Content-Type: application/json" http://127.0.0.1:9093/api/v2/alerts -d '[{"labels":{"foo":"bar"}}]'
```
```
ts=2023-07-05T14:29:47.287Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002189959s
ts=2023-07-05T14:29:59.496Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:30:14.497Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-07-05T14:30:14.645Z caller=notify.go:752 level=debug component=dispatcher receiver=email integration=email[0] msg="Notify success" attempts=1
```
```shell
curl -H "Content-Type: application/json" -XPOST http://127.0.0.1:9093/-/reload
```
The notification is not resent because `repeat_interval` (5m) has not elapsed:
```
ts=2023-07-05T14:30:27.988Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=config.yml
ts=2023-07-05T14:30:27.988Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=config.yml
ts=2023-07-05T14:30:27.990Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:30:27.990Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-07-05T14:31:27.989Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
```
Following up on my previous message: I think I understand the issue.
The aggregation group needs to change (i.e. a new alert is added to an existing group) between the last notification and the configuration being reloaded. When the group is flushed, a new notification will be sent because of the check at https://github.com/prometheus/alertmanager/blob/main/notify/notify.go#L574.
```
ts=2023-07-05T14:37:58.337Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002223083s
ts=2023-07-05T14:37:58.628Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:38:13.628Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
ts=2023-07-05T14:38:13.766Z caller=notify.go:752 level=debug component=dispatcher receiver=email integration=email[0] msg="Notify success" attempts=1
ts=2023-07-05T14:38:15.402Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[6ad8c19][active]
ts=2023-07-05T14:38:20.339Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=config.yml
ts=2023-07-05T14:38:20.339Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=config.yml
ts=2023-07-05T14:38:20.341Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-07-05T14:38:20.341Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[6ad8c19][active]
ts=2023-07-05T14:38:20.341Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[[3fff2c2][active] [6ad8c19][active]]"
ts=2023-07-05T14:38:20.472Z caller=notify.go:752 level=debug component=dispatcher receiver=email integration=email[0] msg="Notify success" attempts=1
```
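To make that concrete, here's a rough Go sketch of the dedup idea behind that check (the names `hashFiring` and `needsUpdate` are illustrative, not the actual Alertmanager code): the notification log records a hash of the firing alerts per group, and a flush resends whenever the current firing set differs from the recorded one, even inside `repeat_interval`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// hashFiring stands in for the hash of the currently firing alerts
// that the notification log stores for a group/receiver pair.
func hashFiring(fingerprints []string) string {
	sorted := append([]string(nil), fingerprints...)
	sort.Strings(sorted)
	return strings.Join(sorted, "|")
}

// needsUpdate mirrors the idea behind the check in notify.go:
// resend when the firing set differs from what was last logged,
// regardless of whether repeat_interval has elapsed.
func needsUpdate(loggedHash string, firing []string) bool {
	return loggedHash != hashFiring(firing)
}

func main() {
	logged := hashFiring([]string{"3fff2c2"})
	// Same alerts as last time: suppressed until repeat_interval.
	fmt.Println(needsUpdate(logged, []string{"3fff2c2"})) // false
	// A new alert (6ad8c19) joined the group: hash differs, notify again.
	fmt.Println(needsUpdate(logged, []string{"3fff2c2", "6ad8c19"})) // true
}
```

That matches the logs above: the second flush fires only once `6ad8c19` joins the group.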
@grobinson-grafana To reproduce the issue, you should use the `group_by` field in your route and send some alerts. Then immediately change that `group_by` condition (add or remove labels) and call the `/-/reload` endpoint. You will receive the same alerts at your receiver again.
Thanks, I was also able to reproduce the issue here!
Yeah, this one's probably hard to fix, because after you change the core config of your group, Alertmanager pretty much starts treating it as a completely different group. This makes some sense: if you change `group_by` or `matchers` in some of your groups, the way your existing alerts are grouped can change drastically, so it's very hard to think of a general way to correlate the groups before the change with the groups after the change.
As an illustration, you can see that the notification log (which is used for deduplicating notifications) keys notifications by `groupKey` + `receiverKey`: https://github.com/prometheus/alertmanager/blob/487db1383b8cc5c2867c77f110431605bb8ce247/nflog/nflog.go#L442C3-L442C3. And `groupKey` depends on the labels of this particular instance of the group, which obviously depend on what you've configured it to group by (https://github.com/prometheus/alertmanager/blob/487db1383b8cc5c2867c77f110431605bb8ce247/dispatch/dispatch.go#L358), and also on `routeKey`, which depends on the matchers specified in the config (https://github.com/prometheus/alertmanager/blob/487db1383b8cc5c2867c77f110431605bb8ce247/dispatch/route.go#L171C6-L171C6). I don't think there's an easy way around this except accepting that a change in group configuration essentially creates a new group.
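A toy Go version of that key composition (function names and the exact string format are assumptions for illustration, not the real dispatch code) shows why a reload invalidates the old nflog entries:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// routeKey sketches the part of the key derived from the route's matchers.
func routeKey(matchers []string) string {
	sorted := append([]string(nil), matchers...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// groupKey sketches the full dedup key: routeKey plus the group labels
// selected by group_by. Changing either part yields a brand-new key, so
// the old nflog entry no longer suppresses anything.
func groupKey(matchers []string, groupLabels map[string]string) string {
	pairs := make([]string, 0, len(groupLabels))
	for k, v := range groupLabels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return routeKey(matchers) + ":{" + strings.Join(pairs, ",") + "}"
}

func main() {
	// Same alert, but group_by changed from foo to team after a reload:
	before := groupKey([]string{"severity=page"}, map[string]string{"foo": "bar"})
	after := groupKey([]string{"severity=page"}, map[string]string{"team": "a"})
	fmt.Println(before == after) // false: the nflog entry under `before` no longer applies
}
```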
> When the `/-/reload` endpoint is called, Alertmanager should send notifications based on the old configurations for all existing alerts in its cache and empty its cache
IIUC, this would mean that after the config is reloaded, Alertmanager will essentially forget about any alerts it has seen before, which sounds pretty undesirable. For example, if alerts are generally being pushed into Alertmanager every minute with GroupWait=30s, if a config reload makes Alertmanager forget all alerts it knows about, after 30s it might start sending incorrect notifications based on an incomplete understanding of the world.
Agreed @artli.
Alertmanager should additionally keep track of the alerts that have already been notified and take that into account when delivering notifications after a reload. I feel this could be an ideal fix for this issue. We could maintain a cache of type `map[string]struct{}` where the key is the fingerprint of the alert. Just before delivering any alert, we can do a lookup in the cache and send the notification accordingly.
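A minimal sketch of that suggestion (hypothetical code, not anything that exists in Alertmanager today):

```go
package main

import "fmt"

// notifiedCache is the suggested map[string]struct{} keyed by alert
// fingerprint: alerts we have already delivered at least once.
type notifiedCache map[string]struct{}

// shouldNotify reports whether the alert has not been delivered yet,
// and records it as delivered.
func (c notifiedCache) shouldNotify(fingerprint string) bool {
	if _, seen := c[fingerprint]; seen {
		return false
	}
	c[fingerprint] = struct{}{}
	return true
}

func main() {
	cache := notifiedCache{}
	fmt.Println(cache.shouldNotify("3fff2c2")) // true: first delivery goes through
	// ... config reloaded, same alert re-dispatched ...
	fmt.Println(cache.shouldNotify("3fff2c2")) // false: suppressed after the reload
}
```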
Unfortunately, I don't think that would work (or I don't understand the suggestion fully), because sometimes the routing tree is composed in such a way that a single alert needs to be delivered to multiple receivers through multiple aggregation groups. This can be the case if you use `continue: true` in your routing tree. Moreover, when the config is reloaded, it's totally possible that the new config routes an existing alert to a completely new aggregation group with different receivers and a different grouping, so we do actually need to send a new notification for it even if we've already sent one before.
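To illustrate the fan-out problem with a cache keyed only by fingerprint (hypothetical sketch, names are made up):

```go
package main

import "fmt"

// fingerprintCache dedups purely on alert fingerprint, ignoring which
// receiver the delivery is for. That is exactly its flaw.
type fingerprintCache map[string]struct{}

// deliver reports whether the notification would be sent. The receiver
// argument is deliberately ignored, mirroring the flawed scheme.
func deliver(cache fingerprintCache, fingerprint, receiver string) bool {
	if _, seen := cache[fingerprint]; seen {
		return false // wrongly suppressed: dedup never looks at the receiver
	}
	cache[fingerprint] = struct{}{}
	return true
}

func main() {
	cache := fingerprintCache{}
	// The same alert matches two routes because of continue: true.
	fmt.Println(deliver(cache, "3fff2c2", "email"))     // true: delivered
	fmt.Println(deliver(cache, "3fff2c2", "pagerduty")) // false: delivery lost
}
```

This is why the real notification log keys on `groupKey` + `receiverKey` rather than on the alert alone.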
I installed Prometheus with the default email config credentials, then changed the credentials and reloaded Alertmanager. My emails are being sent with both the old and the new credentials. My route intervals are:

```yaml
route:
  receiver: email-receiver
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
```

I need emails to be sent only to the updated email IDs. Can you please check and let us know why this happens?
How to recreate the scenario:
Send some alerts, then change the grouping configuration (`group_by` or `matchers`) and call the `/-/reload` endpoint. Now, you will receive the notifications for the same alerts again.
NOTE: These behaviours are only observed when the same alerts exist in Alertmanager's cache and the configuration is reloaded.
Expected Behavior:
We should not get notifications for alerts that we have previously received.
Solutions:
1. When the `/-/reload` endpoint is called, Alertmanager should send notifications based on the old configuration for all existing alerts in its cache and then empty the cache.
2. Introduce a `/-/clear` endpoint that explicitly empties the alert cache.