Open joernkleinbub opened 3 years ago
This is something that I have experienced before. I think it has to do with the ratio between the configured group_wait and group_interval. External factors like network latency between your Alertmanagers can also play a role. I have had good experience with setting group_interval to at least 4 times the value of group_wait.
Let me know if that solves the immediate issue; if you still experience resends, I would suggest raising it even further. Meanwhile I will try to understand the related code sections and see if there is more that can be done.
"I had some good experience with setting the group_interval to at least 4 times the value of group_wait."
Thanks. That solves the problem. Now, I get the alert and the resolve notification only once.
The Alertmanager config now looks like this:
...
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 40s
  repeat_interval: 1h
  receiver: 'web.hook'
...
Thanks for the detailed report. Yes, short group_interval values are prone to notifications being resent after they resolve. This is because of how the clustering works: every Alertmanager instance delays the group processing based on its position in the cluster. In practice, the first instance will start immediately, the second instance will wait 1 x 15 seconds, the third instance will wait 2 x 15 seconds, and so on.
You can see it from the n3 logs:
level=debug ts=2021-01-07T08:38:55.045Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertname=\"DemoAlert\"}" msg=flushing alerts=[DemoAlert[8a07086][active]]
level=debug ts=2021-01-07T08:39:00.038Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=DemoAlert[8a07086][resolved]
level=debug ts=2021-01-07T08:39:00.227Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"DemoAlert\\\"}\" receiver:<group_name:\"web.hook\" integration:\"webhook\" > timestamp:<seconds:1610008740 nanos:188592422 > resolved_alerts:3742573726221535510 > expires_at:<seconds:1610440740 nanos:188592422 > "
level=debug ts=2021-01-07T08:39:10.618Z caller=notify.go:685 component=dispatcher receiver=web.hook integration=webhook[0] msg="Notify success" attempts=1
08:38:55.045: Alertmanager flushes the aggregation group while the alert is still active.
08:39:00.038: it receives the resolved alert.
08:39:00.227: it receives the notification log entry from the other Alertmanager instance that already sent the resolved notification.
08:39:10.618: it sends the firing notification that was initiated at 08:38:55.045.
The 15 seconds value comes from the --cluster.peer-timeout argument. You could tweak it, but at the risk of getting even more duplicated notifications if your receiver or the replication of notification logs between instances lags behind.
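For reference, the staggering described above boils down to a simple calculation. The following is only an illustrative Go sketch (not a copy of the Alertmanager source), assuming the delay is the peer's position in the cluster multiplied by --cluster.peer-timeout:

package main

import (
	"fmt"
	"time"
)

// clusterWait sketches how the per-instance flush delay can be derived
// from the peer's position in the cluster and --cluster.peer-timeout.
func clusterWait(position int, peerTimeout time.Duration) time.Duration {
	return time.Duration(position) * peerTimeout
}

func main() {
	// With the default 15s peer timeout: peer 0 waits 0s, peer 1 waits 15s,
	// peer 2 waits 30s, matching the staggering described above.
	for pos := 0; pos < 3; pos++ {
		fmt.Printf("peer %d waits %s before notifying\n", pos, clusterWait(pos, 15*time.Second))
	}
}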
I'll keep the issue open; it's something that Alertmanager should at least be able to surface to users.
@simonpasquier that makes a lot of sense. I spent quite a bit of time understanding what's going on, and this roughly matches the understanding I arrived at.
I was thinking that one possible way of preventing these "peer-timeout delayed notifications" from generating duplicates would be to look at more nflog.Entries instead of only the last one per groupKey (https://github.com/prometheus/alertmanager/blob/e6824a31100bd59308ad769a71371274455c0914/nflog/nflog.go#L450), as part of the DedupStage notification step.
I think if we considered the latest nflog.Entries per alertHash per groupKey, we should be able to tell whether an alert has been resolved in the meantime; a rough sketch of the idea follows below.
Do you think this could be a sensible approach?
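To make the idea a bit more concrete, here is a rough, illustrative Go sketch; the entry type and the resolvedSince helper are made up for the example, and only the FiringAlerts/ResolvedAlerts shape mirrors the notification-log entries visible in the gossip log above:

package main

import "fmt"

// entry is a stand-in for the relevant parts of a notification-log entry
// (the real nflog protobuf Entry carries FiringAlerts/ResolvedAlerts as
// uint64 hashes, as seen in the "gossiping new entry" log line above).
type entry struct {
	FiringAlerts   []uint64
	ResolvedAlerts []uint64
}

// resolvedSince reports whether an alert hash shows up as resolved in any
// notification-log entry newer than the flush that is about to be sent.
// A DedupStage along these lines would look at the latest entries per
// alertHash per groupKey instead of only the single most recent entry.
func resolvedSince(alertHash uint64, newerEntries []entry) bool {
	for _, e := range newerEntries {
		for _, h := range e.ResolvedAlerts {
			if h == alertHash {
				return true
			}
		}
	}
	return false
}

func main() {
	// A peer already logged this alert hash as resolved...
	newer := []entry{{ResolvedAlerts: []uint64{0x8a07086}}}
	// ...so re-sending it as firing from the delayed instance could be skipped.
	fmt.Println(resolvedSince(0x8a07086, newer)) // true
}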
I would need to put more thought into this, but the underlying principle is that the recipient is guaranteed to get the eventual state of the notification. In that respect, the current trade-off is indeed that the remote system can sometimes receive conflicting notifications.
What did you do? I set up Prometheus and Alertmanager in HA on 3 Vagrant nodes on my local machine for testing purposes. I configured a single DemoAlert "sum(prometheus_build_info) < 3" with a webhook receiver and triggered the alert by killing one of the Prometheus instances.
What did you expect to see? Chronological:
What did you see instead? Under which circumstances? Chronological (for notifications and logs, please see below):
This behaviour is generally reproducible: alert notifications are always resent after the resolve notification.
Environment
System information: Linux 4.9.0-12-amd64 x86_64 Vagrant/Virtualbox nodes
Alertmanager version: vagrant@n3:~$ alertmanager-0.21.0.linux-amd64/alertmanager --version alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d) build user: root@dee35927357f build date: 20200617-08:54:02 go version: go1.14.4
Prometheus version: vagrant@n3:~$ prometheus-2.24.0-rc.0.linux-amd64/prometheus --version prometheus, version 2.24.0-rc.0 (branch: HEAD, revision: 0cb133aaa5c3bcb0fcbfb0b34c256fd3483a9bfd) build user: root@fdb12576957f build date: 20201230-15:28:10 go version: go1.15.6 platform: linux/amd64
Alertmanager configuration file: Start example:
./alertmanager-0.21.0.linux-amd64/alertmanager --cluster.advertise-address=172.20.20.10:9094 --log.level=debug --config.file=./alertmanager-0.21.0.linux-amd64/alertmanager.yml --cluster.peer=172.20.20.11:9094 --cluster.peer=172.20.20.12:9094 &>> alertmanager.log &
Prometheus configuration file: Start example:
./prometheus-2.24.0-rc.0.linux-amd64/prometheus --config.file=./prometheus-2.24.0-rc.0.linux-amd64/prometheus.yml --web.enable-lifecycle &
rule file:
Logs:
node '172.20.20.10:9090' = n1
node '172.20.20.11:9090' = n2
Alertmanager node '172.20.20.12:9090' = n3
WebHook events from Pipedream:
Alert Notification 2021-01-07T09:38:11.059+01:00 from "http://n3:9093"
Resolved Notification 2021-01-07T09:39:00.175+01:00 from "http://n3:9093"
Alert Notification 2021-01-07T09:39:10.572+01:00 from "http://n2:9093"
Alert Notification 2021-01-07T09:39:10.627+01:00 from "http://n1:9093"
Resolve Notification 2021-01-07T09:39:25.801+01:00 from "http://n1:9093"