Open cycwll opened 4 years ago
I had some luck minimizing duplicate notifications by tuning the --cluster.pushpull-interval
and --cluster.gossip-interval
flags in the alertmanager startup command to values other than the defaults. I started from the defaults of 1m0s
and 200ms
respectively, varied them widely until I got either more or fewer notifications, then slowly narrowed them down. It was quite painstaking.
To me, it seems to be related to the latency between the alertmanagers over the wire. For instance, I have 4 alertmanagers communicating over a tunnel between NYC and CA: sometimes the link is fast, but sometimes, because of high ISP latency, their communication is slow. It would be nice to know if you have the same luck. I still occasionally get 2 or 3 duplicate notifications, but I'd rather get multiple alerts than none.
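For reference, a hedged sketch of what such tuning might look like. The interval values below are illustrative starting points, not recommendations from this thread, and the config path and peer name are placeholders:

```shell
# Illustrative only: start from the defaults (pushpull 1m0s, gossip 200ms)
# and adjust gradually, watching whether duplicates increase or decrease.
/bin/alertmanager \
  --config.file=/etc/alertmanager/config.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=peer01:9094 \
  --cluster.pushpull-interval=30s \
  --cluster.gossip-interval=400ms
```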
@devinodaniel thanks for your help. My nodes are on the same LAN, so latency between the alertmanagers is low. At present I receive repeated notifications with low probability (about 5%). From your description it seems repeated notifications are inevitable, but I will try your suggestions.
@devinodaniel Hello, I encountered the same problem. I changed the parameters to the following configuration and it still didn’t get better. Are there any other parameters that I need to pay attention to? My version is v0.24.0.
What did you do?
3 Prometheus nodes for HA, 3 Alertmanager nodes for HA.
alert01 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom01:9094
alert02 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom02:9094 --cluster.peer=prom01:9094
alert03 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom03:9094 --cluster.peer=prom01:9094
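One hardening step worth trying (my own suggestion, not something confirmed in this thread) is to give every node the full peer list rather than pointing alert02 and alert03 only at prom01, so cluster formation does not depend on a single node. The --cluster.peer flag is repeatable; a sketch using the hostnames above:

```shell
# Sketch: each Alertmanager lists all other nodes as peers, so the cluster
# can still form if prom01 happens to be down when the others start.
# On prom01:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager \
  --cluster.listen-address=prom01:9094 --cluster.peer=prom02:9094 --cluster.peer=prom03:9094
# On prom02:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager \
  --cluster.listen-address=prom02:9094 --cluster.peer=prom01:9094 --cluster.peer=prom03:9094
# On prom03:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager \
  --cluster.listen-address=prom03:9094 --cluster.peer=prom01:9094 --cluster.peer=prom02:9094
```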
And "Cluster Status" shows ready on all Alertmanager nodes.
What did you expect to see? When an instance-down alert fires, only one notification is received.
What did you see instead? Under which circumstances? When an instance-down alert fires, sometimes two notifications are received (sometimes only one).
Environment
System information:
Linux 4.12.14-94.41-default x86_64
Alertmanager version:
alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d) build user: root@00c3106655f8 build date: 20191211-14:13:14 go version: go1.13.5
prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec) build user: root@7ea0ae865f12 build date: 20200213-23:50:02 go version: go1.13.8
Alertmanager configuration file:
Prometheus configuration file:
Logs:
level=debug ts=2020-03-31T23:54:24.565Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "
level=debug ts=2020-03-31T23:54:43.838Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.37:9094\n"
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Stream connection from=10.188.53.150:42816\n"
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:55.781Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"
alert03 logs: at 23:54:24, node-03 sent a notification, and then, at 23:54:55, it received firing_alerts from node-01.
level=debug ts=2020-03-31T23:54:24.495Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"
level=debug ts=2020-03-31T23:54:43.839Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Stream connection from=10.188.53.29:40128\n"
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:55.813Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "
alert02 logs: received msg="gossiping new entry" from node-03 and node-01 at 23:54:24 and 23:54:55 respectively.
level=debug ts=2020-03-31T23:54:24.564Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.29:9094\n"
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.582Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:55.811Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "
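The roughly 31-second gap between alert03's notification (23:54:24) and alert01's (23:54:55) looks consistent with Alertmanager's HA staggering, where, to my understanding (an assumption, not stated in this thread), each peer delays sending by about its cluster position times --cluster.peer-timeout (default 15s), and fires its own copy if the notification-log gossip from an earlier peer has not arrived by then. A minimal sketch of that arithmetic:

```shell
# Sketch under the assumption above: peer at position i delays i * peer_timeout
# before notifying, giving earlier peers' gossip time to arrive.
peer_timeout=15  # seconds; --cluster.peer-timeout default
for position in 0 1 2; do
  echo "peer at position $position waits $((position * peer_timeout))s before notifying"
done
# A peer only skips sending if the notification-log entry gossiped by an
# earlier peer reaches it within that window; slow gossip means duplicates.
```

Under this model, slow gossip propagation (or flapping membership) would explain why duplicates appear intermittently rather than on every alert.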