prometheus / alertmanager

Alertmanager cluster send duplicate notification #2222

Open cycwll opened 4 years ago

cycwll commented 4 years ago

What did you do? 3 Prometheus nodes and 3 Alertmanager nodes, both set up for HA.

alert01 startup command: /bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom01:9094

alert02 startup command: /bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom02:9094 --cluster.peer=prom01:9094

alert03 startup command: /bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom03:9094 --cluster.peer=prom01:9094

And "Cluster Status" status is ready on all alertmanager node.

What did you expect to see? When an instance-down alert fires, only one notification should be received.

What did you see instead? Under which circumstances? When an instance-down alert fires, two notifications are sometimes received (and sometimes only one).

Environment

alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
build user: root@00c3106655f8
build date: 20191211-14:13:14
go version: go1.13.5

prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec)
build user: root@7ea0ae865f12
build date: 20200213-23:50:02
go version: go1.13.8

alert01 logs: it logged the entry gossiped by node-03 at 23:54:24, and then at 23:54:55 it still sent its own notification.

1)
level=debug ts=2020-03-31T23:54:24.565Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "
level=debug ts=2020-03-31T23:54:43.838Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.37:9094\n"
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Stream connection from=10.188.53.150:42816\n"
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]

2)
level=debug ts=2020-03-31T23:54:55.781Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"

alert03 logs: at 23:54:24, node-03 sent a notification, and then at 23:54:55 it received the firing_alerts entry gossiped by node-01.

1)
level=debug ts=2020-03-31T23:54:24.495Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"
level=debug ts=2020-03-31T23:54:43.839Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Stream connection from=10.188.53.29:40128\n"
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]

2)
level=debug ts=2020-03-31T23:54:55.813Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "

alert02 log: received msg="gossiping new entry" from node-03 and node-01 at 23:54:24 and 23:54:55 respectively.

level=debug ts=2020-03-31T23:54:24.564Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.29:9094\n"
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.582Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:55.811Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\"InstanceDown\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp: firing_alerts:4666247465654023712 > expires_at: "
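(For background on the timing in these logs: as I understand the HA design, each Alertmanager waits its position in the cluster multiplied by --cluster.peer-timeout, default 15s, before flushing a notification, and then checks the gossiped notification log to decide whether a peer has already sent it. The flag can be passed explicitly; a sketch based on the alert01 command above, with the default value written out, purely for illustration:

alert01 (sketch): /bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom01:9094 --cluster.peer-timeout=15s)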

devinodaniel commented 4 years ago

I had some luck minimizing duplicate notifications by tweaking the --cluster.pushpull-interval and --cluster.gossip-interval flags in the alertmanager startup command to values other than the defaults. I started from the defaults of 1m0s and 200ms respectively, varied them widely until I got either more or fewer notifications, then slowly narrowed it down. It was quite painstaking.

To me, it seems to be related to the latency between the alertmanagers over the wire. For instance, I have 4 alertmanagers communicating over a tunnel between NYC and CA, and sometimes it's fast, but sometimes, because of high ISP latency, their communication is slow. It would be nice to know if you have the same luck. I still get 2 to 3 duplicate notifications occasionally, but I'd rather get multiple alerts than none.
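(For anyone wanting to try the same tuning, a sketch of how those two flags fit into a startup command, reusing the alert02 command from the issue; the values shown are just the documented defaults of 200ms and 1m0s, not the tuned values, which were not shared here:

alert02 (sketch): /bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom02:9094 --cluster.peer=prom01:9094 --cluster.gossip-interval=200ms --cluster.pushpull-interval=1m0s)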

cycwll commented 4 years ago

@devinodaniel thanks for your help. My nodes are on the same LAN, so the latency between the alertmanagers over the wire is low. At present I receive duplicate notifications only occasionally (about 5% of the time). From your description it sounds like some duplication is inevitable; I will try your suggestions.

tianshimoyi commented 1 month ago

@devinodaniel Hello, I encountered the same problem. I changed the parameters to the following configuration and it still didn’t get better. Are there any other parameters that I need to pay attention to? My version is v0.24.0.