prometheus / alertmanager


Alertmanager setup as a gossip cluster sends duplicate alert notifications during retries #4108

Open gmhegde86 opened 2 weeks ago

gmhegde86 commented 2 weeks ago

**What did you do?**

Configured two Alertmanagers (independent services) as an HA gossip cluster, with a webhook receiver for notifications. The webhook had an issue for a brief period: it returned 5xx (retry-able) status codes to the Alertmanagers for fired alerts, and both Alertmanagers in the HA cluster kept retrying the notifications.

**What did you expect to see?**

Once the webhook issue was fixed and it was operational again, a single notification for all the alerts that were in the firing state.

**What did you see instead? Under which circumstances?**

Each Alertmanager in the HA cluster sent one notification, so duplicate alerts were received.

**Environment**

Kubernetes
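For context, a minimal sketch of the kind of setup described above: two Alertmanager instances, each started with `--cluster.listen-address` and a `--cluster.peer` flag pointing at the other instance, sharing an `alertmanager.yml` along these lines. The receiver name and grouping labels are inferred from the logs below; the webhook URL is a placeholder, not the reporter's actual endpoint.

```yaml
route:
  receiver: dbaas-alerting-webhook
  # Grouping labels inferred from the aggrGroup shown in the logs below.
  group_by: ['alert_id', 'cluster_id']

receivers:
  - name: dbaas-alerting-webhook
    webhook_configs:
      # Placeholder URL; the real endpoint is internal to the reporter's environment.
      - url: https://webhook.example.internal/monitor/alerts
```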


* Prometheus configuration file:
NA

* Logs:

Alertmanager 1 of HA:

```
ts=2024-11-04T15:44:41.278Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:44:41.615Z caller=notify.go:848 level=warn component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://aclp-alerting.iad3.us.prod.linode.com/monitor/alerts: {\"errors\": [{\"reason\": \"Service unavailable [1.10]\"}]}"
ts=2024-11-04T15:45:24.610Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=10 duration=1.182929464s
```

Alertmanager 2 of HA:

```
ts=2024-11-04T15:44:41.281Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:45:10.364Z caller=cluster.go:341 level=debug component=cluster memberlist="2024/11/04 15:45:10 [DEBUG] memberlist: Stream connection from=10.2.0.43:58784\n"
ts=2024-11-04T15:45:24.867Z caller=nflog.go:533 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alert_id=\\"103\\", cluster_id=\\"188020\\"}\" receiver:<group_name:\"dbaas-alerting-webhook\" integration:\"webhook\" > timestamp: firing_alerts:11267836725140231328 > expires_at: "
ts=2024-11-04T15:45:28.172Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=12 duration=956.231897ms
```
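To reproduce the retry window described above, a stand-in webhook that fails and then recovers is enough. The sketch below is an assumed reproduction helper (not part of the original report): it returns retry-able 500s for the first minute after startup and then accepts notifications, which makes duplicate deliveries from the two Alertmanagers easy to spot in its log.

```go
// flakywebhook: hypothetical stand-in receiver for reproducing the scenario.
// It rejects Alertmanager notifications with HTTP 500 for the first minute,
// then accepts them with HTTP 200.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	start := time.Now()

	http.HandleFunc("/monitor/alerts", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		defer r.Body.Close()

		// Simulate the outage window: retry-able 5xx responses at first.
		if time.Since(start) < time.Minute {
			log.Printf("rejecting notification (%d bytes)", len(body))
			http.Error(w, `{"errors": [{"reason": "Service unavailable"}]}`, http.StatusInternalServerError)
			return
		}

		// After "recovery", log each accepted notification; duplicate
		// deliveries from the two Alertmanagers show up here.
		log.Printf("accepted notification from %s (%d bytes)", r.RemoteAddr, len(body))
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":9099", nil))
}
```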