What did you do?
Configured two Alertmanagers (independent services) as an HA gossip cluster, with a webhook receiver for notifications. The webhook had an issue for a brief period: it returned 5xx (retryable) errors to the Alertmanagers for fired alerts, and both Alertmanagers in the HA cluster kept retrying the notifications.
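For reproduction, the webhook outage can be simulated with a receiver that returns a retryable 5xx for a while and then recovers. A minimal sketch (a hypothetical stand-in, not the actual service; the path and timing are illustrative):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	start := time.Now()
	http.HandleFunc("/monitor/alerts", func(w http.ResponseWriter, r *http.Request) {
		// Return a retryable 5xx for the first two minutes, then accept.
		if time.Since(start) < 2*time.Minute {
			http.Error(w, `{"errors": [{"reason": "Service unavailable"}]}`, http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```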
What did you expect to see?
When the webhook issue was fixed and it was operational again, the expectation was to receive a single notification covering all the alerts that were in the firing state.
What did you see instead? Under which circumstances?
Each Alertmanager in the HA cluster sent its own notification, so duplicate alerts were received.
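For context on why a single notification was expected: in HA mode each peer delays its flush by its cluster position times the peer timeout and consults the gossiped notification log before sending, so a peer that sees a fresh entry from another peer should skip the send. A simplified model of that dedup logic (illustrative only, not Alertmanager's actual code; all names are made up):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nflog stands in for the gossiped notification log: group key -> time of
// the last successful notification observed from any peer.
type nflog struct {
	mu      sync.Mutex
	entries map[string]time.Time
}

func (l *nflog) seenSince(key string, since time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	t, ok := l.entries[key]
	return ok && t.After(since)
}

func (l *nflog) record(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries[key] = time.Now()
}

// flush models one peer handling a single alert group: wait according to
// cluster position, then send only if no other peer has notified already.
func flush(log *nflog, key string, position int, peerTimeout time.Duration, pipelineStart time.Time) {
	time.Sleep(time.Duration(position) * peerTimeout) // stagger peers by position
	if log.seenSince(key, pipelineStart) {
		fmt.Printf("peer %d: already notified by another peer, skipping\n", position)
		return
	}
	fmt.Printf("peer %d: sending notification\n", position)
	log.record(key) // in real Alertmanager this entry is gossiped to the peers
}

func main() {
	log := &nflog{entries: map[string]time.Time{}}
	key := `{}:{alert_id="103", cluster_id="188020"}`
	start := time.Now()

	// Two peers, staggered by 100ms here (the real default peer timeout is 15s).
	var wg sync.WaitGroup
	for pos := 0; pos < 2; pos++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			flush(log, key, p, 100*time.Millisecond, start)
		}(pos)
	}
	wg.Wait()
}
```

The logs below suggest the dedup check raced with the retries: Alertmanager 2 received the gossiped nflog entry at 15:45:24.867 but its own retry still reported "Notify success" at 15:45:28.172, presumably because the notification log is only consulted before the retry stage, not between retry attempts.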
Environment
Kubernetes
System information:
NA
Alertmanager version:
0.27.0
Prometheus version:
NA
Alertmanager configuration file:
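The configuration file itself was not captured; below is a minimal sketch of the setup described above, with the receiver name, webhook URL, and group-by labels taken from the logs that follow (timing values and everything else are assumptions):

```yaml
route:
  receiver: dbaas-alerting-webhook
  group_by: ["alert_id", "cluster_id"]
  group_wait: 30s       # assumed
  group_interval: 5m    # assumed
  repeat_interval: 4h   # assumed

receivers:
  - name: dbaas-alerting-webhook
    webhook_configs:
      - url: https://aclp-alerting.iad3.us.prod.linode.com/monitor/alerts
```

Each replica would be started with clustering flags along these lines (the peer address is a placeholder):

```
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=<other-alertmanager>:9094
```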
Alertmanager logs:

Alertmanager 1 of HA:

ts=2024-11-04T15:44:41.278Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:44:41.615Z caller=notify.go:848 level=warn component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://aclp-alerting.iad3.us.prod.linode.com/monitor/alerts: {\"errors\": [{\"reason\": \"Service unavailable [1.10]\"}]}"
ts=2024-11-04T15:45:24.610Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=10 duration=1.182929464s
Alertmanager 2 of HA:

ts=2024-11-04T15:44:41.281Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:45:10.364Z caller=cluster.go:341 level=debug component=cluster memberlist="2024/11/04 15:45:10 [DEBUG] memberlist: Stream connection from=10.2.0.43:58784\n"
ts=2024-11-04T15:45:24.867Z caller=nflog.go:533 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alert_id=\\"103\\", cluster_id=\\"188020\\"}\" receiver:<group_name:\"dbaas-alerting-webhook\" integration:\"webhook\" > timestamp: firing_alerts:11267836725140231328 > expires_at: "
ts=2024-11-04T15:45:28.172Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=12 duration=956.231897ms