prometheus / alertmanager


Alertmanager setup as a gossip cluster sends duplicate alert notifications during retries #4108

Open gmhegde86 opened 2 weeks ago

gmhegde86 commented 2 weeks ago

**What did you do?**

Configured two Alertmanagers (independent services) as an HA gossip cluster, with a webhook receiver for notifications. The webhook had an issue for a brief period: it returned 5xx (retry-able) status codes to the Alertmanagers for fired alerts, and both Alertmanagers in the HA cluster kept retrying the notifications.

**What did you expect to see?**

Once the webhook issue was fixed and it was operational again, a single notification for all the alerts that were in the firing state.

**What did you see instead? Under which circumstances?**

Each Alertmanager in the HA cluster sent one notification, so duplicate alerts were received.

**Environment**

Kubernetes
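For context, a minimal sketch of the kind of setup described above: two Alertmanager instances, each started with `--cluster.listen-address` and a `--cluster.peer` flag pointing at the other instance, sharing an `alertmanager.yml` along these lines. The receiver name and grouping labels are inferred from the logs below; the webhook URL is a placeholder, not the reporter's actual endpoint.

```yaml
route:
  receiver: dbaas-alerting-webhook
  # Grouping labels inferred from the aggrGroup shown in the logs below.
  group_by: ['alert_id', 'cluster_id']

receivers:
  - name: dbaas-alerting-webhook
    webhook_configs:
      # Placeholder URL; the real endpoint is internal to the reporter's environment.
      - url: https://webhook.example.internal/monitor/alerts
```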


* Prometheus configuration file:
NA

* Logs:

Alertmanager 1 of HA:

```
ts=2024-11-04T15:44:41.278Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:44:41.615Z caller=notify.go:848 level=warn component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://aclp-alerting.iad3.us.prod.linode.com/monitor/alerts: {\"errors\": [{\"reason\": \"Service unavailable [1.10]\"}]}"
ts=2024-11-04T15:45:24.610Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=10 duration=1.182929464s
```

Alertmanager 2 of HA:

```
ts=2024-11-04T15:44:41.281Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert="High CPU Usage - Plan Dedicated[8c67d04][active]"
ts=2024-11-04T15:45:10.364Z caller=cluster.go:341 level=debug component=cluster memberlist="2024/11/04 15:45:10 [DEBUG] memberlist: Stream connection from=10.2.0.43:58784\n"
ts=2024-11-04T15:45:24.867Z caller=nflog.go:533 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alert_id=\\"103\\", cluster_id=\\"188020\\"}\" receiver:<group_name:\"dbaas-alerting-webhook\" integration:\"webhook\" > timestamp: firing_alerts:11267836725140231328 > expires_at: "
ts=2024-11-04T15:45:28.172Z caller=notify.go:860 level=info component=dispatcher receiver=dbaas-alerting-webhook integration=webhook[0] aggrGroup="{}:{alert_id=\"103\", cluster_id=\"188020\"}" msg="Notify success" attempts=12 duration=956.231897ms
```
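To reproduce the retry window described above, a stand-in webhook that fails and then recovers is enough. The sketch below is an assumed reproduction helper (not part of the original report): it returns retry-able 500s for the first minute after startup and then accepts notifications, which makes duplicate deliveries from the two Alertmanagers easy to spot in its log.

```go
// flakywebhook: hypothetical stand-in receiver for reproducing the scenario.
// It rejects Alertmanager notifications with HTTP 500 for the first minute,
// then accepts them with HTTP 200.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	start := time.Now()

	http.HandleFunc("/monitor/alerts", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		defer r.Body.Close()

		// Simulate the outage window: retry-able 5xx responses at first.
		if time.Since(start) < time.Minute {
			log.Printf("rejecting notification (%d bytes)", len(body))
			http.Error(w, `{"errors": [{"reason": "Service unavailable"}]}`, http.StatusInternalServerError)
			return
		}

		// After "recovery", log each accepted notification; duplicate
		// deliveries from the two Alertmanagers show up here.
		log.Printf("accepted notification from %s (%d bytes)", r.RemoteAddr, len(body))
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":9099", nil))
}
```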