Open joernkleinbub opened 3 years ago
This is something that I have experienced before. I think it has to do with the ratio between the configured group_wait and group_interval. External factors like network latency between your Alertmanagers can also play a role. I have had good experience with setting group_interval to at least 4 times the value of group_wait.
Let me know if that solves the immediate issue; if you still experience resends, I would suggest raising it even further. Meanwhile I will try to understand the related code sections and see if there is more that can be done.
"I had some good experience with setting the group_interval to at least 4 times the value of group_wait."
Thanks. That solves the problem. Now, I get the alert and the resolve notification only once.
The Alertmanager config now looks like this:
...
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 40s
  repeat_interval: 1h
  receiver: 'web.hook'
...
Thanks for the detailed report. Yes, short group_interval values are prone to notifications being resent after they resolve. This is because of how the clustering works: every Alertmanager instance delays the group processing based on its position in the cluster. In practice, the first instance will start immediately, the second instance will wait 1 x 15 seconds, the third instance will wait 2 x 15 seconds, and so on.
You can see it from the n3 logs:
level=debug ts=2021-01-07T08:38:55.045Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertname=\"DemoAlert\"}" msg=flushing alerts=[DemoAlert[8a07086][active]]
level=debug ts=2021-01-07T08:39:00.038Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=DemoAlert[8a07086][resolved]
level=debug ts=2021-01-07T08:39:00.227Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"DemoAlert\\\"}\" receiver:<group_name:\"web.hook\" integration:\"webhook\" > timestamp:<seconds:1610008740 nanos:188592422 > resolved_alerts:3742573726221535510 > expires_at:<seconds:1610440740 nanos:188592422 > "
level=debug ts=2021-01-07T08:39:10.618Z caller=notify.go:685 component=dispatcher receiver=web.hook integration=webhook[0] msg="Notify success" attempts=1
08:38:55.045: Alertmanager flushes the aggregation group while the alert is still active.
08:39:00.038: it receives the resolved alert.
08:39:00.227: it receives the notification log entry from the other Alertmanager instance that already sent the resolved notification.
08:39:10.618: it sends the firing notification that was initiated at 08:38:55.045.
The 15 seconds value comes from the --cluster.peer-timeout argument. You could tweak it, but at the risk of getting even more duplicated notifications if your receiver or the replication of notification logs between instances lags behind.
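For reference, the staggering described above boils down to a simple calculation. The following is only an illustrative Go sketch (not a copy of the Alertmanager source), assuming the delay is the peer's position in the cluster multiplied by --cluster.peer-timeout:

package main

import (
	"fmt"
	"time"
)

// clusterWait sketches how the per-instance flush delay can be derived
// from the peer's position in the cluster and --cluster.peer-timeout.
func clusterWait(position int, peerTimeout time.Duration) time.Duration {
	return time.Duration(position) * peerTimeout
}

func main() {
	// With the default 15s peer timeout: peer 0 waits 0s, peer 1 waits 15s,
	// peer 2 waits 30s, matching the staggering described above.
	for pos := 0; pos < 3; pos++ {
		fmt.Printf("peer %d waits %s before notifying\n", pos, clusterWait(pos, 15*time.Second))
	}
}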
I'll keep the issue open; it's something that Alertmanager should at least be able to surface to users.
@simonpasquier that makes a lot of sense. I spent quite a bit of time understanding what's going on, and this roughly matches the understanding I arrived at.
I was thinking that one possible way of preventing these "peer-timeout delayed notifications" from generating duplicates would be to look at more nflog.Entries instead of only the last one per groupKey (https://github.com/prometheus/alertmanager/blob/e6824a31100bd59308ad769a71371274455c0914/nflog/nflog.go#L450), as part of the DedupStage notification step.
I think if we considered the latest nflog.Entries per alertHash per groupKey, we should be able to tell whether an alert has been resolved in the meantime; a rough sketch of the idea follows below.
Do you think this could be a sensible approach?
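To make the idea a bit more concrete, here is a rough, illustrative Go sketch; the entry type and the resolvedSince helper are made up for the example, and only the FiringAlerts/ResolvedAlerts shape mirrors the notification-log entries visible in the gossip log above:

package main

import "fmt"

// entry is a stand-in for the relevant parts of a notification-log entry
// (the real nflog protobuf Entry carries FiringAlerts/ResolvedAlerts as
// uint64 hashes, as seen in the "gossiping new entry" log line above).
type entry struct {
	FiringAlerts   []uint64
	ResolvedAlerts []uint64
}

// resolvedSince reports whether an alert hash shows up as resolved in any
// notification-log entry newer than the flush that is about to be sent.
// A DedupStage along these lines would look at the latest entries per
// alertHash per groupKey instead of only the single most recent entry.
func resolvedSince(alertHash uint64, newerEntries []entry) bool {
	for _, e := range newerEntries {
		for _, h := range e.ResolvedAlerts {
			if h == alertHash {
				return true
			}
		}
	}
	return false
}

func main() {
	// A peer already logged this alert hash as resolved...
	newer := []entry{{ResolvedAlerts: []uint64{0x8a07086}}}
	// ...so re-sending it as firing from the delayed instance could be skipped.
	fmt.Println(resolvedSince(0x8a07086, newer)) // true
}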
I would need to put more thought into this, but the underlying principle is that the recipient is guaranteed to get the eventual state of the notification. In that respect, the current trade-off is indeed that the remote system can sometimes receive conflicting notifications.
What did you do? I set up Prometheus and Alertmanager in HA on 3 Vagrant nodes on my local machine for testing purposes. I configured a single DemoAlert "sum(prometheus_build_info) < 3" with a webhook receiver and triggered the alert by killing one of the Prometheus instances.
What did you expect to see? Chronological:
What did you see instead? Under which circumstances? Chronological (for notifications and logs, please see below):
This behaviour is generally reproducible: alert notifications are always resent after the resolve notification.
Environment
System information: Linux 4.9.0-12-amd64 x86_64 Vagrant/Virtualbox nodes
Alertmanager version: vagrant@n3:~$ alertmanager-0.21.0.linux-amd64/alertmanager --version alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d) build user: root@dee35927357f build date: 20200617-08:54:02 go version: go1.14.4
Prometheus version: vagrant@n3:~$ prometheus-2.24.0-rc.0.linux-amd64/prometheus --version prometheus, version 2.24.0-rc.0 (branch: HEAD, revision: 0cb133aaa5c3bcb0fcbfb0b34c256fd3483a9bfd) build user: root@fdb12576957f build date: 20201230-15:28:10 go version: go1.15.6 platform: linux/amd64
Alertmanager configuration file: Start example:
./alertmanager-0.21.0.linux-amd64/alertmanager --cluster.advertise-address=172.20.20.10:9094 --log.level=debug --config.file=./alertmanager-0.21.0.linux-amd64/alertmanager.yml --cluster.peer=172.20.20.11:9094 --cluster.peer=172.20.20.12:9094 &>> alertmanager.log &
Prometheus configuration file: Start example:
./prometheus-2.24.0-rc.0.linux-amd64/prometheus --config.file=./prometheus-2.24.0-rc.0.linux-amd64/prometheus.yml --web.enable-lifecycle &
rule file:
Logs:
node '172.20.20.10:9090' = n1
node '172.20.20.11:9090' = n2
Alertmanager node '172.20.20.12:9090' = n3
WebHook events from Pipedream:
Alert Notification 2021-01-07T09:38:11.059+01:00 from "http://n3:9093"
Resolved Notification 2021-01-07T09:39:00.175+01:00 from "http://n3:9093"
Alert Notification 2021-01-07T09:39:10.572+01:00 from "http://n2:9093"
Alert Notification 2021-01-07T09:39:10.627+01:00 from "http://n1:9093"
Resolve Notification 2021-01-07T09:39:25.801+01:00 from "http://n1:9093"