prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Alertmanager of other nodes resends same alerts again after resolve notification #2447

Open joernkleinbub opened 3 years ago

joernkleinbub commented 3 years ago

What did you do? I set up Prometheus and Alertmanager in HA on 3 Vagrant nodes on my local machine for testing purposes. I configured a single DemoAlert "sum(prometheus_build_info) < 3" with a webhook receiver and triggered the alert by killing one of the Prometheus instances.

What did you expect to see? Chronological:

  1. Prometheus is killed
  2. The alert is fired.
  3. Prometheus is up again
  4. A resolved notification is sent.

What did you see instead? Under which circumstances? Chronological (notifications and logs: see below):

  1. Prometheus is killed
  2. An alert notification is fired by Alertmanager "http://n3:9093": 2021-01-07T09:38:11.059+01:00
  3. Prometheus is up again
  4. A resolved notification by Alertmanager "http://n3:9093": 2021-01-07T09:39:00.175+01:00
  5. The same alert notification (2.) is fired again by Alertmanager "http://n2:9093": 2021-01-07T09:39:10.572+01:00
  6. The same alert notification (2.) is fired again by Alertmanager "http://n1:9093": 2021-01-07T09:39:10.627+01:00
  7. A resolve message is fired by Alertmanager "http://n1:9093": 2021-01-07T09:39:25.801+01:00

This behaviour is generally reproducible: alerts are always resent after the resolve notification.

Environment

Resolved Notification 2021-01-07T09:39:00.175+01:00 from "http://n3:9093"

{
  "method": "POST",
  "path": "/",
  "query": {},
  "headers": {
    "x-forwarded-for": "77.189.75.252",
    "x-forwarded-proto": "https",
    "x-forwarded-port": "443",
    "host": "a94434b3d427cb85c35eb4ad0be45e6d.m.pipedream.net",
    "x-amzn-trace-id": "Root=1-5ff6c8a4-15435eaa45d9ca1179d08f32",
    "content-length": "684",
    "content-type": "application/json",
    "user-agent": "Alertmanager/0.21.0"
  },
  "bodyRaw": "{\"receiver\":\"web\\\\.hook\",\"status\":\"resolved\",\"alerts\":[{\"status\":\"resolved\",\"labels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"annotations\":{\"summary\":\"This is a demo\"},\"startsAt\":\"2021-01-07T08:38:00.032813093Z\",\"endsAt\":\"2021-01-07T08:39:00.032813093Z\",\"generatorURL\":\"http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3\\u0026g0.tab=1\",\"fingerprint\":\"8a07086b74f3ba78\"}],\"groupLabels\":{\"alertname\":\"DemoAlert\"},\"commonLabels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"commonAnnotations\":{\"summary\":\"This is a demo\"},\"externalURL\":\"http://n3:9093\",\"version\":\"4\",\"groupKey\":\"{}:{alertname=\\\"DemoAlert\\\"}\",\"truncatedAlerts\":0}\n",
  "body": {
    "receiver": "web\\.hook",
    "status": "resolved",
    "alerts": [
      {
        "status": "resolved",
        "labels": {
          "alertname": "DemoAlert",
          "server": "ni3",
          "severity": "critical"
        },
        "annotations": {
          "summary": "This is a demo"
        },
        "startsAt": "2021-01-07T08:38:00.032813093Z",
        "endsAt": "2021-01-07T08:39:00.032813093Z",
        "generatorURL": "http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3&g0.tab=1",
        "fingerprint": "8a07086b74f3ba78"
      }
    ],
    "groupLabels": {
      "alertname": "DemoAlert"
    },
    "commonLabels": {
      "alertname": "DemoAlert",
      "server": "ni3",
      "severity": "critical"
    },
    "commonAnnotations": {
      "summary": "This is a demo"
    },
    "externalURL": "http://n3:9093",
    "version": "4",
    "groupKey": "{}:{alertname=\"DemoAlert\"}",
    "truncatedAlerts": 0
  }
}

Alert Notification 2021-01-07T09:39:10.572+01:00 from "http://n2:9093"

{
  "method": "POST",
  "path": "/",
  "query": {},
  "headers": {
    "x-forwarded-for": "77.189.75.252",
    "x-forwarded-proto": "https",
    "x-forwarded-port": "443",
    "host": "a94434b3d427cb85c35eb4ad0be45e6d.m.pipedream.net",
    "x-amzn-trace-id": "Root=1-5ff6c8ae-55d448e5443cb64b4d61e6f5",
    "content-length": "670",
    "content-type": "application/json",
    "user-agent": "Alertmanager/0.21.0"
  },
  "bodyRaw": "{\"receiver\":\"web\\\\.hook\",\"status\":\"firing\",\"alerts\":[{\"status\":\"firing\",\"labels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"annotations\":{\"summary\":\"This is a demo\"},\"startsAt\":\"2021-01-07T08:38:00.032813093Z\",\"endsAt\":\"0001-01-01T00:00:00Z\",\"generatorURL\":\"http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3\\u0026g0.tab=1\",\"fingerprint\":\"8a07086b74f3ba78\"}],\"groupLabels\":{\"alertname\":\"DemoAlert\"},\"commonLabels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"commonAnnotations\":{\"summary\":\"This is a demo\"},\"externalURL\":\"http://n2:9093\",\"version\":\"4\",\"groupKey\":\"{}:{alertname=\\\"DemoAlert\\\"}\",\"truncatedAlerts\":0}\n",
  "body": {
    "receiver": "web\\.hook",
    "status": "firing",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "DemoAlert",
          "server": "ni3",
          "severity": "critical"
        },
        "annotations": {
          "summary": "This is a demo"
        },
        "startsAt": "2021-01-07T08:38:00.032813093Z",
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3&g0.tab=1",
        "fingerprint": "8a07086b74f3ba78"
      }
    ],
    "groupLabels": {
      "alertname": "DemoAlert"
    },
    "commonLabels": {
      "alertname": "DemoAlert",
      "server": "ni3",
      "severity": "critical"
    },
    "commonAnnotations": {
      "summary": "This is a demo"
    },
    "externalURL": "http://n2:9093",
    "version": "4",
    "groupKey": "{}:{alertname=\"DemoAlert\"}",
    "truncatedAlerts": 0
  }
}

Alert Notification 2021-01-07T09:39:10.627+01:00 from "http://n1:9093"

{
  "method": "POST",
  "path": "/",
  "query": {},
  "headers": {
    "x-forwarded-for": "77.189.75.252",
    "x-forwarded-proto": "https",
    "x-forwarded-port": "443",
    "host": "a94434b3d427cb85c35eb4ad0be45e6d.m.pipedream.net",
    "x-amzn-trace-id": "Root=1-5ff6c8ae-738a893c0e0f9723325be3d4",
    "content-length": "670",
    "user-agent": "Alertmanager/0.21.0",
    "content-type": "application/json"
  },
  "bodyRaw": "{\"receiver\":\"web\\\\.hook\",\"status\":\"firing\",\"alerts\":[{\"status\":\"firing\",\"labels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"annotations\":{\"summary\":\"This is a demo\"},\"startsAt\":\"2021-01-07T08:38:00.032813093Z\",\"endsAt\":\"0001-01-01T00:00:00Z\",\"generatorURL\":\"http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3\\u0026g0.tab=1\",\"fingerprint\":\"8a07086b74f3ba78\"}],\"groupLabels\":{\"alertname\":\"DemoAlert\"},\"commonLabels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"commonAnnotations\":{\"summary\":\"This is a demo\"},\"externalURL\":\"http://n1:9093\",\"version\":\"4\",\"groupKey\":\"{}:{alertname=\\\"DemoAlert\\\"}\",\"truncatedAlerts\":0}\n",
  "body": {
    "receiver": "web\\.hook",
    "status": "firing",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "DemoAlert",
          "server": "ni3",
          "severity": "critical"
        },
        "annotations": {
          "summary": "This is a demo"
        },
        "startsAt": "2021-01-07T08:38:00.032813093Z",
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3&g0.tab=1",
        "fingerprint": "8a07086b74f3ba78"
      }
    ],
    "groupLabels": {
      "alertname": "DemoAlert"
    },
    "commonLabels": {
      "alertname": "DemoAlert",
      "server": "ni3",
      "severity": "critical"
    },
    "commonAnnotations": {
      "summary": "This is a demo"
    },
    "externalURL": "http://n1:9093",
    "version": "4",
    "groupKey": "{}:{alertname=\"DemoAlert\"}",
    "truncatedAlerts": 0
  }
}

Resolve Notification 2021-01-07T09:39:25.801+01:00 from "http://n1:9093"

{
  "method": "POST",
  "path": "/",
  "query": {},
  "headers": {
    "x-forwarded-for": "77.189.75.252",
    "x-forwarded-proto": "https",
    "x-forwarded-port": "443",
    "host": "a94434b3d427cb85c35eb4ad0be45e6d.m.pipedream.net",
    "x-amzn-trace-id": "Root=1-5ff6c8bd-640673fd07455c703cc94875",
    "content-length": "684",
    "content-type": "application/json",
    "user-agent": "Alertmanager/0.21.0"
  },
  "bodyRaw": "{\"receiver\":\"web\\\\.hook\",\"status\":\"resolved\",\"alerts\":[{\"status\":\"resolved\",\"labels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"annotations\":{\"summary\":\"This is a demo\"},\"startsAt\":\"2021-01-07T08:38:00.032813093Z\",\"endsAt\":\"2021-01-07T08:39:00.032813093Z\",\"generatorURL\":\"http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3\\u0026g0.tab=1\",\"fingerprint\":\"8a07086b74f3ba78\"}],\"groupLabels\":{\"alertname\":\"DemoAlert\"},\"commonLabels\":{\"alertname\":\"DemoAlert\",\"server\":\"ni3\",\"severity\":\"critical\"},\"commonAnnotations\":{\"summary\":\"This is a demo\"},\"externalURL\":\"http://n1:9093\",\"version\":\"4\",\"groupKey\":\"{}:{alertname=\\\"DemoAlert\\\"}\",\"truncatedAlerts\":0}\n",
  "body": {
    "receiver": "web\\.hook",
    "status": "resolved",
    "alerts": [
      {
        "status": "resolved",
        "labels": {
          "alertname": "DemoAlert",
          "server": "ni3",
          "severity": "critical"
        },
        "annotations": {
          "summary": "This is a demo"
        },
        "startsAt": "2021-01-07T08:38:00.032813093Z",
        "endsAt": "2021-01-07T08:39:00.032813093Z",
        "generatorURL": "http://n3:9090/graph?g0.expr=sum%28prometheus_build_info%29+%3C+3&g0.tab=1",
        "fingerprint": "8a07086b74f3ba78"
      }
    ],
    "groupLabels": {
      "alertname": "DemoAlert"
    },
    "commonLabels": {
      "alertname": "DemoAlert",
      "server": "ni3",
      "severity": "critical"
    },
    "commonAnnotations": {
      "summary": "This is a demo"
    },
    "externalURL": "http://n1:9093",
    "version": "4",
    "groupKey": "{}:{alertname=\"DemoAlert\"}",
    "truncatedAlerts": 0
  }
}
simonswine commented 3 years ago

This is something I have experienced before. I think it has to do with the ratio between the configured group_wait and group_interval. External factors such as network latency between your Alertmanagers can also play a role.

I had some good experience with setting the group_interval to at least 4 times the value of group_wait.

Let me know if that solves the immediate issue; if you still experience resends, I would suggest raising it even further. Meanwhile, I will try to understand the related code sections and see if there is more that can be done.

joernkleinbub commented 3 years ago

> I had some good experience with setting the group_interval to at least 4 times the value of group_wait.

Thanks. That solves the problem. Now, I get the alert and the resolve notification only once.

The Alertmanager config now looks like this:

...
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 40s
  repeat_interval: 1h
  receiver: 'web.hook'
...
simonpasquier commented 3 years ago

Thanks for the detailed report. Yes, short group_interval values are prone to notifications being resent after they resolve. This is because of how the clustering works: every Alertmanager instance will delay the group processing based on its position in the cluster. In practice, the first instance will start immediately, the second instance will wait 15 seconds, the third instance will wait 30 seconds, and so on.

You can see it from the n3 logs:

level=debug ts=2021-01-07T08:38:55.045Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}:{alertname=\"DemoAlert\"}" msg=flushing alerts=[DemoAlert[8a07086][active]]
level=debug ts=2021-01-07T08:39:00.038Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=DemoAlert[8a07086][resolved]
level=debug ts=2021-01-07T08:39:00.227Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"DemoAlert\\\"}\" receiver:<group_name:\"web.hook\" integration:\"webhook\" > timestamp:<seconds:1610008740 nanos:188592422 > resolved_alerts:3742573726221535510 > expires_at:<seconds:1610440740 nanos:188592422 > "
level=debug ts=2021-01-07T08:39:10.618Z caller=notify.go:685 component=dispatcher receiver=web.hook integration=webhook[0] msg="Notify success" attempts=1

The 15-second value comes from the --cluster.peer-timeout argument. You could tweak it, but at the risk of getting even more duplicated notifications if your receiver or the replication of notification logs between instances lags behind.
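
For illustration, the delay described here works out to the instance's position in the cluster multiplied by the peer timeout. Below is a minimal Go sketch of that relationship (hypothetical names, not the actual Alertmanager code):

package main

import (
    "fmt"
    "time"
)

// peerWait illustrates the behaviour described above: each Alertmanager
// instance postpones flushing a notification group by its position in the
// cluster multiplied by the --cluster.peer-timeout value (15s by default),
// so peer 0 notifies immediately, peer 1 after 15s, peer 2 after 30s, etc.
func peerWait(position int, peerTimeout time.Duration) time.Duration {
    return time.Duration(position) * peerTimeout
}

func main() {
    peerTimeout := 15 * time.Second
    for pos := 0; pos < 3; pos++ {
        fmt.Printf("peer %d waits %s before sending notifications\n", pos, peerWait(pos, peerTimeout))
    }
}

With the default 15-second timeout, the fastest peer notifies at once while the others flush 15 and 30 seconds later, which is why a slower peer's "firing" notification can land after the fastest peer's "resolved" one, as in the report above.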

I'll keep the issue open; it's something that Alertmanager should at least be able to surface to users.

simonswine commented 3 years ago

@simonpasquier that makes a lot of sense. I spent quite a bit of time understanding what's going on, and this roughly matches the understanding I arrived at.

I was thinking one possible way of preventing these "peer-timeout delayed notifications" from generating duplicates would be to look at more nflog.Entries instead of only the last one per groupKey (https://github.com/prometheus/alertmanager/blob/e6824a31100bd59308ad769a71371274455c0914/nflog/nflog.go#L450), as part of the DedupStage notification step.

I think if we looked at the latest nflog.Entries per alertHash per groupKey, we should be able to tell whether an alert has been resolved in the meantime.
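
Roughly, the check could look something like the sketch below (hypothetical types and names, not the actual nflog protobuf messages or DedupStage code):

package dedup

import "time"

// entry sketches the relevant parts of one notification-log record for a
// group key: which alert hashes were notified as firing or resolved, and when.
type entry struct {
    firingAlerts   map[uint64]bool
    resolvedAlerts map[uint64]bool
    timestamp      time.Time
}

// resolvedMeanwhile reports whether the alert identified by alertHash was
// already notified as resolved by some peer after this peer's flush started.
// A dedup stage that kept entries per alert hash per group key, as proposed
// above, could use a check like this to drop a stale "firing" notification.
func resolvedMeanwhile(entries []entry, alertHash uint64, flushStart time.Time) bool {
    for _, e := range entries {
        if e.resolvedAlerts[alertHash] && e.timestamp.After(flushStart) {
            return true
        }
    }
    return false
}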

Do you think this could be a sensible approach?

simonpasquier commented 3 years ago

I would need to put more thought into this, but the non-negotiable principle is that the recipient is guaranteed to eventually get the final state of the notification. In that respect, the current trade-off is indeed that the remote system can sometimes receive conflicting notifications.