prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Slack Notifications Include Stale Grouped Alerts #1986

Open jnadler opened 5 years ago

jnadler commented 5 years ago

What did you do? Running 11 node exporters, 3 Prometheus instances scraping all 11, and 1 Alertmanager, all locally in Docker using the latest published images.

Trigger a grouped alert by taking down 7 node exporters, wait for the Slack alert to arrive, resolve some of the members of the group.

What did you expect to see? After group_interval has passed, another Slack alert with the updated, smaller set of grouped alerts.

What did you see instead? Under which circumstances? After group_interval, another Slack alert including all grouped alerts, even the resolved ones. The AlertManager UI shows the correct (smaller, now that some are resolved) set of grouped alerts.

This behavior is easily reproducible but not 100% consistent - possibly racy. The Slack alert accumulates group members when new alerts are added to the group (they are added reliably) but alerts rarely leave the group (the AM UI is always correct - just the Slack alert retains the resolved alerts).

While building a simple repro environment for this issue I think I did occasionally see an alert removed from the Slack alert list, but it's certainly more common for it to be retained in the Slack message.

Environment

alerts.yml

groups:
  - name: example
    rules:
      - alert: NodeIsDown
        expr: up == 0
        for: 2m
        labels:
          severity: debug
          team: eng-observability
        annotations:
          summary: node {{ $labels.instance }}

simonpasquier commented 5 years ago

The notification data includes both firing and resolved alerts. If you want the Slack message to only display the firing ones, you could do: {{ range .Alerts.Firing }}...{{ end }}
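
A minimal sketch of how that could look in an Alertmanager receiver. The receiver name, channel, and message layout are assumptions for illustration and are not taken from the reporter's configuration; the `summary` annotation is the one defined in alerts.yml above.

```yaml
receivers:
  - name: slack-eng-observability        # hypothetical receiver name
    slack_configs:
      - channel: '#alerts'               # hypothetical channel
        # api_url omitted; assumes slack_api_url is set in the global section
        send_resolved: true
        # Count only the alerts that are still firing in this group.
        title: '{{ .CommonLabels.alertname }}: {{ .Alerts.Firing | len }} firing'
        # List only firing alerts; resolved members of the group are skipped.
        text: |-
          {{ range .Alerts.Firing }}- {{ .Annotations.summary }}
          {{ end }}
```

Resolved alerts could be listed separately with `{{ range .Alerts.Resolved }}...{{ end }}` if the message should show both sets.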

jnadler commented 5 years ago

Wow, thanks! I studied the docs before filing this issue and couldn't find this. Might it be helpful if it were doc'd here? https://prometheus.io/docs/alerting/notifications/

simonpasquier commented 5 years ago

You're right. The source is at https://github.com/prometheus/docs/blob/master/content/docs/alerting/notifications.md

jnadler commented 5 years ago

How's this? https://github.com/prometheus/docs/pull/1411

jnadler commented 5 years ago

I've confirmed (with 3 HA AlertManagers) that intermittently some alerts do disappear from the .Alerts collection, and thus from Slack alerts based on it.
[Screenshot: Screen Shot 2019-08-02 at 1 58 33 PM]

simonpasquier commented 5 years ago

I suppose that the alert for docker.for.mac.localhost:9111 got resolved? In this case, AlertManager sees that the alert group has changed and it will trigger a new notification at the next group evaluation.
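
For context, the timing of that next group evaluation is set on the route. The fragment below is only a sketch: the receiver name, grouping label, and durations are assumed values, not the ones used in this setup.

```yaml
route:
  receiver: slack-eng-observability   # hypothetical receiver name
  group_by: ['alertname']             # assumed grouping label
  group_wait: 30s        # wait before the first notification for a new group
  group_interval: 5m     # wait before notifying about changes to an existing group
  repeat_interval: 4h    # wait before re-sending a notification that has not changed
```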

jnadler commented 5 years ago

I'm not getting that result consistently. The substantial majority of the time, when the alert for 9111 gets resolved, at the next group evaluation that alert is still in the .Alerts list.

Intermittently I'm able to trigger behavior where the alert for 9111 is resolved and on the next grouping it's removed from the .Alerts list, as in the example above. I don't have a strong feeling as to what the behavior should be, but it should be reliable and consistent.

simonpasquier commented 5 years ago

Ah I thought that your last screenshot displayed only the firing alerts (eg .Alerts.Firing) but IIUC it still uses .Alerts.

jnadler commented 5 years ago

That's correct, still using .Alerts. I can probably document how to reproduce this if that's helpful - it's just a bunch of scripts that start docker containers. Not a super high priority now that I know about .Alerts.Firing.