prometheus / alertmanager


PagerDuty alerts are stuck if I mute the alert after it's firing but before it's resolved #3664

Open freak12techno opened 8 months ago

freak12techno commented 8 months ago

What did you do?

  1. Have an Alertmanager setup that has PagerDuty as a receiver
  2. Do something that triggers an alert which is routed to the PagerDuty receiver
  3. Mute the alert (a config sketch follows this list)
  4. Fix the underlying problem so the alert gets resolved
  5. Alertmanager doesn't send a notification for the resolved alert because it's muted
  6. PagerDuty doesn't know the alert is fixed, so the incident stays open unless I go to the PagerDuty app and resolve it there manually
  7. The mute expires, and I end up in a situation where the alert is no longer firing, but PagerDuty still thinks it is and keeps bugging me until I resolve it via the PagerDuty app
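
For reference, the routing in steps 1-3 could look roughly like the sketch below. This is only an illustration, not my actual configuration: the 'pagerduty' receiver name, the matcher and the 'maintenance' time interval are made up, receiver definitions are omitted for brevity, and in practice the mute in step 3 can just as well be an ad-hoc silence created in the UI or with amtool instead of a configured mute time interval.

route:
  receiver: 'telegram'
  routes:
    - receiver: 'pagerduty'
      matchers:
        - severity="critical"
      # while this time interval (or a silence on the alert) is active,
      # no notifications for this route reach PagerDuty, resolved ones included
      mute_time_intervals:
        - maintenance

time_intervals:
  - name: maintenance
    time_intervals:
      - times:
          - start_time: '13:00'
            end_time: '15:00'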

The procedure described above is something I do regularly: something breaks, I mute the alert so it doesn't keep notifying me while I'm fixing it (I already know it's broken), the problem gets fixed before the mute expires, and the alert ends up stuck in PagerDuty. For Telegram this is fine, since nothing bad happens if I never receive the resolve notification, but for PagerDuty (and, I suspect, similar services such as Opsgenie, which I can't verify since I don't use them) it's annoying, because they have their own alerting rules on top, like pinging a person every hour or sending a push notification/SMS/call for as long as the alert is firing.

What did you expect to see? The alert is resolved in PagerDuty.

What did you see instead? Under which circumstances? The alert is not resolved and stays stuck in PagerDuty, even though the underlying problem has been fixed.

Environment

alertmanager, version 0.26.0 (branch: HEAD, revision: d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)
  build user:       root@df8d7debeef4
  build date:       20230824-11:11:58
  go version:       go1.20.7
  platform:         linux/amd64
  tags:             netgo

Pretty sure it's not relevant, but:

prometheus, version 2.48.1 (branch: HEAD, revision: 63894216648f0d6be310c9d16fb48293c45c9310)
  build user:       root@71f108ff5632
  build date:       20231208-23:33:22
  go version:       go1.21.5
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

* Alertmanager configuration file:

route:
  receiver: 'telegram'
  group_wait: 10s
  group_by: ['host', 'hosting', 'datacenter', 'alertname']
  repeat_interval: 1h
  routes:

receivers:

templates:


* Prometheus configuration file:

insert configuration here (if relevant to the issue)


* Logs:

insert Prometheus and Alertmanager logs relevant to the issue here

TheMeier commented 6 months ago

In the provided config there is no send_resolved: true for the pagerduty receiver.
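
Setting it explicitly would look something like the following minimal sketch (the receiver name and the routing_key placeholder are made up):

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<integration-key>'
        # explicitly enable resolved notifications so PagerDuty incidents auto-resolve
        send_resolved: true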

freak12techno commented 6 months ago

@TheMeier yes, but it's true by default, according to https://prometheus.io/docs/alerting/latest/configuration/#pagerduty_config. Additionally, most of my alerts resolve just fine unless they are muted.

grobinson-grafana commented 5 months ago

This is an unfortunate side effect of how the Alertmanager works. It's a frequent complaint, and fixing it requires a lot of effort.

The issue is that the Alertmanager will delete the alert from the group if:

  1. The alert is resolved.
  2. A notification for the group was just sent (even if the alert was not in that notification because it was muted).

If this happens while a mute time is still active, the resolved alert is deleted. That means that when the mute time ends, the Alertmanager doesn't even know the alert ever existed; it could have been deleted 6 hours earlier, for example.

A possible fix is to read the nflog (notification log) when checking whether an alert can be deleted. However, that doesn't work across an Alertmanager restart, because alerts are kept in memory and not persisted to disk.