freak12techno opened 8 months ago
In the provided config there is no `send_resolved: true` for the pagerduty receiver.
@TheMeier yes, but it defaults to `true`, according to https://prometheus.io/docs/alerting/latest/configuration/#pagerduty_config. Additionally, most of my alerts resolve themselves fine unless they are muted.
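For reference, a minimal sketch of the receiver with that default spelled out explicitly (the `routing_key` value is a placeholder, not from the original config):

```yaml
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<integration-key>'  # placeholder
        send_resolved: true               # already the default for pagerduty_configs
```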
This is an unfortunate side effect of how the Alertmanager works; it is a frequent complaint, and fixing it requires a lot of effort.
The issue is that the Alertmanager will delete the alert from the group if:

- the alert is resolved, and
- the group's notification pipeline has completed successfully for it (a flush during a mute counts as successful, since nothing needed to be sent).

If this happens while a mute time is still active, the resolved alert is deleted. That means when the mute time ends, the Alertmanager doesn't even know this alert existed; it could have been deleted, for example, 6 hours ago.
A possible fix for this is to read the nflog when checking whether an alert can be deleted. However, that does not survive an Alertmanager restart, because alerts are kept in memory and not persisted to disk.
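For context, assuming the mute here is a `mute_time_intervals` entry on a route, the configuration would look roughly like this (the `maintenance` name and times are made-up placeholders):

```yaml
time_intervals:            # top-level key in Alertmanager >= 0.24 (was mute_time_intervals before)
  - name: maintenance
    time_intervals:
      - times:
          - start_time: '22:00'
            end_time: '06:00'

route:
  receiver: 'pagerduty'
  mute_time_intervals: ['maintenance']  # notifications are suppressed during this window
```

While the window is active, alerts are still evaluated and grouped; only the notifications are suppressed, which is why a resolve that happens inside the window is never delivered.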
What did you do?
The procedure I describe above is something I do constantly: something breaks, I mute it so it won't annoy me with notifications (I know it's broken and am fixing it), it gets fixed before the mute expires, and the alert is stuck in PagerDuty. For Telegram this seems okay, as nothing bad happens if I don't receive a resolve notification, but for PagerDuty (and I think for other similar services like OpsGenie, though I cannot verify that as I do not use it) it can be a real problem, as they may have their own alerting rules, like paging a person every hour or sending a push notification/SMS/call while the alert is firing.
What did you expect to see? The alert is resolved in PagerDuty.
What did you see instead? Under which circumstances? The alert is not resolved and is stuck in PagerDuty, even though the underlying problem is fixed.
Environment
System information: Linux 5.15.0-91-generic x86_64
Alertmanager version:
Pretty sure it's not relevant, but:
```yaml
route:
  receiver: 'telegram'
  group_wait: 10s
  group_by: ['host', 'hosting', 'datacenter', 'alertname']
  repeat_interval: 1h
  routes:
    - receiver: 'telegram'
      match_re:
        severity: critical|warning
      continue: true
    - receiver: 'pagerduty'
      group_by: ['...']
      match_re:
        severity: critical
      continue: true

receivers:
  - name: 'telegram'
    telegram_configs:
  - name: 'pagerduty'
    pagerduty_configs:

templates:
```