prometheus / alertmanager


Send resolved notification for silenced alerts #226

Open fabxc opened 8 years ago

fabxc commented 8 years ago

Alerts should be marked as resolved when they are silenced, so that a corresponding resolved notification is sent. This removes the need to manually resolve them in PagerDuty and friends.

We should probably provide information that tells whether it was an actual resolve or a resolve-via-silence.

This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.
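
For illustration, here is a rough Go sketch of a subset of the per-alert fields Alertmanager's webhook receiver already sends, plus one hypothetical `resolveReason` field of the kind hinted at above for telling an actual resolve apart from a resolve-via-silence. The extra field is an assumption for discussion, not an existing part of the payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// WebhookAlert mirrors a subset of the per-alert fields the webhook
// receiver sends today (status, labels, annotations, timestamps).
type WebhookAlert struct {
	Status      string            `json:"status"` // "firing" or "resolved"
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`

	// Hypothetical addition, not part of Alertmanager today: a marker
	// distinguishing an actual resolve from a resolve-via-silence.
	ResolveReason string `json:"resolveReason,omitempty"` // e.g. "expired" or "silenced"
}

func main() {
	example := WebhookAlert{
		Status:        "resolved",
		Labels:        map[string]string{"alertname": "HighErrorRate"},
		Annotations:   map[string]string{"summary": "error rate back to normal"},
		StartsAt:      time.Now().Add(-time.Hour),
		EndsAt:        time.Now(),
		ResolveReason: "silenced", // the hypothetical marker in action
	}
	out, _ := json.MarshalIndent(example, "", "  ")
	fmt.Println(string(out))
}
```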

PMDubuc commented 3 years ago

What if it were changed so that a silence can optionally expire when the issue is resolved, instead of after a set time? In that case the resolved notification would be sent; otherwise it wouldn't be sent until the silence expires on its own (or not at all when the silence expires on its own, if that's how it works now). Would that cover each scenario in a consistent way that also avoids confusion?

aranair commented 3 years ago

That's a general problem with treating resolved notifications as meaning the alert is resolved, which is already incorrect today due to inhibitions and how group_interval works. If you're trying to get a 100% complete log of firing alerts (as distinct from notifications), the webhook is not a good way to do that.

Do you have any suggestions as to how that (a complete log of firing alerts, not notifications) might be achieved? Silences/inhibitions can be disabled altogether, but then I think there would be no real way to ignore certain alerts (as opposed to notifications), except maybe to build that logic downstream.

isavcic commented 3 years ago

Silence should be tied to the current state of the alert and not to the alert, IMO.

In several alerting platforms, "ack" means "don't alert me on this until the state changes", and that is something, I believe, the majority of people on this issue want to achieve. The only case where this backfires is when the alert state flaps, but, as @fabxc pointed out years ago, "frequent state change of an alert can be prevented by using a fitting alerting expression", so that's a non-issue.

@brian-brazil by not at least providing an option to remove a silence on state change, you're essentially forcing your view of how monitoring should be done on a lot of users and other monitoring platform authors who don't necessarily agree with you and/or have long-established, tried-and-true platforms that can't easily integrate with Alertmanager, just because of your inflexible stance on this. I mean, just look at the guy who runs two Alertmanagers in series to fix his problem...

jutley commented 3 years ago

> Silence should be tied to the current state of the alert and not to the alert, IMO.

I think this is getting at the core of the issue. My team consistently finds itself having to manually resolve alerts that are no longer firing, simply because Alertmanager's silences are far too simple. In our case, we would like silences to prevent notifications when new alerts come up, but allow notifications when alerts transition to resolved. If silences were augmented to allow us (either the silence creator or the Alertmanager operator) to select specific state transitions, this would completely solve the problem for us.

I cannot count the number of times I have had to explain to our engineers why they are still being alerted on something that has resolved. After the explanation, there is always a shared consensus that Alertmanager is doing it wrong.
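
As a purely hypothetical illustration of the per-transition silences suggested above (nothing like this exists in Alertmanager today, and all names here are invented), a silence could carry a small policy describing which state transitions it mutes:

```go
package silencesketch

// TransitionPolicy is a hypothetical per-silence setting: which alert
// state transitions the silence should keep quiet.
type TransitionPolicy struct {
	SuppressFiring   bool // mute notifications for newly firing alerts
	SuppressResolved bool // mute notifications when alerts resolve
}

// SilenceSketch is an invented, simplified silence shape used only to
// illustrate the kind of control described above.
type SilenceSketch struct {
	Matchers map[string]string // simplified equality matchers
	Policy   TransitionPolicy
}

// Allows reports whether a notification with the given status ("firing"
// or "resolved") would get through this silence.
func (s SilenceSketch) Allows(status string) bool {
	switch status {
	case "firing":
		return !s.Policy.SuppressFiring
	case "resolved":
		return !s.Policy.SuppressResolved
	default:
		return true
	}
}
```

With `SuppressFiring: true` and `SuppressResolved: false`, new pages would stay quiet while the transition to resolved would still notify, which is the behaviour asked for above.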

tom0010 commented 3 years ago

I actually built a small workaround for this for the new Grafana Alertmanager, using a web server:

https://github.com/grafana/grafana/discussions/39615

But I agree with @isavcic: it should be up to the user to decide how the alerts are managed.
And it seems there are a lot of people wanting this feature.

andrew-phi commented 2 years ago

Having automatic "end-by-resolve" acks/silences would be a most desirable feature indeed! Also, is there any way of adding a notification at the end of a silence period? Because, for example, some very strange or serious alerts start firing, people panic, and an investigation starts, only to find out it's an expired silence from a month ago.

tulequ commented 2 years ago

my workaround: use a custom webhook server to send alerts and handle silences
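
The comment above gives no details, so the following is only a minimal sketch of one way such a workaround could be shaped: a small forwarder sits between Alertmanager and the real receiver, silences are tracked in the forwarder itself (not in Alertmanager, otherwise the webhook would never be called for silenced alerts), firing notifications that match a local silence are dropped, and resolved notifications always pass through. The port, downstream URL, and silencing rule are all placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
)

// alert carries just the webhook fields this sketch needs.
type alert struct {
	Status string            `json:"status"` // "firing" or "resolved"
	Labels map[string]string `json:"labels"`
}

// webhookMessage is the subset of Alertmanager's webhook payload we read.
type webhookMessage struct {
	Alerts []alert `json:"alerts"`
}

// locallySilenced is a stand-in for a real silence store; here it is a
// single hard-coded example rule.
func locallySilenced(a alert) bool {
	return a.Labels["cluster"] == "staging"
}

func handler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var msg webhookMessage
	if err := json.Unmarshal(body, &msg); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, a := range msg.Alerts {
		// Drop firing notifications for locally silenced alerts, but let
		// resolved notifications through so downstream incidents close.
		if a.Status == "firing" && locallySilenced(a) {
			continue
		}
		payload, _ := json.Marshal(a)
		resp, err := http.Post("http://downstream.example/notify", "application/json", bytes.NewReader(payload))
		if err != nil {
			log.Printf("forward failed: %v", err)
			continue
		}
		resp.Body.Close()
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/webhook", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```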

jutley commented 2 years ago

@tulequ Would you mind sharing a little more detail about your workaround? Since in this scenario the alert is silenced, I would imagine that your webhook is never called from Alertmanager.

hw4liu commented 2 years ago

I have an idea: https://github.com/prometheus/alertmanager/pull/2811

sthaha commented 2 years ago

Hi Everyone,

This topic has been long-standing and, like @roidelapluie mentioned earlier, the original request has digressed into multiple feature requests. However, I was wondering if we could reach a consensus on a subset of the requirements (excuse my naivety :-) I am new to this community).

What I could gather from all of the discussions above is that most of us seem to agree/wish that Alertmanager should send a resolved notification for alerts that are inhibited.

A production use case is documented in https://github.com/prometheus/alertmanager/issues/2754, which demonstrates an issue we commonly face.

**Use case**

In the flow below, there is a theoretical inhibition that says:

> when **`alert2`** is firing, **`alert1`** should be suppressed

`NOTE: Prom = Prometheus | AM = Alertmanager | PD = PagerDuty`

1. Prom: `alert1` fires
   - AM gets `alert1`, routes to PD
   - PD receives `alert1`
2. Prom: `alert2` fires
   - AM gets `alert2`
   - AM **`inhibits`** `alert1`
   - AM routes `alert2` to PD
   - PD receives `alert2`
   - PD now has two alerts: `alert1`, `alert2`
3. Prom: `alert1` resolves
   - AM resolves `alert1`
   - PD receives no notification as the alert is suppressed
   - PD: `alert1` becomes **orphaned**
4. Prom: `alert2` resolves
   - AM resolves `alert2`
   - PD receives a `resolved` notification for `alert2`
   - PD resolves `alert2`
   - **PD retains `alert1`, which is now orphaned**

I wonder if the maintainers (@simonpasquier, @w0rm, @roidelapluie) are open to allowing this use case to be addressed by adding a config option such as `notify_resolved_for_inhibitted_alerts` which, if enabled, would send the resolved notification, thereby preserving the current behaviour as the default.
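
As a purely hypothetical illustration of that proposal (neither this field nor this behaviour exists in Alertmanager; the struct and function names are invented, and the yaml tag simply mirrors the option name as proposed above), the gate could look roughly like this, keeping today's behaviour when the option is unset:

```go
package sketch

// notifierConfig sketches where an opt-in flag like the one proposed
// above could live; the field is hypothetical.
type notifierConfig struct {
	// ...existing receiver settings elided...

	// When true, resolved notifications would still be sent for alerts
	// whose firing notifications were inhibited. Defaults to false,
	// i.e. the current behaviour.
	NotifyResolvedForInhibitedAlerts bool `yaml:"notify_resolved_for_inhibitted_alerts,omitempty"`
}

// shouldNotify sketches the gating decision: inhibited alerts are
// dropped unless they are resolved and the opt-in flag is set.
func shouldNotify(cfg notifierConfig, inhibited, resolved bool) bool {
	if !inhibited {
		return true
	}
	return resolved && cfg.NotifyResolvedForInhibitedAlerts
}
```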

matejzero commented 2 years ago

I think that is one case; another one is "Nagios-like" alerting, where Alertmanager would send a resolved notification even if an alert is silenced. At the moment, alerts that are silenced and resolve within the silence timeframe don't get resolved in PagerDuty, and you end up with orphaned alerts.

sgametrio commented 1 year ago

Friendly bump for the PR #3034 :)

Is there anyone around that could review/suggest a different approach? It would be SO helpful to us too! 🙏

typeid commented 1 year ago

We have two setups, and in one of them we're missing the ability to resolve existing alerts once they become inhibited:

**Decentralized observability** (one Prometheus & Alertmanager per k8s cluster)

To silence a cluster, we are using a custom operator that deletes the Alertmanager's receiver (the PagerDuty service). After the silence is removed, the service is re-created.

**Centralized observability** (one Alertmanager and centralized Prometheus for the whole "fleet" of clusters)

To silence a cluster, we create an "inhibition alert", effectively inhibiting all alerts matching the cluster's identifier. This inhibition alert is filtered out in PagerDuty by severity to not page.

Our issue with this is that alerts created before the inhibition are still present in PagerDuty, causing confusion and clutter.

> @tulequ Would you mind sharing a little more detail about your workaround? Since in this scenario the alert is silenced, I would imagine that your webhook is never called from Alertmanager.

I believe we are currently using a similar setup to get rid of alerts created in PagerDuty before the inhibition: we are using a PagerDuty webhook triggering when an "inhibition alert" is received. This webhook makes an API call to one of our services to resolve all active alerts that would be inhibited. Without the feature requested here, it seems like adding your own automation to clean up after Alertmanager is necessary to deal with the problem.
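
The internal service mentioned above isn't described in the thread, so the following is only a generic sketch of the PagerDuty-side step such clean-up automation typically ends with: a `resolve` event sent to the Events API v2, assuming the integration's routing key and the alert's dedup key are already known to the caller. All values are placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// resolveEvent is the minimal Events API v2 payload for resolving an
// incident by its dedup key.
type resolveEvent struct {
	RoutingKey  string `json:"routing_key"`
	EventAction string `json:"event_action"` // "resolve"
	DedupKey    string `json:"dedup_key"`
}

func resolveInPagerDuty(routingKey, dedupKey string) error {
	body, err := json.Marshal(resolveEvent{
		RoutingKey:  routingKey,
		EventAction: "resolve",
		DedupKey:    dedupKey,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post("https://events.pagerduty.com/v2/enqueue", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Placeholder values; in the setup described above these would come
	// from the inhibition webhook and the service's own alert inventory.
	if err := resolveInPagerDuty("ROUTING_KEY", "example-dedup-key"); err != nil {
		fmt.Println("resolve failed:", err)
	}
}
```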

uberspot commented 3 months ago

> Friendly bump for the PR #3034 :)
>
> Is there anyone around that could review/suggest a different approach? It would be SO helpful to us too! 🙏

Could someone review this PR as a potential fix? 🙏 We're running into the same issue, which is fixed there.