prometheus / alertmanager

Send resolved notification for silenced alerts #226

Open fabxc opened 8 years ago

fabxc commented 8 years ago

Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent. This removes the need to manually resolve these in PagerDuty and friends.

We should probably provide information that tells whether it was an actual resolve or a resolve-via-silence.

This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.

raypettersen commented 7 years ago

This would be very useful for teams. Hope to see this feature soon.

lswith commented 7 years ago

You could implement this by sending an acknowledgement instead of a resolved notification?

jkemp101 commented 7 years ago

I think I have a slightly different use case. We often have three alert levels: info, warning, critical. At critical, PagerDuty gets notified. All alert levels send notifications to HipChat. I often silence an alert when I see it's at the warning level, so that if it goes critical it doesn't notify PagerDuty. But then I miss the resolved HipChat message, so I have to keep checking the status of the alert manually. I don't really want to silence resolves.
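
For context, our routing is roughly like this (receiver names and values are placeholders):

```yaml
route:
  receiver: hipchat            # every severity (info, warning, critical) goes to chat
  group_by: ['alertname']
  routes:
    - match:
        severity: critical
      receiver: pagerduty      # critical alerts additionally page
      continue: true           # keep matching so the chat route below still fires
    - match_re:
        severity: 'info|warning|critical'
      receiver: hipchat

receivers:
  - name: hipchat
    hipchat_configs:
      - room_id: '42'          # placeholder
  - name: pagerduty
    pagerduty_configs:
      - service_key: '<secret>'  # placeholder
```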

I wouldn't mind a different message going to HipChat letting everyone know I silenced it; that would be useful info. But I wouldn't want a message going to HipChat stating the alert has been resolved when I silence it; that would be misleading.

ivan-kiselev commented 7 years ago

+1 We use our own status board to notify external users if any problems in the infrastructure occur. We use webhooks to create events there, so when a problem occurs we want external users to see it on the status board, but sometimes we don't want a bunch of email (Slack, etc.) notifications since we already know about the problem, so we silence the alert. But since there is no resolve webhook when the alert is silenced, the problem still shows as present on our status board even after it's actually resolved. That causes some confusion, and it would be nice if silenced alerts reported the resolved state.

kamalmarhubi commented 6 years ago

This would be incredibly useful.

@fabxc

This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.

Is there another issue or something with an explanation of what you mean here?

kamalmarhubi commented 6 years ago

@fabxc could you elaborate a bit? I'm willing to put half a day or so into this, as it'd be a great improvement to the Prometheus + PagerDuty combo. The implementation looks fairly straightforward except for the uncertainty from your note above.

kamalmarhubi commented 6 years ago

@fabxc ping on the above question: how do you see this interacting with precomputed silences? I would like to work on this, but would like to avoid doing something that has to be thrown away in the near future.

kamalmarhubi commented 6 years ago

Another ping here. Any other maintainers, could you route this to the right person?

PMDubuc commented 6 years ago

+1 We're using Alerta to manage alerts from Alertmanager through a webhook. There's a plugin that lets us silence alerts in Alertmanager when they are acknowledged in Alerta. When a silenced alert clears, there is no resolved notification sent to Alerta, so the alert still looks like it's unresolved. If the problem recurs while it's still acknowledged in Alerta, it's automatically silenced in Alertmanager, because the silence for the alert remains active in Alertmanager, preventing alerts from being sent to other destinations. I think that if send_resolved is true in the webhook_configs, the notification should be sent even for silenced alerts and the silence should be expired in Alertmanager.
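
For reference, this is the kind of webhook receiver configuration I mean (the URL is a placeholder):

```yaml
receivers:
  - name: alerta
    webhook_configs:
      - url: http://alerta.example.org/api/webhooks/prometheus  # placeholder
        send_resolved: true  # resolved notifications are requested, but a silenced
                             # alert currently never produces one
```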

ezraroi commented 5 years ago

any comment from the maintainers?

juliusv commented 5 years ago

I am not an Alertmanager maintainer, but the semantic meaning of a silence is that it stops notifications; a silence does not indicate that the matched alerts are resolved (on the contrary, you often still want to see them as unresolved, active alerts on the Alertmanager dashboard, they just shouldn't notify anyone anymore). I'm not sure how to best marry that concept to platforms like Alerta, but I just wanted to provide background information to explain why this would be conceptually problematic.

PMDubuc commented 5 years ago

I am not an Alertmanager maintainer, but the semantic meaning of a silence is that it stops notifications; a silence does not indicate that the matched alerts are resolved (on the contrary, you often still want to see them as unresolved, active alerts on the Alertmanager dashboard, they just shouldn't notify anyone anymore). I'm not sure how to best marry that concept to platforms like Alerta, but I just wanted to provide background information to explain why this would be conceptually problematic.

Yes, but there is no problem with silencing unresolved alerts. It's when they are resolved that silence becomes a problem.

ezraroi commented 5 years ago

Exactly. If we still get resolved notifications, everything will be fine.

juliusv commented 5 years ago

@PMDubuc @ezraroi I see, yes, that is different from what the initial issue description said: "Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent."... but then the discussion drifted toward sending resolved notifications when silenced alerts are actually resolved, which can make sense. In that case though, what do we do with cases where even a resolved notification pages or bugs someone, and they would be annoyed if a silence still allowed resolved notifications? Sounds like either we'd live with that possibility or introduce yet another option.

PMDubuc commented 5 years ago

@PMDubuc @ezraroi I see, yes, that is different from what the initial issue description said: "Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent."... but then the discussion drifted toward sending resolved notifications when silenced alerts are actually resolved, which can make sense. In that case though, what do we do with cases where even a resolved notification pages or bugs someone, and they would be annoyed if a silence still allowed resolved notifications? Sounds like either we'd live with that possibility or introduce yet another option.

Unresolved alerts are silenced because otherwise they are sent repeatedly. Repeated alerts can be annoying while the problem is being resolved or while resolution needs to be delayed. On the other hand, resolved notifications are only sent once. It's nice to know when the monitor sees the problem is resolved. The behavior I would like is the way problem acknowledgements are handled in Nagios.

satterly commented 5 years ago

@juliusv If the webhook payload includes whether or not the alert has been silenced when sending resolved notifications then it could be up to the receiver how to handle the different scenarios you describe.

The behavior I would like is the way problem acknowledgements are handled in Nagios.

@PMDubuc Can you provide a link or describe briefly how Nagios handles this?

juliusv commented 5 years ago

@PMDubuc

Unresolved alerts are silenced because otherwise they are sent repeatedly.

Systems like PagerDuty handle this by grouping all alerts with the same incident key (= alert grouping labels hash) without notifying about each subsequent one. This is kind of how Alertmanager expects receivers to behave. Or if they can't, then set the repeat_interval really high, so you get fewer (or practically no) repetitions.
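
For example, something along these lines in the route keeps repetitions rare (the values are only illustrations):

```yaml
route:
  receiver: pagerduty
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 8760h  # ~1 year: practically no repeated firing notifications
```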

@satterly

If the webhook payload includes whether or not the alert has been silenced when sending resolved notifications then it could be up to the receiver how to handle the different scenarios you describe.

For the webhook that'd be an option. How would that be handled for all the other more specific receiver types though? Sending resolved notifications for silenced alerts on the webhook receiver only would seem inconsistent.

PMDubuc commented 5 years ago

@PMDubuc

Unresolved alerts are silenced because otherwise they are sent repeatedly.

Systems like PagerDuty handle this by grouping all alerts with the same incident key (= alert grouping labels hash) without notifying about each subsequent one. This is kind of how Alertmanager expects receivers to behave. Or if they can't, then set the repeat_interval really high, so you get fewer (or practically no) repetitions.

Well, this is fine. Alerta handles this case too, but I don't think it can be expected of all receivers, like email or others that may have a proprietary interface. If repeated notifications stop when they have not been silenced, they can be expired by the receiver. So "silencing" the alert is a form of acknowledgement that there is a problem and someone knows about it and is working on it. But when the problem clears, how are Alertmanager receivers supposed to be notified if the resolved notification is not sent? I explained this problem in my Aug. 2nd comment above. I don't understand why silencing alerts also applies to resolved notifications when send_resolved is true. I would think this would also be a problem for other receivers like PagerDuty. If no notification is sent when a problem is resolved, receivers can't update their status for the problem.

@satterly The way Nagios handles notifications for problems that are acknowledged is to silence the active problem notifications. When the problem clears, an OK status notification is sent and the acknowledgement is automatically removed. I think Alertmanager should do the same thing with silences, since they are also a form of acknowledgement of an active problem. When the problem clears, the silence should be cleared also. If the problem clears, it makes no sense to silence notifications about it. A resolved notification should be sent (unless send_resolved is false).

brian-brazil commented 5 years ago

I would expect that anything stopping the alert from firing would result in the same effect, whether that be the alert resolving, a silence, or an inhibition.

PMDubuc commented 5 years ago

I would expect that anything stopping the alert from firing would result in the same effect, whether that be the alert resolving, a silence, or an inhibition.

As I have tried to explain, I think having an exception to this for resolved notifications makes sense and the lack of this exception presents real problems for receivers. Also when silences persist after a problem has been resolved, it can prevent a new instance of the problem from being detected.

juliusv commented 5 years ago

@PMDubuc

Also when silences persist after a problem has been resolved, it can prevent a new instance of the problem from being detected.

Silences can cover arbitrary label combinations and do not have to correspond exactly to the alerts in an alert grouping. So you'd have a hard time finding a silence exactly matching the alert group in which all alerts just got resolved. Secondly, silences are frequently used to suppress notifications about flapping alerts, or in maintenance situations where stuff can be going up and down for a while, and you wouldn't want a resolved alert to remove a matching silence there either.

But yeah, maybe resolve notification behavior should be changed.

ezraroi commented 5 years ago

Maybe we can have a property in the config, similar to send_resolved for the webhook, that sends information about silences of alerts. This would ease the integration of other platforms with Alertmanager.
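
As a sketch only (send_silenced is not an existing option, just a name for the idea):

```yaml
receivers:
  - name: status-board
    webhook_configs:
      - url: http://status-board.example.org/webhook  # placeholder
        send_resolved: true
        send_silenced: true  # hypothetical: also notify the webhook when matching
                             # alerts are silenced or unsilenced
```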

brian-brazil commented 5 years ago

I don't see this as something that should be configurable, and send_resolved already causes enough implementation problems.

ezraroi commented 5 years ago

OK, so @brian-brazil my suggestion is:

  1. When silenced, we can send a silenced event to the webhook and stop sending firing events as long as it is silenced.
  2. When an alert is resolved, send a resolved event to the webhook regardless of the silence status.

I think this stays consistent and allows platforms to have all the information they need. What do you say?

brian-brazil commented 5 years ago

When an alert is resolved, send a resolved event to the webhook regardless of the silence status.

This is inconsistent: would you also send a firing event while it's silenced? If you have something that cares about silence status, I'd suggest it fetch that information from the API.

PMDubuc commented 5 years ago

I don't get it. Where is the inconsistency in sending a resolved notification for alerts that have been silenced while they were active? If an alert is resolved, it no longer needs to be silenced. If the receiver doesn't want resolve notifications, set send_resolved to 'false'.

brian-brazil commented 5 years ago

Where is the inconsistency in sending a resolved notification for alerts that have been silenced while they were active?

There's no inconsistency for sending a notification when a firing alert is silenced; however, if you want a second additional notification when the alert resolves regardless of silencing, then you would also need a firing notification regardless of silencing - which doesn't make sense, as that defeats the purpose of silences.

PMDubuc commented 5 years ago

Where is the inconsistency in sending a resolved notification for alerts that have been silenced while they were active?

There's no inconsistency for sending a notification when a firing alert is silenced; however, if you want a second additional notification when the alert resolves regardless of silencing, then you would also need a firing notification regardless of silencing - which doesn't make sense, as that defeats the purpose of silences.

I don't think a single resolved notification defeats the purpose of silencing the repeated notifications of an active problem. But I've explained all that above and don't wish to repeat myself. I don't think that the practical problems caused by suppressing problem recovery notices have been adequately addressed. The only workaround is an error-prone manual procedure, which doesn't make for a reliable alerting system. Removing the silence has to be done manually (easily forgotten) to avoid missing the recovery notification and subsequent new problem notifications (alerts).

jkemp101 commented 5 years ago

I'm wondering if I would just be happy with a new option on a silence that automatically deleted the silence once it no longer matched any active alerts. If this happened early enough in the alert state changes, the resolved notification could go out as if the silence never existed, at least for the last alert that "cleared" the silence.

For flapping alerts I would still rely on time expiration of a silence. I'm less concerned about getting the resolved notifications for those kinds of problems.
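
As a sketch of what I mean, the silence object could carry an opt-in flag (expireWhenIdle is made up; nothing like it exists today):

```yaml
# Silence as posted to the Alertmanager API, shown as YAML for readability.
matchers:
  - name: alertname
    value: HighErrorRate       # placeholder
    isRegex: false
startsAt: "2019-01-01T00:00:00Z"
endsAt: "2019-01-02T00:00:00Z"
createdBy: "jkemp101"
comment: "Known issue, don't page"
expireWhenIdle: true           # hypothetical: drop the silence once no active alerts
                               # match it, so the final resolved notification still goes out
```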

kien-truong commented 4 years ago

I don't think a single resolved notification defeats the purpose of silencing the repeated notifications of an active problem.

Unless your alerts are flapping, in which case you will have repeated Resolved notifications too. It's an inconsistency if there are no corresponding Firing notifications to these Resolved notifications.

Here's my 2 cents on the subject: a silencing rule should be removed if there are no matching alerts, after an idle-timeout period to avoid flapping. After the silencing rule is removed, only matching alerts that had triggered a notification before but resolved themselves during the silenced period would receive a Resolved notification. Still, this would require Alertmanager to become a lot more stateful than it currently is, and it would be quite complex to implement, especially in cluster mode.

seanorama commented 4 years ago

Here's my 2 cents on the subject: a silencing rule should be removed if there are no matching alerts, after an idle-timeout period to avoid flapping.

This is a good approach, assuming there is still an option to keep the silence from being cleared when the issue resolves, as there may be cases where you want to keep silencing it even if it clears.

PMDubuc commented 4 years ago

I don't think a single resolved notification defeats the purpose of silencing the repeated notifications of an active problem.

Unless your alerts are flapping, in which case you will have repeated Resolved notifications too. It's an inconsistency if there are no corresponding Firing notifications to these Resolved notifications.

Here's my 2 cents on the subject: a silencing rule should be removed if there are no matching alerts, after an idle-timeout period to avoid flapping. After the silencing rule is removed, only matching alerts that had triggered a notification before but resolved themselves during the silenced period would receive a Resolved notification. Still, this would require Alertmanager to become a lot more stateful than it currently is, and it would be quite complex to implement, especially in cluster mode.

Flapping detection and handling is a common feature in other monitoring systems. It's an inherent problem in monitoring and alerting. Finding some reasonable way to handle flapping in Alertmanager would only make it more useful.

lzh-lab commented 3 years ago

I'm wondering if we need a new status kind, "silencing", to represent silenced alerts.

aranair commented 3 years ago

+1 We use our own status board to notify external users if any problems in the infrastructure occur. We use webhooks to create events there, so when a problem occurs we want external users to see it on the status board, but sometimes we don't want a bunch of email (Slack, etc.) notifications since we already know about the problem, so we silence the alert. But since there is no resolve webhook when the alert is silenced, the problem still shows as present on our status board even after it's actually resolved. That causes some confusion, and it would be nice if silenced alerts reported the resolved state.

^is the exact same issue that my team is currently facing as well. Has anyone found a solution or even a workaround?

satterly commented 3 years ago

No, and it's been almost 5 years so I wouldn't hold your breath. 😞

roidelapluie commented 3 years ago

It seems that this issue is going in all directions and is not really actionable. It talks about resolved alerts, auto-deletion of silences, and many other topics. Which direction should we go in?

kien-truong commented 3 years ago

^is the exact same issue that my team is currently facing as well. Has anyone found a solution or even a workaround?

There's a workaround: use a short silence duration, but automatically extend it while you're dealing with the problem using something like kthnxbye.

It's not perfect; you still need to resolve alerts manually, but at least you don't have to guess the duration of your silence.

aranair commented 3 years ago

^is the exact same issue that my team is currently facing as well. Has anyone found a solution or even a workaround?

There's a workaround: use a short silence duration, but automatically extend it while you're dealing with the problem using something like kthnxbye.

It's not perfect; you still need to resolve alerts manually, but at least you don't have to guess the duration of your silence.

Thanks -- unfortunately my main issue is actually with the manual resolution of alerts downstream. I don't really have an issue with the duration, since we're inserting/removing silences programmatically to pre-emptively disable alerts, as opposed to using them as a way of acknowledgement.

aranair commented 3 years ago

I may be misinformed, but feel free to chime in and correct me if so:

Assuming a specific set of labels, there are two ways that a downstream receiver becomes out of sync with the true status of alert(s) in Alertmanager: 1) when a new alert fires during a silence, and 2) when a firing alert resolves during a silence.

If the alert status flips (again) while still silenced:

In either case, the state of the alerts at the downstream receiver remains consistent with Alertmanager.

With that assumption, I'll focus on what happens if it flips after the silence expires:

^^There are some differences in terms of consequence between the two scenarios. Arguably, I'll say that the consequence for 2) is worse than 1), and so perhaps it's somewhat okay for the handling of firing/resolved notifications during a silence to not be absolutely consistent/symmetric. And as you can probably already tell, I am +1 for asymmetric treatment of notifications, e.g. if an alert resolves during a silence, a notification is sent, but not if an alert fires with a matching silence.

vykulakov commented 3 years ago

Hi @aranair!

I've run into the same problem and have been using a simple workaround for a long time: two Alertmanager instances. The first instance just sends all alerts right into another alert system (a status board or similar) without any grouping at all (I use the "..." value for the related setting). This is used just to synchronize all alerts between both systems. The second instance groups alerts as needed and sends emails and other notifications. Silences, obviously, should be created only in the second instance. That's it.
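
Roughly, the relevant part of the first instance's config looks like this (names and URL are placeholders):

```yaml
route:
  receiver: status-board
  group_by: ['...']    # special value: group by all labels, i.e. no aggregation
  group_wait: 0s
  repeat_interval: 1h  # example value

receivers:
  - name: status-board
    webhook_configs:
      - url: http://status-board.example.org/webhook  # placeholder
        send_resolved: true
```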

This solution is also described here: https://github.com/alerta/alerta/issues/278#issuecomment-491241182 (may be useful somehow).

roidelapluie commented 3 years ago

Would it be acceptable to have a parameter for receivers:

Would that fix your use cases?

brian-brazil commented 3 years ago

I think that we should be handling this consistently without any configuration, and silences should be treated the same as any other way in which an alert is no longer firing.

satterly commented 3 years ago

I think that we should be handling this consistently without any configuration, and silences should be treated the same as any other way in which an alert is no longer firing.

@brian-brazil I'm really confused by your reply. Are you suggesting that the solution here is to do nothing?

brian-brazil commented 3 years ago

No, I'm suggesting that when you silence an alert, any notifications that would previously have been firing and are now stopped would trigger resolved notifications at the next group interval - indistinguishable from the alert having stopped firing in Prometheus. Otherwise remote systems can't hope to maintain an accurate state of what notifications are/aren't firing.

satterly commented 3 years ago

Still confused. How is this different to what was proposed by several people 2 years ago?

brian-brazil commented 3 years ago

Which one? As @roidelapluie says this issue is going in no particular direction, and I make it around 7 different ideas presented here by various people over the past 4 years.

aranair commented 3 years ago

No, I'm suggesting that when you silence an alert, any notifications that would previously have been firing and are now stopped would trigger resolved notifications at the next group interval - indistinguishable from the alert having stopped firing in Prometheus. Otherwise remote systems can't hope to maintain an accurate state of what notifications are/aren't firing.

@brian-brazil - And what if the silence expires and the same alert is still firing? Is another notification for "firing" going to be sent in this case? What would the notification look like? (with the original timestamp or treated like another new alert?)

Personally I think that a silenced firing alert is still firing (and not resolved) and that this pseudo resolution would wrongly tell downstream systems that the alert is resolved when it is potentially not. I still think asymmetrically only sending the notification when it's actually resolved is better :/

brian-brazil commented 3 years ago

And what if the silence expires and the same alert is still firing? Is another notification for "firing" going to be sent in this case?

Yes, at the next group_interval.

What would the notification look like? (with the original timestamp or treated like another new alert?)

Notifications don't have a timestamp, so I presume you're talking about the alert start time, which is an implementation detail you shouldn't depend on. It might be the same value as in previous notifications, it might be a different one - same as always.

this pseudo resolution would wrongly tell downstream systems that the alert is resolved when it is potentially not.

I think you might be mixing up definitions of resolved. Resolved for the Alertmanager means that the alert does not send firing notifications; it doesn't tell us anything about whether the underlying issue is resolved, as that would require a human to determine. See https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

aranair commented 3 years ago

I think you might be mixing up definitions of resolved. Resolved for the Alertmanager means that the alert does not send firing notifications; it doesn't tell us anything about whether the underlying issue is resolved, as that would require a human to determine. See https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

Hmm, I get where you're coming from with your note about the underlying issue, and I do agree with that post wholeheartedly (even if the current situation may suggest otherwise, heh).

But I think I should clear up what I meant: "when this new pseudo-resolved notification is sent, the (Prometheus) alert is still firing." (i.e. I was not referring to the real-world underlying issue)

With this new pseudo resolution, we basically could end up with different states between:

It would probably be okay for downstream receivers in charge of paging, but for historical / alert-recording receivers, I think this could be inaccurate and misleading. If it is the intention for AM to restrict the purpose of webhook receivers to pagers, I think it may be okay but I don't think that is the case, right?

(Side note: Your solution does solve my very specific issue of having orphaned unresolved incidents when alerts resolve during silences)

brian-brazil commented 3 years ago

If it is the intention for AM to restrict the purpose of webhook receivers to pagers, I think it may be okay but I don’t think that is the case, right?

That's not the intention - it's meant to cover anything. I'd personally never send resolved alerts to humans in any case; they're only a distraction from potential firing notifications.

It would probably be okay for downstream receivers in charge of paging, but for historical / alert-recording receivers, I think this could be inaccurate and misleading.

That's a general problem with treating resolved notifications as meaning the alert is resolved, which is already incorrect today due to inhibitions and how group_interval works. If you're trying to get a 100% complete log of firing alerts (as distinct from notifications), the webhook is not a good way to do that.

Your solution does solve my very specific issue of having orphaned unresolved incidents when alerts resolve during silences

That's the basic problem I see here: we're inconsistent. The notification should have happened after the silence was put in.