prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0
6.64k stars 2.15k forks source link

feature request: cancel pushover notifications when the alert resolves itself #634

Open stapelberg opened 7 years ago

stapelberg commented 7 years ago

Currently, when an alert fires while my phone is in silent mode, pushover sends notification after notification, waiting for me to acknowledge them. Notably, even if the alert is resolved by the time I see the alert, I still need to acknowledge the notification. This seems unnecessarily aggressive.

https://pushover.net/api#receipt, section “Canceling Emergency-Priority Retries” describes how to cancel notifications.

I suggest alertmanager should cancel notifications when the corresponding alert is resolved.

brian-brazil commented 7 years ago

I'm not sure that that's the correct behaviour. Just because an alert is no longer firing doesn't mean the underlying issue is resolved.

stapelberg commented 7 years ago

If the underlying issue was transient in nature, there’s nothing one can do about it. Think network issues with an upstream provider. Over-alerting adds to alert fatigue, which I’d like to avoid.

I can see how different people and setups might have different approaches here. Maybe we could make the feature configurable?

brian-brazil commented 7 years ago

If the underlying issue was transient in nature, there’s nothing one can do about it. Think network issues with an upstream provider. Over-alerting adds to alert fatigue, which I’d like to avoid.

The correct solution then is to either adjust or delete the alert. A false positive alert that happens to resolve before you get to it isn't any better than one that resolves after you ack it.

stapelberg commented 7 years ago

I’ve been preaching the “adjust or delete the alert” mantra myself, so I’m familiar with the rationale. However, as soon as third-party systems (in my case: the internet) are involved, it’s often not quite as simple anymore. I don’t get any defined SLA from my ISPs, so I can’t design my alerts to only fire after their alerts have fired.

What I’m doing is best-effort oncall: I want alerts to fire soon-ish (my current threshold for most is 1 hour), but I will not get woken up by an alert that fires at 03:00 in the night. Ideally, my phone wouldn’t buzz and require acknowledgement when do-not-disturb ends at 08:00 sharp in the morning either, because starting the day in a mild panic for an issue that I can do nothing about isn’t a great experience :).

Increasing the alert hold-off time for alerts to ≥ 10 hours isn’t a good option either, as that would leave me flying blind for most of the day.

One alternative solution might be to silence all alerts in alertmanager for the duration of my night. I dislike that solution, however: suddenly I’d need to maintain the do-not-disturb status in both my phone and in alertmanager. This is a hassle when I’m traveling to different time zones, when I’m watching a movie, etc.

I hope this outlines the motivation for the feature.

brian-brazil commented 7 years ago

The feature you're requesting here is merely a workaround for the actual problem.

It sounds like the feature you're really requesting is time of day based alerting.

stapelberg commented 7 years ago

As I tried to explain, time-of-day based alerting seems like an inadequate way to solve the problem to me: I want time-based alerting to use the settings of my pager device. I consider duplicating the setup between my phone and alertmanager too much of a hassle to be practical.

If alertmanager implemented cancelling notifications, my pager setup would just do the right thing.

brian-brazil commented 7 years ago

What you're requesting is well beyond the scope of the alertmanager. Incident management with per-user notification configuration is the responsibility of something like Pagerduty.

stapelberg commented 7 years ago

Well, yes. What I’m saying is that in order for Pushover to implement my desired workflow, Pushover requires API users to not only create notifications upon alerts starting to fire, but also cancel the created notifications when alerts cease to fire.

This seems fairly simple to me from an actual implementation standpoint. Alertmanager just needs to store the Pushover notification identifier with the alert and send one additional HTTP request once the alert stops firing (it already sends a “resolved” notification in that case).

brian-brazil commented 7 years ago

Alertmanager just needs to store the Pushover notification identifier with the alert and send one additional HTTP request once the alert stops firing (it already sends a “resolved” notification in that case).

That's not going to work out technically. Such information wouldn't be synced across HA nodes, and would interact in a non-trivial way with grouping.

Neither the AM nor Pushover are really designed for this use case, and I don't think it's sane to try and make them cover it.

stapelberg commented 7 years ago

That’s a shame, but thanks for the details.

discordianfish commented 7 years ago

I came here to open the exact same issue. I fully agree with @stapelberg about the semantics and it matches what I'm used to and what I want to use pushover for. I think it's completely the right think to resolve the alert at pushover when it's resolved in the alertmanager.

discordianfish commented 7 years ago

It's also exactly what pagerduty is doing IIRC. I'm not sure how it works with grouped alerts, but I'm sure resolved prometheus alerts resolve the alerts in PD.

SuperQ commented 7 years ago

Looking over the other functionality, we do send resolved messages to PagerDuty. It seems like "Cancel" is the equivalent for Pushover. We should support this. Users can control the behavior with send_resolved.

brian-brazil commented 7 years ago

Pushover is not an escalation management system, so the semantics don't quite align. This is would be more like deleting a Slack message when the alert goes away, which doesn't at all feel right.

discordianfish commented 7 years ago

It's not an 'escalation management system' but closer to it than it's to Slack.

stapelberg commented 7 years ago

Strong +1 to what @discordianfish wrote. Pushover retains messages even after they are canceled, it will just stop demanding your attention.

brian-brazil commented 7 years ago

This still doesn't feel correct. That an event got to the stage of a paging notification at all indicates that a human needs to urgently take a look.

I don't think we should be implementing features that encourage bad alerting practices. https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit which is what we follow goes into this more.

discordianfish commented 7 years ago

@brian-brazil Then don't use send_resolved. I also disagree it's bad alerting practice. Maybe we can at least agree that there are different opinions on that and support what we can without sacrificing other goals? I'd understand your argument if this would recommend making it impossible to not resolve them.

Frankly in the end of the day three people want this feature and you're whole counter argument is that we are not knowing what we're doing/want. This isn't a healthy position for a maintainer.

stapelberg commented 7 years ago

I agree that using this feature in the wrong context encourages bad alerting practices. However, I outlined earlier in this discussion that I think I have a valid use-case for my specific context, and it seems that @discordianfish has one, too.

I’d love it if you could reconsider your position on this feature.

brian-brazil commented 7 years ago

@stapelberg My understanding of your use case is that you don't want notifications between 22:00 and 08:00, and don't want a notification that fired only during that time to continue to be notifying you at 08:00.

The feature requested here is a roundabout way to make that work in some of your particular circumstances, and is not really doing what you want as it'll also stop notifications early during the daytime.

What I believe is the correct solution here is alert routing based on time of day. Oddly despite all the previous requests we've had for this I can't find an issue for it, so I've opened #876 for tracking. I don't recall anyone expressing strong opinions around implementing it one way or the other.

Right now the way to address this would be to handle it on the Prometheus side, https://www.robustperception.io/combining-alert-conditions/ has an example.

I’d love it if you could reconsider your position on this feature.

I'm not the maintainer of this repository, so it's not ultimately my call.

The questions here are a) should send_resolved=true use the Cancel feature of Pushover? (if that's even technically possible with the current alertmanager HA design, I suspect it isn't), and b) what should the default for send_resolved be for Pushover?

The current rule as to whether send_resolved is enabled by default is if the relevant system has a notion of "resolved" incidents (e.g. email/slack/hipchat don't, it's just a 2nd notification), which may not be the right rule if e.g. PagerDuty is halting notifications on resolution.

I'd understand your argument if this would recommend making it impossible to not resolve them.

I'm not following your logic here. Can you expand?

Frankly in the end of the day three people want this feature and you're whole counter argument is that we are not knowing what we're doing/want. This isn't a healthy position for a maintainer.

I don't agree there. We have many users who are used to other systems and bring pre-conceptions with them that don't apply to Prometheus-style monitoring, or are used to unhealthy practices that lead to burnout. Many of them are vehement about burning themselves out. In order to maintain a coherent and useful project, one of the duties of a maintainer is not to blindly implement every feature requested by N users. Instead they should consider whether the request instead comes from a misunderstanding or attacking the problem from the wrong angle which can be resolved with education/advice, makes sense in some limited circumstances but would overall be a disservice to users, makes sense but would overbroaden the scope of the project, or makes sense but is too complicated.

For example if you look at all the feature requests/bug reports we get, the vast majority of them turn out to be support requests of some form.

discordianfish commented 6 years ago

So what about this? Can somebody with the authoritah finally decide if this is something that would be accepted?

discordianfish commented 6 years ago

I've looked into their API and we would need to store their 'receipt' ID and I don't think there is a way yet in the AM to do that. Addin such fuctionality just for pushover doesn't seem to be worth it, unless we have the need to store similar state for other notifiers but it doesn't look like it.

discordianfish commented 6 years ago

I've sent a feature request to pushover help, asking for a stateless way e.g by supporting a client provided identifier for the message.

jcs commented 6 years ago

Hi, Pushover developer here.

You can now add a tags parameter when creating a message which is a comma-separated list of arbitrary tags. Any of those tags can then be passed to https://api.pushover.net/1/receipts/cancel_by_tag/(your tag).json to cancel all receipts with that tag.

https://pushover.net/api/receipts#cancel_by_tag

discordianfish commented 6 years ago

Great, I'll submit a PR to add canceling to the alertmanager soon.

asymmetric commented 3 years ago

@discordianfish have you gotten around to it?

discordianfish commented 3 years ago

Nah, I killed the project for which I wanted this, so haven't had a reason to implement it.. But it might get back to it now that I'm redoing my personal infra stuff.. Should be straight forward to implement though, so if you want to take a stab at it, go for it.

onedr0p commented 1 year ago

I'd be interested in this too. 👍

lukasmalkmus commented 1 year ago

Has this been picked up by anybody? I get spammed by PushOver, especially when setting up new alertmanagers and it takes a while until I have fine tuned my Prometheus alert rules to not fire to early.

I'm also happy to contribute a PR using the cancel_by_tag API from PushOver.

onedr0p commented 1 year ago

@lukasmalkmus please feel free to implement this and open a PR 👍🏼

onedr0p commented 10 months ago

To sum up this issue, would the implementation of this feature only apply to Pushover's "Emergency Priority" Alerts? What I am thinking is if an alert with priority is 2 (Emergency) is sent and then resolves itself it should use cancel_by_tags to dismiss it, correct? I don't see a reason why having it work for anything but a 2 (Emergency) makes sense here.

Emergency-priority notifications are similar to high-priority notifications, but they are repeated until the notification is acknowledged by the user. These are designed for dispatching and on-call situations where it is critical that a notification be repeatedly shown to the user (or all users of the group that the message was sent to) until it is acknowledged. The first user in a group to acknowledge a message will cancel retries for all other users in the group.

lukasmalkmus commented 10 months ago

I'm not gonna lie: I really wanted to tackle this but went the lazy route and just changed the priority for alerts going to Pushover. This worked perfectly as expected.

This also means you are correct, there is no need to cancel on non-emergency priority as the notification is not repeated for lower level priorities.