raintank / worldping-api

Worldping Backend Service

flapping detection and alert suppression #24

Open woodsaj opened 8 years ago

woodsaj commented 8 years ago

Issue by Dieterbe Wednesday Sep 09, 2015 at 13:05 GMT Originally opened as https://github.com/raintank/grafana/issues/458


1) People can shoot themselves in the foot: any alerting setting that sits right at the "sweet spot" (henceforth referred to as the "sour spot") that your data dances around can cause a lot of alert notifications, basically constantly flipping between critical and ok because at each point in time the data has changed just enough to be considered critical or ok ("flapping", as per Nagios). Our current default settings put a lot of people right in the sour spot, but even with adjusted defaults the problem is still there.

2) People's typical data might be outside of the sour spot, but when they're having a service degradation, the amount of additional failures might be just the right amount to push their data into the sour spot, and they still become flap victims. (For example, they configured an alert if 6 out of 20 collectors return errors, and they normally have fewer than 2 errors at the same time, but a service degradation brings them to 5~7 erroring collectors.)

3) Frankly, if our collectors suffer some subtle issues that we can't easily detect or remediate in a timely fashion, those issues might also contribute to pushing the user's data into the sour spot.

In all these cases, we can't just keep sending emails to people.

I think we should do something similar, but also send them an email when we start suppressing notifications, with an explanation of what is happening, what we did, and that they might want to change their settings.
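A minimal sketch of what Nagios-style flap suppression could look like on our side, assuming an in-memory per-check history of state transitions; all names here (flapTracker, Observe, the window and threshold fields) are hypothetical, not existing worldping-api code:

```go
package alerting

import "time"

type State int

const (
	StateOK State = iota
	StateCritical
)

type flapTracker struct {
	window      time.Duration // how far back to look, e.g. 30 minutes
	maxChanges  int           // transitions allowed inside the window before we call it flapping
	transitions []time.Time   // timestamps of recent OK<->Critical transitions
	suppressed  bool
}

// Observe records a state change and reports whether notifications for this
// check should currently be suppressed because it is flapping.
func (f *flapTracker) Observe(prev, cur State, now time.Time) (suppress, justSuppressed bool) {
	if prev != cur {
		f.transitions = append(f.transitions, now)
	}
	// Drop transitions that fell out of the window.
	cutoff := now.Add(-f.window)
	kept := f.transitions[:0]
	for _, t := range f.transitions {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	f.transitions = kept

	flapping := len(f.transitions) > f.maxChanges
	justSuppressed = flapping && !f.suppressed // first time we start suppressing
	f.suppressed = flapping
	return f.suppressed, justSuppressed
}
```

The justSuppressed flag would be the hook for the one-time "we started suppressing notifications, here's why" email described above.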

woodsaj commented 8 years ago

Comment by Dieterbe Wednesday Sep 09, 2015 at 13:22 GMT


some thoughts/requirements/ideas

woodsaj commented 8 years ago

Comment by Dieterbe Wednesday Sep 09, 2015 at 13:43 GMT


We could of course also do like Bosun and treat an "incident" as open until someone explicitly closes it, and never send notifications for already-open incidents. So we would send a notification the first time it goes from ok to critical, and even if it changes back to ok and back to critical a gazillion times and a week passes, we don't send any new alerts. We expect users to follow up, fix any issues, change settings if necessary, and close the incident when they are done.

pro:

con:

I wish we could make the cons not so bad, because I really like the Bosun approach. Maybe every time they log in, we force them to deal with any open incidents, tell them to fix their settings if needed (and we could build UIs to help them with that), and then close the incident.
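For illustration, a rough sketch of the Bosun-style "incident stays open until a human closes it" model described above; the IncidentStore and Notifier interfaces are assumptions made up for this sketch, not existing code:

```go
package alerting

import "time"

type Incident struct {
	CheckID  int64
	OpenedAt time.Time
	Closed   bool
}

type IncidentStore interface {
	OpenIncident(checkID int64) (*Incident, error) // returns the currently open incident, if any
	Create(inc *Incident) error
	Close(checkID int64) error // called from the UI when the user closes the incident
}

type Notifier interface {
	Notify(checkID int64, msg string) error
}

// HandleCritical is called whenever a check evaluates to critical.
// It only notifies if there is no incident already open for the check.
func HandleCritical(store IncidentStore, n Notifier, checkID int64) error {
	open, err := store.OpenIncident(checkID)
	if err != nil {
		return err
	}
	if open != nil {
		return nil // incident already open: stay silent, even across OK/critical flaps
	}
	if err := store.Create(&Incident{CheckID: checkID, OpenedAt: time.Now()}); err != nil {
		return err
	}
	return n.Notify(checkID, "check went critical; incident opened")
}
```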

woodsaj commented 8 years ago

Comment by woodsaj Wednesday Sep 09, 2015 at 14:10 GMT


Alert notifications are hard. Rather than trying to solve this ourselves, I think it is best that we prioritize adding support for directing alerts to PagerDuty. PagerDuty will then handle the incident tracking and should not send additional alerts when something is already broken.

woodsaj commented 8 years ago

Comment by mattttt Wednesday Sep 09, 2015 at 14:16 GMT


This is all really great. As you've touched on, email is an inherently flawed notification method, as it'll never actually be representative of current state, and notifications/incidents/etc. are hard. That said, we do have to handle a part of this ourselves, as we can't count on every Grafana alerting customer to have PagerDuty.

I've been liking the idea of digests more and more, and letting the user define the frequency of the digest. We can also use those emails to let them increase/decrease the frequency of the digests through a couple links.

I'm thinking something like this:

Subject: Alert Notifications for endpoint_name

For the last 10 minutes:
-- :green_heart: 9:56am State changed to OK
-- :broken_heart: 9:55am State changed to Critical
-- :green_heart: 9:52am State changed to OK
-- :broken_heart: 9:50am State changed to Critical

[Get digests every 5m: :rabbit2:] [Get digests every 30m: :turtle:]

And those links can be some sort of hash that will update their digest frequency, without making them log in.
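One way the "some sort of hash" could work is an HMAC-signed token embedded in the link, so the handler can trust the frequency change without a login. This is just an illustrative sketch; the URL shape, parameters, and helper names are all assumptions:

```go
package digest

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

var secret = []byte("server-side-secret") // would come from config in practice

// sign produces a token binding the org and the requested digest frequency.
func sign(orgID int64, freqMinutes int) string {
	mac := hmac.New(sha256.New, secret)
	fmt.Fprintf(mac, "%d:%d", orgID, freqMinutes)
	return hex.EncodeToString(mac.Sum(nil))
}

// DigestLink builds the URL embedded in the email, e.g. the 5m / 30m buttons.
func DigestLink(baseURL string, orgID int64, freqMinutes int) string {
	return fmt.Sprintf("%s/digest/frequency?org=%d&freq=%d&token=%s",
		baseURL, orgID, freqMinutes, sign(orgID, freqMinutes))
}

// Verify is what the HTTP handler would call before updating the setting.
func Verify(orgID int64, freqMinutes int, token string) bool {
	expected := sign(orgID, freqMinutes)
	return hmac.Equal([]byte(expected), []byte(token))
}
```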

Thoughts?

woodsaj commented 8 years ago

Comment by mattttt Wednesday Sep 09, 2015 at 14:26 GMT


@woodsaj If we're going to rely heavily on Pagerduty or other similar services, should we be focusing more on our integrations with them?

woodsaj commented 8 years ago

Comment by ctdk Wednesday Sep 09, 2015 at 14:29 GMT


@woodsaj I've done some experimentation with sending alerts through PagerDuty already, and while it works, I've learned so far that it requires some finesse: if you just have the alerts emailed to PagerDuty, you get an alert stating that everything's A-OK that you then need to acknowledge. This can be dealt with, of course, but a tighter integration is certainly going to be better.

woodsaj commented 8 years ago

Comment by nopzor1200 Wednesday Sep 09, 2015 at 14:30 GMT


I agree 110% with @woodsaj

I feel strongly that we should NOT try to handle rollups, summaries, acknowledgement, etc. Instead, we should focus on being the emitter of high-quality alerts (whether via email, PagerDuty, SMS, Slack, etc.). I know we are experiencing pain in making sure these alerts are indeed high quality (alert logic, default settings, monitoring collectors better, etc.), but we need to solve that problem without scope creep.

My opinion on this is colored by talking about this for many, many hours with @woodsaj and people in general.

We have to be really good at emitting events, and people at scale can/will use something like PagerDuty (it doesn't have to be PagerDuty). People who are not at scale will use SMS or email directly.

The other stuff (e.g. on-call handling, escalation, incident acknowledgement, alert roll-up and summarization) is a really detailed and involved part of "monitoring" to solve, and I think it's key that we not dramatically expand our scope and end up with a half-assed attempt.

Doubly so because the world is moving to specialized systems that aggregate alerts from disparate systems (this is part of the reason behind PagerDuty's success). Unless Grafana is going to tackle that head on and get good at receiving and managing alerts from 3rd-party systems (which we have no real plans for), I think this is a bad path to go down.

woodsaj commented 8 years ago

Comment by Dieterbe Tuesday Sep 15, 2015 at 17:14 GMT


So basically we are saying that if you don't use a service such as PD, i.e. you use a "simple" alert destination such as email/text/Slack/..., you will get spammed as soon as your data dances around your threshold. Because yes, we can make sure that people end up with thresholds that are outside the typical error range of their data, but cases 2 and 3 (mostly 2) are unavoidable and will sooner or later lead to a flood of alerts via email, text, Slack, ..., and this is a problem we don't want to solve.

I don't think this necessarily has to come with rollups, summaries, acknowledgements, etc. There is a range of solutions (or I should say alleviations), from very simplistic in functionality and development effort (but not very accurate) to more advanced, and I think we should at least explore and consider these options a little bit. For example:

woodsaj commented 8 years ago

Comment by woodsaj Wednesday Sep 16, 2015 at 04:31 GMT


Yes, that is pretty much what we are saying.

There is a range of solutions (or I should say alleviations) from very simplistic in functionality and development effort (but not very accurate) to more advanced.

This is our point exactly: we could probably slap something simple together, but it wouldn't address many of the complex issues with alerting and may actually introduce new issues for certain cases. We want to avoid delivering solutions that are not very good, and we don't have the resources to tackle an advanced solution.

woodsaj commented 8 years ago

Comment by nopzor1200 Friday Sep 18, 2015 at 17:04 GMT


I think this topic can be split into three points, in order of priority:

1) Make Litmus "not noisy" for the 99% use case by default

Dieter asks: "if you don't use a service such as PD, i.e. you use a "simple" alert destination such as email/text/Slack/..., you will get spammed as soon as your data dances around your threshold."

My answer would be "sort of". You'd get spammed as soon as there was either a real issue with your site or a major issue on the Internet. That's OK. The question is: can Litmus really get to that point? I believe yes.

@woodsaj do you agree? @Dieterbe ? @mattttt ?

If the answer to that is no, then my point falls apart. That being said..

I really think that our default offering can operate without false alerts (unlike now), with a few things: (a) sensible defaults, (b) auto-disabling probes, and (c) potentially other logic changes. We already know about (a), and some changes recently happened; I think that alone will go a long way. Also important is (b); we need to make progress there.

I don't disagree with your overall point, and like where your thought process is going, though. Which brings me to...

2) Potentially put in some 'anti-spam' controls, whereby we recognize we are directly notifying humans in a lot of cases, and we can do "nice things" for them so we don't melt their phones.

There could be some coarse "make sure we don't blow it" anti-spam controls that would be worth considering.

Ideas like rate limits on alerts per endpoint per hour or alert suppression are worth considering, as long as (a) they're not at the expense of 1) and (b) the scope is just about "controlling the number and quality of our alerts", and not getting into how we "interact with, route, ack, roll up, etc." alerts (PagerDuty territory). This could be exposed to the user as a "don't spam the crap out of me" option of sorts, maybe.

If that's the case I'm 100% for thinking how to address this point. I've had some similar conversations with Matt too.
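To make the per-endpoint-per-hour rate limit idea above concrete, here is a rough sketch; it deliberately only caps notification volume and doesn't touch alert logic, and every name in it is hypothetical:

```go
package alerting

import (
	"sync"
	"time"
)

type rateLimiter struct {
	mu      sync.Mutex
	perHour int                   // e.g. 10 notifications per endpoint per hour
	sentLog map[int64][]time.Time // endpointID -> timestamps of recent notifications
}

func newRateLimiter(perHour int) *rateLimiter {
	return &rateLimiter{perHour: perHour, sentLog: make(map[int64][]time.Time)}
}

// Allow reports whether another notification may be sent for the endpoint now.
func (r *rateLimiter) Allow(endpointID int64, now time.Time) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	// Keep only the notifications sent within the last hour.
	cutoff := now.Add(-time.Hour)
	kept := r.sentLog[endpointID][:0]
	for _, t := range r.sentLog[endpointID] {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	if len(kept) >= r.perHour {
		r.sentLog[endpointID] = kept
		return false // over the cap: suppress this notification
	}
	r.sentLog[endpointID] = append(kept, now)
	return true
}
```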

Personally I worry about some bad code or a bad deploy or something causing alerts to go out to everyone, especially if we solve 1).

3) Our own alert management capabilities

We are far from having good integration with existing players that can "consume these alerts" (e.g. PagerDuty, Opsgenie, even things like Bosun(?)). I think we should prioritize these integrations.

woodsaj commented 8 years ago

Comment by Dieterbe Monday Sep 21, 2015 at 07:03 GMT


The question is: can Litmus really get to that point? I believe yes. @woodsaj do you agree? @Dieterbe? @mattttt?

yes

I really think that our default offering can operate without false alerts (...)

Another feature that should be implemented to make this happen: default thresholds should be chosen in consideration of the data. Sure, we may decide that "5 collectors" (or whatever) is a good default threshold, and it'll work for most people because they typically have significantly fewer errors, but people adding websites that aren't performing that well to begin with (i.e. for whom ±5 erroring collectors is the standard mode of operation) will get spammed by default and would be better served by a different default threshold. This is a minor point on the side, because I think it won't be common, but it will happen for some people with not-so-good hosting providers.

This of course raises the point: it's only a default, you're supposed to change the settings, but how can you know which settings are appropriate before seeing the data? That makes me think perhaps we should redo the workflow: first complete the data gathering settings, then get some data, and only then enable and configure alerting, where we can prefill the thresholds with settings that we recommend.
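As a rough illustration of prefilling a threshold from observed data, something like the following could recommend a collector-error threshold from an initial observation period; the margin and floor values are guesses for the sketch, not agreed-upon numbers:

```go
package alerting

// RecommendThreshold takes per-check-run counts of erroring collectors
// gathered during an initial observation period and suggests a threshold.
func RecommendThreshold(errorCounts []int, totalCollectors int) int {
	maxSeen := 0
	for _, c := range errorCounts {
		if c > maxSeen {
			maxSeen = c
		}
	}
	// Recommend a couple of collectors above the worst observed baseline,
	// but never below a small floor and never above the collector count.
	rec := maxSeen + 2
	if rec < 3 {
		rec = 3
	}
	if rec > totalCollectors {
		rec = totalCollectors
	}
	return rec
}
```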

Also important is (b) [auto-disabling probes], we need to make progress there.

+1

Ideas like rate limits on alerts per endpoint per hour or alert suppression are worth considering, as long as (a) they're not at the expense of 1)

Isn't this a given? I have to warp my brain a lot to come up with an alert rate-limiting system that would somehow cause Litmus to be more noisy (since your point 1 was Litmus being non-noisy 99% of the time by default).

Personally I worry about some bad code or a bad deploy or something causing alerts to go out to everyone, especially if we solve 1)

Yes, me too. It's common for SaaS vendors to accidentally send a shitload of emails, followed by an apology. In our case it would be even worse, of course: we might be waking people up.

i think we can alleviate this by being diligent about our dev/staging environment, and running a lot of load on it, ideally we copy all accounts/settings from prod to dev (but with replaced email addresses), that would let us spot email floods in dev. we could go as far as building an auto-pause emails when the volume abrubtly increases, requiring operator intervention, but that runs the risk of the manual intervention to be too time consuming, making the alerts become moot once they are allowed to be sent out.