sensu-plugins / sensu-plugin

A framework for writing Sensu plugins & handlers with Ruby.
http://sensuapp.org
MIT License
126 stars 117 forks source link

introduce acknowledgments #135

Closed fessyfoo closed 7 years ago

fessyfoo commented 8 years ago

short version: if a stash has type: "acknowledgment" expire_on_resolve: true then when the event is resolved delete the stash and handle the event.

I propose people need two workflows with regards to silencing alerts.

  1. downtime, the thing being checked may fail and resolve multiple times during some time interval, alerts should be filtered.
  2. acknowledgment, the thing being checked just failed, someone has acknowledged this fact and future failing alerts should be filtered, until the check is resolved. the resolution event should be handled.

right now sensu-handler only supports the notion of downtime with it's silence stashes. this has a few problems related to an acknowledgment workflow

This commit introduces acknowledgments by way of the type field in the normal /silence/* stashes if the type field is set to 'acknowledgment' and the event action is resolve then the stash will be deleted and the event processed.

a note on testing:

I wasn't sure how to test this. I did my best to imitate the current style of tests which is essentially stubbing and mocking via a subclass in an external script and signaling back expectations via stdout. mine still ended up more complicated than any existing ones.

I also tried to just use minitest and new() up a handler object making use of Sensu::Handler.disable_autorun but I got lost in the world of minitest mocks. if someone thinks that's better let me know I can try again.

An alternate abandoned approach:

I initially considered having acknowledgments as their own namespace in the stashes, such as /acknowledgment/:client/:name, however I decided that this would not have backwards compatibility with existing systems checking for silenced alerts and would add to the number of api calls made per handled event.

fessyfoo commented 8 years ago

TODO need to take this into consideration.

fessyfoo commented 8 years ago

@decklin @portertech any thoughts on this?

@bobtfish @solarkennedy I'd be interested in your opinions as well, as we're using your pagerduty handler and are trying to integrate acks from pagerduty into stashing in sensu which currently would result in pagerduty incidents not getting automatically resolved because the resolve event is filtered.

solarkennedy commented 8 years ago

Yea, we've talked about this at Yelp a bit, and somehow we have gotten away without acks all this time?

I wouldn't "mind" an ack primitive, but I would want it to be on the same playing field as native silences and not something that handlers have to "opt into".

I do agree that if there is to be an ack primitive it needs to be handled differently than a normal silence.

cwjohnston commented 8 years ago

Hi @fessyfoo,

Thanks for opening this PR, I can tell significant effort went into these changes and really appreciate the effort you put in to write tests and thoroughly describe your use case.

[...] which currently would result in pagerduty incidents not getting automatically resolved because the resolve event is filtered

I think the filter_repeated method is biting you here. The existing filter methods are being deprecated; I've blogged about this here https://sensuapp.org/blog/2016/07/07/sensu-plugin-filter-deprecation.html and have opened #136 to implement the deprecation.

Because silencing is moving into Sensu Core as a primitive, and because I believe it is in the best interests of the Sensu project to remove filtering from this library, I think adding a new feature of the existing filter_silenced method is probably the wrong thing to do right now.

fessyfoo commented 8 years ago

Thank you very much for your feedback, @cwjohnston I solidly understand the desire to move filtering out of sensu-handler. I'm all for that effort.

you are right filter_repeated looks like it would also filter out the resolution event, in some conditions. IMO It shouldn't filter resolves. #142

That's not what i'm getting bitten by though. Silencing filters out the resolve event as well

In the case of silencing, I think there are two valid use cases:

  1. downtime where filtering the resolve is desired.
  2. acknowledgment where filtering the resolve is not desired. (for the very same reasons you state in https://github.com/sensu-plugins/sensu-plugin/pull/136#issuecomment-232738538 )

The silencing implemented in sensu-handler.rb and presumably to be accommodated in https://github.com/sensu/sensu/issues/1328 filters out the resolution event. and that's the key difference between what i'm calling an acknowledgment vs downtime. In a pagerduty example, if the resolution event is silenced by a stash then you can't automatically issue a resolve to pagerduty. when the on-call intended the silence as aknowledgement not getting the resolution event is frustrating.

I think both acknowledgment and downtime forms of silence should be first class citizens. in both the old and new forms of silencing.

I've commented in https://github.com/sensu/sensu/issues/1328#issuecomment-228484884 which brings silencing into the core and thankfully so have you. I very much hope that the idea of acknowledgment can be accommodated there.

I believe it is in the best interests of the Sensu project to remove filtering from this library, I think adding a new feature of the existing filter_silenced method is probably the wrong thing to do right now

I understand that view. I agree with it.

I do feel that filter_silenced is serving as the reference implementation for things that the built in silencing will accommodate. So depending on the timeline for the release of built in silencing I would like to make one more plea for merging this. :)

fessyfoo commented 8 years ago

updated to use expire_on_resolve: true matching built in silencing's proposed implementation [ yes I know this is likely not to get merged. :D ]

fessyfoo commented 8 years ago

noting for clarity that the support for this feature as expire_on_resolve is included in the proposal https://github.com/sensu/sensu/issues/1328 and pending implementation https://github.com/sensu/sensu/pull/1400 of built in silencing.

cwjohnston commented 7 years ago

I'm closing this issue as the silencing implementation with expire_on_resolve shipped in 0.26!