sensu-plugins / sensu-plugin

A framework for writing Sensu plugins & handlers with Ruby.
http://sensuapp.org
MIT License
126 stars 117 forks source link

False alarms due to the API being temporarily down #165

Closed CVTJNII closed 7 years ago

CVTJNII commented 7 years ago

I've noticed in both our test and production environments that the filters will leak alerts if the API is unavailable for any reason. While this is a good "fail-safe" default it is a nuisance as the API can blip for a number of reasons, most notably it being restarted for some reason or if it needs to reconnect to redis. This is due to the stash_exists? treating anything other than a 200 response as an indication that the check isn't silenced:

https://github.com/sensu-plugins/sensu-plugin/blob/ead9fd0a81ff3f099a0032f90ae6976d63a345c5/lib/sensu-handler.rb#L178-L180

I don't believe it is a safe assumption to assume the API will have 100% availability, in practice it can be temporarily impacted, especially if pieces of the Sensu stack are being distributed for HA. I have mitigated this in my environment by overloading stash_exists? and making it retry on error, up to a preconfigured timeout. So far this has squelched all the false alerts for us.

Since all this code is going away with the filter deprecation is this even worth patching? Can this happen in the new built-in filter functionality?

majormoses commented 7 years ago

@cwjohnston any idea if it can happen with the new filtering? If so I think this needs a fix otherwise we should probably just close it as there is a reasonable work around until upgrading to 2.x

cwjohnston commented 7 years ago

Native silencing doesn't depend on the API to decide whether or not an event is silenced. Sensu Plugin 2.0 -- which disables stash based filtering by default -- was released in March and there's an effort under way to update existing plugin packages for use with that new version. We've also reimplemented the check dependency filtering in this library as an extension now built into Sensu Core.

As a result I don't think we'll be trying to address false alarms due to API availability in this library.