sensu-plugins / sensu-plugins-opsgenie

Sensu plugins for OpsGenie
http://sensu-plugins.io
MIT License
4 stars 24 forks source link

Ability to override priority #28

Open majormoses opened 6 years ago

majormoses commented 6 years ago

This was created to discuss #27 by @Castaglia at a high level on the methodology before jumping into specific code.

Currently the handler creates incidents/events/alerts (whatever the right terminology for opsgenie is) in the default Priority which is P3. Ideally we should be able to change that based on various factors.

When I consider an incidents urgency there are usually several factors at play and as such I think we should outline how we would like this to work as there are different schools of thought on this matter. To best illustrate how I think we should proceed I pose the following scenarios and ask for feedback.

We should consider overrides beyond at the check level to include the handler and client level config. When supplied more than one location (such as check + handler) who's config takes precendence. To better illustrate I pose the following scenarios:

Environments

Not all environments are equal I might want to ensure that non production environments have a low default priority and a higher default in production.

Checks

Not all checks have equal priority, a caching service failing and my database service failing are two very different things.

Some clients are more important than others

My database nodes and application nodes very differently in how I respond to incidents. A disk at 90% on an app node that does not write much to disk and the same percentage on my database node are different.

My thoughts

I think given the above I think we want to create 2 types of priorities:

Why default priorities and the precedence proposed

Well to allow an admin to say default in production I want things to be P4 but allow overriding (higher|lower) on per client or check basis. In this case the most specific is the check definition so it wins that one.

Why override priorities and the precedence proposed

In non production environments you might say nothing is ever above a P4 as that might mean drop whatever you are doing and this prevents configuration on a check level being overridden as you specified this at the handler level.

Feedback

This was just my initial reaction to the feature and look forward to hearing what others have to say on this.

Castaglia commented 6 years ago

This is very much in line with my thinking on this topic as well. The default values (if needed) are configured for the handler, and thus apply to any/all OpsGenie alerts/interactions. The Sensu check configuration can then supply any configurations that override the handler values. And then the check command itself can also supply the configurations.

And indeed, this is what we see happening (intended, even?), here.

The "missing" bits, in my mind, are the lookup and use of these various fields, when making the requests to OpsGenie's APIs, via the #post_to_opsgenie method. In the case of event priority, there was no existing code to look for a "priority" value is any of the configurations (handler, check, command). I wanted my priority-setting PR to be scoped to just that field; if this approach is acceptable, I expect to provide follow-on PRs for the other fields of interest for my needs.

And indeed, the driving use case for my needs is the fact that some of my development environment checks require high priority treatment, and some of the production environment checks should be using low priority treatment (and currently are not). We can work around this by creating separate "teams" in OpsGenie, each with different escalation rules and separate API keys -- but that is working around limitations in this gem's implementation; I wanted to address that root cause, if possible.

majormoses commented 6 years ago

I think we are mostly in agreement but from your comment I do think one nuance was not conveyed or understood properly. When I mean default and override options I mean a (priority|default_priority) which could be defined at the handler, check, client, and check level. While the override_priority key which could only be specified at the handler and client level. Aside from the fact that this is how we would avoid a breaking change I think this really enables the following two philosophies to coexist:

Does that make anymore sense? Keep in mind that while in many orgs the people writing sensu checks are the same people as the admins that is not the case in all orgs. I know quite a few large places that basically offer sensu as a service and they might install say a deb/rpm to install checks created by someone else.

Castaglia commented 6 years ago

Well, I understand that use case, but I am not sure how necessary it is, to address that, right now -- as opposed to iterating on a smaller change. Adding multiple levels of fallbacks and defaults leads to things like this, which I would very very much like to avoid, if possible.

majormoses commented 6 years ago

OK, I was actually thinking of a simplified version of that, we can skip for now as long as we do it in backwards compatible way. Honestly while chef's attribute precedence is somewhat complicated it is the reason they are so powerful, every attribute precedence has a real use. I have used every level of attributes and am grateful of them because it allows me to remain flexible. I do think ours would look a lot simpler but still would be a bit complex but is because alerting is complex. When I have some time tonight I will review the code, thanks for talking this through with me.