sensu-plugins / sensu-plugin

A framework for writing Sensu plugins & handlers with Ruby.
http://sensuapp.org
MIT License
126 stars 117 forks source link

exit unknown on exception or bad args #120

Closed fessyfoo closed 7 years ago

fessyfoo commented 8 years ago

when a plugin throws an unexpected exception we should exit(3) unknown. As it means the plugin was unable to signal warning or critical itself meaning the state of the thing being checked is unknown. This commit changes the behavior from exit(2) critical which generally results in someone getting paged with a false alarm.

Similarly, when someone deploys a check and has given it the wrong arguments we should also exit(3) unknown. We don't know the state of the thing we are attempting to check. We only know the check was misconfigured because the options are wrong.

cwjohnston commented 8 years ago

Hi @fessyfoo, thanks for the pull request and the explanation of your reasoning.

I'd initially tagged this for consideration in the v1.4.0 release, but I'm starting to think that if we accept this change it will constitute a breaking change to our public API. I'm removing this PR from the 1.4.0 milestone, but I think it would be good to hear from other folks as to whether this is something we want for 2.0.0.

cc @sensu-plugins/commit-bit

mattyjones commented 8 years ago

@fessyfoo I concur with the idea of an unknown if we have an exception, but not on a config error as that is a know issue.

In our stuff we have these exit codes for a reason. It allows us to be more descriptive with exits and then tailor monitoring to filter on specific codes, so anything getting a runtime issue goes to developer foo, anything with a critical error goes to on-call bar, etc.

// MonitoringErrorCodes provides a standard set of error codes to use.
// Please use the below codes instead of random non-zero so that monitoring can
// utilize existing maps for alerting and help avoid unnecessary noise.
var MonitoringErrorCodes = map[string]int{
    "GENERALGOLANGERROR": 129, // internal script error
    "CONFIGERROR":        127, // unix config error, not enough parms, etc
    "PERMISSIONERROR":    126, // not executable, etc
    "RUNTIMEERROR":       42,  // self explantory
    "DEBUG":              37,  // You had the Alliance on you, criminals and savages… half the people on this
    // ship have been shot or wounded, including yourself, and you’re harboring known fugitives.
    "UNKNOWN":  3, // Would it save you a lot of time if I just gave up and went mad now?
    "CRITICAL": 2, // “The ships hung in the sky in much the same way that bricks don't.”
    "WARNING":  1, // this kinda sucks but don't get out of bed to deal with it
    "OK":       0, // “We’re still flying”

}

The point is the 12N range are std nix convention and while not everything lives in nix land this seems to be a good path to follow. We don't need to get this complicated by no means but it does help to work towards using something more than 0-3 which IMHO has gone the way of WindowsNT ;)

fessyfoo commented 8 years ago

@mattyjones I believe that in addition to agreeing with me that an unhandled exception should exit(3) unknown, you're also in agreement with me that invalid arguments should not exit(1) warning

I think you further suggest that incorrectly configured arguments should get it's own special exit code. Since the sensu plugin docs suggest that any other exit code is also treated as unknown I'm not against that. however I don't think there's any standard to adhere too unless the sensu team wants to define one. until them I'd leave it exit(3) unknown

furthermore the nagios plugin api only specifies exit codes 0, 1, 2, 3 and I believe other exit codes result in nagios return code of NNN is out of bounds we likely don't care about backwards compatibility with nagios though?

fessyfoo commented 8 years ago

@mattyjones on a side note. your code above uses 127 as config error? this is the exit code that most posix shells will return for command not found. I wouldn't intentionally exit(127).

Ending a script with exit 127 would certainly cause confusion when troubleshooting (is the error code a "command not found" or a user-defined one?).

http://tldp.org/LDP/abs/html/exitcodes.html

fessyfoo commented 8 years ago

@cwjohnston thanks for the response. I agree this subtly alters behavior. I'd be interested to know if anyone had any issue with it.