mpounsett / nagiosplugin

A Python class library which helps with writing Nagios (Icinga) compatible plugins.
https://nagiosplugin.readthedocs.io/

FatalContext #3

Closed by mpounsett 2 years ago

mpounsett commented 9 years ago

Original report by Christian Kauhaus (Bitbucket: ckauhaus, GitHub: ckauhaus).


spaans@fox-it.com:

I'd like to share a small bit of code that I find myself reusing again and again. It covers situations in which you want to signal that something went wrong, but cannot do so by crossing a (possibly unknown) threshold or by raising an exception (because an exception is reported as unknown, while the failure might actually be critical). This is the idiom I use:

#!python
class FatalContext(nagiosplugin.Context):
    def __init__(self, name):
        super(FatalContext, self).__init__(name)

    def evaluate(self, metric, resource):
        return nagiosplugin.Result(nagiosplugin.state.Critical,
                                   "Fatal Error: %r" % metric.value,
                                   metric)

...

    archive_fatal_ctx = FatalContext('fatal_archive')

...
            try:
                sess = get_session(session_id)
            except IOError as e:
                yield nagiosplugin.Metric('fatal_%s' % self.name_postfix, repr(e))
                return
...

If you find this useful as well, go ahead and put it into the nagiosplugin distribution.

mpounsett commented 2 years ago

I think any kind of exception raised while taking a measurement means that the measurement couldn't be taken. By definition that can't be CRITICAL or WARNING, since you don't know the result of the measurement you were trying to take.
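To make that reasoning concrete, here is a minimal, library-independent sketch of the behaviour I mean. It is not nagiosplugin's actual implementation; `run_check`, `probe` and `evaluate` are hypothetical names used only for illustration. The numeric values are the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(probe, evaluate):
    """Run a measurement and evaluate it (hypothetical helper).

    probe:    callable that returns the measured value
    evaluate: callable mapping a value to (exit_code, message)
    """
    try:
        value = probe()
    except Exception as exc:
        # No value was obtained, so no threshold can be applied:
        # the only honest answer is UNKNOWN.
        return UNKNOWN, "UNKNOWN: %s" % exc
    # A value exists, so thresholds decide between OK/WARNING/CRITICAL.
    return evaluate(value)
```

Only a successfully measured value ever reaches `evaluate`, so a failed measurement can never be reported as CRITICAL by accident.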

If, for example, you want to know "is this port answering" and the answer is "no", then the connect failure exception should be caught and a False or 0 (zero) result returned. If you're trying to measure whether a daemon is returning the correct content and you get an exception because of a timeout, that is the very definition of UNKNOWN. I don't think these two kinds of check should be mixed in a single test; mixing them is the sort of thing that produces an unexpected exception and makes you want to return CRITICAL.
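A rough sketch of the port example, using only the standard library. The function name `port_answering` and the injectable `connect` parameter are my own assumptions for illustration, and which exceptions count as a "no" answer is a judgment call: here only a refused connection does, while timeouts and other errors propagate and would end up as UNKNOWN.

```python
import socket


def port_answering(host, port, timeout=3.0, connect=socket.create_connection):
    """Return 1 if a TCP connection succeeds, 0 if it is refused.

    `connect` is injectable (an assumption of this sketch) so the
    function can be exercised without touching the network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # "No" is a valid measurement result, not a failure to measure.
        return 0
    conn.close()
    return 1
```

Inside a resource's `probe()` the result could then be yielded as something like `nagiosplugin.Metric('port', port_answering(host, port))` and left to an ordinary threshold-based context to turn a 0 into CRITICAL.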

For that reason I'm going to mark this wontfix: I don't think it's a bug that unhandled exceptions result in an UNKNOWN state.

If you've got an argument for why I'm wrong, I'm willing to entertain the idea. I just can't think of a use case where this would be a good idea.