iloire / watchmen

A simple node.js service monitor
MIT License
941 stars, 195 forks

Only notify on sustained outages #41

Closed: renderful closed this issue 9 years ago

renderful commented 9 years ago

I recently set up WatchMen to monitor ~150 websites and web applications. With the default settings it is very chatty: 50-100 emails per hour. I modified the intervals and timeout to be more forgiving, but I am still getting about 3 outage emails per hour. When I visit the sites that are reported as down, they are not down. I believe my definition of an outage is different from what WatchMen considers an outage.

I only want to know about sustained outages, and it seems that WatchMen does not have this capability at present. A single timeout or connection reset is not indicative of a real outage, but multiple failed pings in a row would be.

I've looked into the codebase, and it seems like one way to do this would be to create a new event called sustained-outage, with a matching callback onSustainedOutage. Then create a storage method that tells us how many pings have failed in a row. sustained-outage would be emitted once a configured number of consecutive pings have failed. Then I'd modify the SES notification plugin to send its new-outage email only during the onSustainedOutage callback.

Is there a better way? Am I missing the existence of this feature in WatchMen as it is today?

iloire commented 9 years ago

Hi @renderful,

I think you are on the right track. Some additional input:

A new storage method, increaseOutageFailureCount, could increment the failure count and return the current value. Also, there may be no need for a new event or for changes to the notifications plugin.

This is some quick, untested, probably buggy code:

        if (!outage) {

          /**
           * First failure
           */

          var outageData = {
            timestamp: timestamp,
            error: error
          };

          storage.startOutage(service, outageData, function (err) {
            if (err) {
              return callback(err);
            }

            if (!service.failureThreshold) {
              self.emit('new-outage', service, outageData);
            }

            callback(null, service.failureInterval);
          });

        } else {

          /**
           * Not the first ping failure for this outage
           */

          storage.increaseOutageFailureCount(service, function (err, currentFailureCount) {
            if (err) {
              return callback(err);
            }

            if (!service.failureThreshold) {
              self.emit('current-outage', service, outage);
            } else {
              if (currentFailureCount === service.failureThreshold) {
                self.emit('new-outage', service, outage);
              } else if (currentFailureCount > service.failureThreshold) {
                self.emit('current-outage', service, outage);
              }
            }
            callback(null, service.failureInterval);
          });
        }

In the storage, increaseOutageFailureCount can call Redis INCR on a key. archiveCurrentOutageIfExists should delete (or reset) that key.
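
A rough sketch of that storage piece, assuming a Redis client that exposes callback-style incr(key, cb) and del(key, cb). The key layout, the service.id field, the resetOutageFailureCount helper and the in-memory stand-in client are all assumptions made for illustration; they are not taken from WatchMen's actual storage code.

```javascript
// Hypothetical key layout; WatchMen's real storage may name its keys differently.
function failureCountKey(service) {
  return service.id + ':outage-failure-count';
}

// Wraps a client that offers callback-style incr(key, cb) and del(key, cb).
function createFailureCounter(client) {
  return {
    // INCR is atomic and hands the new counter value to the callback.
    increaseOutageFailureCount: function (service, callback) {
      client.incr(failureCountKey(service), callback);
    },
    // Hypothetical helper, meant to be called from archiveCurrentOutageIfExists
    // so that the next outage starts counting from zero again.
    resetOutageFailureCount: function (service, callback) {
      client.del(failureCountKey(service), callback);
    }
  };
}

// In-memory stand-in that mimics INCR/DEL so the sketch runs without a Redis server.
function createFakeClient() {
  var data = {};
  return {
    incr: function (key, cb) {
      data[key] = (data[key] || 0) + 1;
      cb(null, data[key]);
    },
    del: function (key, cb) {
      var existed = Object.prototype.hasOwnProperty.call(data, key) ? 1 : 0;
      delete data[key];
      cb(null, existed);
    }
  };
}
```

Passing a real client instead of createFakeClient() should work as long as it accepts the same callback-style calls.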

Hope it helps. I will be more than happy to review your PR and answer any other questions.

EDIT: probably a better approach is to count failed pings and only create an outage record when failureCount > service.failureThreshold. The new-outage event would be triggered in the same fashion when the outage is created. It would be up to us to use the first or the Nth failure as the outage's timestamp.
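
To make the EDIT concrete, here is a minimal in-memory sketch of that approach: failures are counted, and an outage record is only created once the count exceeds the threshold. createOutageTracker, onFailure, onSuccess, the useFirstFailureTimestamp option and the service.id field are hypothetical names chosen for this example, and this is not necessarily how the 3.1 release implements it.

```javascript
// Hypothetical tracker; the names are illustrative and not WatchMen's API.
function createOutageTracker(options) {
  var threshold = options.failureThreshold;             // failures tolerated before an outage
  var useFirstFailure = options.useFirstFailureTimestamp;
  var failures = {};                                     // service id -> counting state

  return {
    // Records a failed ping. Returns the outage record on the ping that
    // creates it (the caller would emit new-outage then), otherwise null.
    onFailure: function (service, timestamp, error) {
      var state = failures[service.id] ||
        (failures[service.id] = { count: 0, firstTimestamp: timestamp, outage: null });
      state.count += 1;

      // Below or at the threshold, or the outage already exists: nothing new to report.
      if (state.outage || state.count <= threshold) {
        return null;
      }

      state.outage = {
        // Either the first failure or the failure that crossed the threshold.
        timestamp: useFirstFailure ? state.firstTimestamp : timestamp,
        error: error
      };
      return state.outage;
    },

    // A successful ping clears the count, so only consecutive failures add up.
    onSuccess: function (service) {
      delete failures[service.id];
    }
  };
}
```

With failureThreshold set to 2, the first two consecutive failures return null and the third returns the outage record; a successful ping in between resets the count.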

iloire commented 9 years ago

This has been fixed and released in 3.1: https://github.com/iloire/watchmen/commit/14924ac91cd943d4e01f0c730bd71928f18d2a94

renderful commented 9 years ago

Very nice! Thank you. I will test this out and look through the code tonight.