Hi @renderful ,
I think you are on the right track. Some additional input:
The new storage method `increaseOutageFailureCount` could increase the failure count and return the current one. Also, maybe there is no need for a new event or for changes to the notifications plugin.
Here is some quick, untested, probably buggy code:
    if (!outage) {
        /**
         * First failure
         */
        var outageData = {
            timestamp: timestamp,
            error: error
        };
        storage.startOutage(service, outageData, function (err) {
            if (err) {
                return callback(err);
            }
            if (!service.failureThreshold) {
                self.emit('new-outage', service, outageData);
            }
            callback(null, service.failureInterval);
        });
    } else {
        /**
         * Not the first ping failure for this outage
         */
        storage.increaseOutageFailureCount(service, function (err, currentFailureCount) {
            if (err) {
                return callback(err);
            }
            if (!service.failureThreshold) {
                self.emit('current-outage', service, outage);
            } else {
                if (currentFailureCount === service.failureThreshold) {
                    self.emit('new-outage', service, outage);
                } else if (currentFailureCount > service.failureThreshold) {
                    self.emit('current-outage', service, outage);
                }
            }
            callback(null, service.failureInterval);
        });
    }
In the storage, `increaseOutageFailureCount` can call Redis `INCR` on a key. `archiveCurrentOutageIfExists` should delete (or reset) that key.
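A minimal sketch of that counter, under stated assumptions: `createFakeRedis` is a hypothetical in-memory stand-in that models only `INCR` and `DEL` with Node-style callbacks, the key layout and `service.id` are assumptions, and `resetOutageFailureCount` is an illustrative name for the reset that `archiveCurrentOutageIfExists` would perform. It is not the actual watchmen storage code.

```javascript
// Hypothetical in-memory stand-in for a Redis client. Only the two commands
// used below are modelled, each taking a Node-style callback.
function createFakeRedis() {
    var store = {};
    return {
        incr: function (key, cb) {
            // INCR treats a missing key as 0, so the first call returns 1.
            store[key] = (store[key] || 0) + 1;
            cb(null, store[key]);
        },
        del: function (key, cb) {
            // DEL reports how many keys were removed.
            var removed = Object.prototype.hasOwnProperty.call(store, key) ? 1 : 0;
            delete store[key];
            cb(null, removed);
        }
    };
}

// Assumed key layout; the real storage may name its keys differently.
function failureCountKey(service) {
    return service.id + ':failure-count';
}

function createStorage(redis) {
    return {
        // Increments the counter and passes the new value to the callback.
        increaseOutageFailureCount: function (service, callback) {
            redis.incr(failureCountKey(service), callback);
        },
        // Illustrative reset step, to be called when the outage is archived.
        resetOutageFailureCount: function (service, callback) {
            redis.del(failureCountKey(service), function (err) {
                callback(err);
            });
        }
    };
}

// Usage: two failures, a recovery, then a new failure.
var storage = createStorage(createFakeRedis());
var service = { id: 'my-service' };
var counts = [];
function record(err, count) { counts.push(count); }

storage.increaseOutageFailureCount(service, record);
storage.increaseOutageFailureCount(service, record);
storage.resetOutageFailureCount(service, function () {});
storage.increaseOutageFailureCount(service, record);
```

Because `INCR` returns 1 on its first call, the counter in the snippet above would read 1 on the second failed ping (the first one goes through `startOutage`), so the threshold comparison may need adjusting by one.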
Hope it helps. I will be more than happy to review your PR and answer any other questions.
EDIT: probably a better approach is to count failed pings and only create an outage record when failureCount > service.failureThreshold. The `new-outage` event would be triggered in the same fashion when the outage is created. It would be up to us to use the first or the Nth failure as the outage's timestamp.
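A rough sketch of that alternative, kept in memory for brevity. `createFailureTracker`, `onFailure`, `onSuccess`, and `service.id` are hypothetical names, and the choice of the first failure as the outage timestamp is just one of the two options mentioned above.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: counts consecutive failed pings per service and only
// creates an outage once the count exceeds service.failureThreshold.
function createFailureTracker(emitter) {
    var state = {}; // service.id -> { count, firstFailure, outage }

    return {
        onFailure: function (service, timestamp, error) {
            var s = state[service.id];
            if (!s) {
                s = state[service.id] = { count: 0, firstFailure: timestamp, outage: null };
            }
            s.count += 1;

            if (s.outage) {
                // The outage already exists: report it as ongoing.
                emitter.emit('current-outage', service, s.outage);
            } else if (s.count > (service.failureThreshold || 0)) {
                // Assumption: the first failure is used as the outage timestamp.
                s.outage = { timestamp: s.firstFailure, error: error };
                emitter.emit('new-outage', service, s.outage);
            }
        },
        onSuccess: function (service) {
            // A successful ping ends the streak.
            delete state[service.id];
        }
    };
}

// Usage: with a threshold of 2, the third consecutive failure opens the outage.
var emitter = new EventEmitter();
var events = [];
emitter.on('new-outage', function (svc, outage) {
    events.push('new-outage@' + outage.timestamp);
});
emitter.on('current-outage', function () {
    events.push('current-outage');
});

var tracker = createFailureTracker(emitter);
var monitored = { id: 'my-service', failureThreshold: 2 };
tracker.onFailure(monitored, 1000, 'timeout');
tracker.onFailure(monitored, 2000, 'timeout');
tracker.onFailure(monitored, 3000, 'timeout');
tracker.onFailure(monitored, 4000, 'timeout');
```

With no `failureThreshold` set, the first failure opens the outage immediately, which matches the current behaviour.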
This has been fixed and released in 3.1: https://github.com/iloire/watchmen/commit/14924ac91cd943d4e01f0c730bd71928f18d2a94
Very nice! Thank you. I will test this out and look through the code tonight.
I recently set up WatchMen to monitor ~150 websites and web applications. With the default settings it is very chatty: 50-100 emails per hour. I modified the intervals and timeout to be more forgiving, but I am still getting about 3 outage emails per hour. When I go to the sites that are reported as being down, the sites are not down. I believe that my definition of an outage is different from what WatchMen considers an outage.
I only want to know about sustained outages, and it seems that WatchMen does not have this capability at present. A single timeout or connection reset is not indicative of a real outage, but multiple failed pings in sequence would be.
I've looked into the codebase, and it seems like one way to do this would be to create a new event called `sustained-outage`, with a matching callback `onSustainedOutage`. Then create a storage method which could tell us how many failed pings have occurred in a row. `sustained-outage` would be emitted when a configured number of pings failed. Then I'd modify the SES notification plugin to only send its new outage email during the `onSustainedOutage` callback.

Is there a better way? Am I missing the existence of this feature in WatchMen as it is today?