iloire / watchmen

A simple node.js service monitor
MIT License

Watchmen Down. #78

Open nysky1 opened 7 years ago

nysky1 commented 7 years ago

I'm using the docker version of watchmen. Love it! But every few months, watchmen seems to have a problem: the service begins marking almost all sites (I currently have 12) as down when they are not. They are marked active again after about a minute (the check interval), only to be marked down again shortly thereafter. I generally just reboot the server, but today that didn't fix it. Can I get you a log or anything else that might help both of us? The issue is active as we speak.

nysky1 commented 7 years ago

Running 3.3.1.

watchmenserver_1 | HWL check failed!. Error: {"code":"ETIMEDOUT","connect":true}
watchmenserver_1 | HWL is still down!. Error: {"code":"ETIMEDOUT","connect":true}
watchmenserver_1 | FFP (API) check failed!. Error: {"code":"ESOCKETTIMEDOUT","connect":false}
watchmenserver_1 | FFP (API) down!. Error: {"code":"ESOCKETTIMEDOUT","connect":false}
watchmenserver_1 | FFP (Mobile) check failed!. Error: {"code":"ESOCKETTIMEDOUT","connect":false}
watchmenserver_1 | FFP (Mobile) down!. Error: {"code":"ESOCKETTIMEDOUT","connect":false}
watchmenserver_1 | MM (Registration) check failed!. Error: {"code":"ESOCKETTIMEDOUT","connect":false}
watchmenserver_1 | MM (Registration) down!. Error: {"code":"ESOCKETTIMEDOUT","connect":false}
watchmenserver_1 | MC Prod check failed!. Error: {"code":"ETIMEDOUT","connect":false}
watchmenserver_1 | MC Prod down!. Error: {"code":"ETIMEDOUT","connect":false}

nysky1 commented 7 years ago

After watching the response time for each site from inside the docker machine, I saw that most sites were taking MUCH longer than normal, which could be attributed to slow DNS lookups. Perhaps there were not enough threads available, thus causing the ESOCKETTIMEDOUT messages? Are there any guidelines on optimal thread configuration for Ubuntu? And would it be best to configure threads inside run-monitor-server.js?

Something like... ....

var WatchMenFactory = require('./lib/watchmen');
var sentinelFactory = require('./lib/sentinel');

process.env.UV_THREADPOOL_SIZE = 10; //<-- Insert?

var RETURN_CODES = {
  OK: 0,
  BAD_STORAGE: 1,
  GENERIC_ERROR: 2
};

....

Thanks!
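One way to check the slow-DNS theory before changing the pool size is to time the lookups directly from inside the container. The sketch below is only an illustration: `timeLookup` and `timeLookups` are hypothetical helper names, and the hostnames are whatever the monitored services use. It relies on `dns.lookup`, which Node runs on the libuv thread pool (default size 4), so several slow lookups at once can queue behind each other.

```javascript
const dns = require('dns');

// Time a single lookup. `lookupFn` defaults to dns.lookup; it is a parameter
// only so that a stand-in can be passed when no network is available.
function timeLookup(hostname, lookupFn = dns.lookup) {
  return new Promise((resolve) => {
    const started = process.hrtime();
    lookupFn(hostname, (err, address) => {
      const [sec, nsec] = process.hrtime(started);
      resolve({
        hostname,
        ms: sec * 1e3 + nsec / 1e6, // elapsed time in milliseconds
        error: err ? err.code || err.message : null,
        address: err ? null : address
      });
    });
  });
}

// Start all lookups at once, the way a burst of checks would,
// and resolve with one timing record per hostname.
function timeLookups(hostnames, lookupFn = dns.lookup) {
  return Promise.all(hostnames.map((name) => timeLookup(name, lookupFn)));
}
```

Calling `timeLookups(['example.com', 'example.org']).then(console.log)` prints one record per host. If the `ms` values climb sharply as more hosts are added, the lookups are probably queuing. As far as I know the pool size is read when the pool is first used, so setting `UV_THREADPOOL_SIZE` in the container's environment should be a safer place for it than a line in the script.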

elboletaire commented 7 years ago

I'm having the same issue, but not after a few months. With a fresh install, watchmen randomly reports some services as down when they are not.

The error messages tend to be ETIMEDOUT errors.

nysky1 commented 7 years ago

The problem appears when DNS lookups get slow. I believe I tweaked some timeout settings in the nginx config.

(Sent by email in reply to Òscar Casajuana's comment of Apr 11, 2017.)