shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0
1.13k stars 336 forks source link

Shinken stops doing anything intermittently #1636

Open dmsimard opened 9 years ago

dmsimard commented 9 years ago

We have a pretty annoying issue that persists even after upgrading to Shinken 2.4: Every hour or so, Shinken schedulers and pollers will stop doing anything for up to 10 minutes.

We've tried many things in order to try and resolve the issue but running out of ideas right now..

Some insight I believe could be relevant on what the setup and configuration looks like:

The realm configuration looks like this:

Realm "All":

Realm "Default":

Realm "01":

Realm "N":

Some data on the problem: Performance info

Graphs

Looking at the logs, this seems to correspond to regular re-distribution of the configuration:

[Tue May 19 02:16:56 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 02:16:56 2015] WARNING: [Shinken] 6 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 02:16:56 2015] WARNING: [Shinken] A system time change of 932 has been detected. Compensating...
[Tue May 19 02:17:33 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 02:17:33 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration received
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 02:17:33 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 02:17:33 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 02:17:35 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 02:17:35 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 02:17:36 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 02:21:59 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 02:22:00 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld
[Tue May 19 03:33:54 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 03:33:54 2015] WARNING: [Shinken] 11 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 03:33:54 2015] WARNING: [Shinken] A system time change of 933 has been detected. Compensating...
[Tue May 19 03:34:11 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 03:34:11 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration received
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 03:34:11 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 03:34:11 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 03:34:13 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 03:34:13 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 03:34:14 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld
[Tue May 19 04:50:43 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 04:50:43 2015] WARNING: [Shinken] 15 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 04:50:43 2015] WARNING: [Shinken] A system time change of 944 has been detected. Compensating...
[Tue May 19 04:51:36 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 04:51:36 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 04:51:36 2015] INFO: [Shinken] New configuration received
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 04:51:37 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 04:51:37 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 04:51:37 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 04:51:39 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 04:51:39 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld

We definitely don't reload the configuration on a regular basis so it looks like this is Shinken's own doing.

There is also what might be related but could be otherwise another issue - our Arbiter often gets timeouts when distributing the configuration to the schedulers and pollers. This isn't a network issue - if you curl the https link directly, it will not respond "OK", it will really timeout and doing a tcpdump, you see that the curl request indeed makes it to the server. It looks like the poller or the scheduler gets stuck on something but eventually comes back available by itself. Stopping and starting the services resolves this but is not a viable solution. I turned on debug logging on the pollers/schedulers but haven't found anything that seemed relevant.

Any ideas ? I can provide more details if need be, this is a critical issue for us.

olivierHa commented 9 years ago

Did you try without the redis retention module ? Le 19 mai 2015 8:15 PM, "dmsimard" notifications@github.com a écrit :

We have a pretty annoying issue that persists even after upgrading to Shinken 2.4: Every hour or so, Shinken schedulers and pollers will stop doing anything for up to 10 minutes.

We've tried many things in order to try and resolve the issue but running out of ideas right now..

Some insight I believe could be relevant on what the setup and configuration looks like:

  • All servers are running Ubuntu 14.04 with the latest packages, including Shinken 2.4.
  • Scheduler retention is in redis on the host where the arbiter is, latest version of the module.
  • All servers are ntp-synchronized (logs seem to imply something is off..)
  • Schedulers and pollers are SSL but not hard_ssl.
  • 1k+ hosts, 15k+ services across 5 realms (5 realms member of "All", nothing is directly in "All")
  • Livestatus and Thruk

The realm configuration looks like this:

Realm "All":

  • Arbiter
  • Broker
  • Reactioner

Realm "Default":

  • Two active schedulers
  • One poller

Realm "01":

  • One scheduler
  • One poller

Realm "N":

  • One scheduler
  • One poller

Some data on the problem: [image: Performance info] https://camo.githubusercontent.com/3a4cfbb7cc7e6cd60294f1480889a19052984002/687474703a2f2f692e696d6775722e636f6d2f3869504d4e64362e706e67

[image: Graphs] https://camo.githubusercontent.com/a8fd9aea2afa74f51e1d9cf38b714f4e68423d8d/687474703a2f2f692e696d6775722e636f6d2f344276366d78662e706e67

Looking at the logs, this seems to correspond to regular re-distribution of the configuration:

[Tue May 19 02:16:56 2015] INFO: [Shinken] Retention information updated in Redis [Tue May 19 02:16:56 2015] WARNING: [Shinken] 6 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling [Tue May 19 02:16:56 2015] WARNING: [Shinken] A system time change of 932 has been detected. Compensating... [Tue May 19 02:17:33 2015] INFO: [Shinken] Retention information updated in Redis [Tue May 19 02:17:33 2015] INFO: [Shinken] Waiting for initial configuration [Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration received [Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :| [Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :| [Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :| [Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :| [Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :| [Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :| [Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :| [Tue May 19 02:17:33 2015] INFO: [Shinken] Trying to init module: RedisRetention [Tue May 19 02:17:33 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention] [Tue May 19 02:17:33 2015] INFO: [Shinken] Loading configuration. [Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration loaded [Tue May 19 02:17:35 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully. [Tue May 19 02:17:35 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched [Tue May 19 02:17:36 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done [Tue May 19 02:21:59 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld [Tue May 19 02:22:00 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld [Tue May 19 03:33:54 2015] INFO: [Shinken] Retention information updated in Redis [Tue May 19 03:33:54 2015] WARNING: [Shinken] 11 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling [Tue May 19 03:33:54 2015] WARNING: [Shinken] A system time change of 933 has been detected. Compensating... [Tue May 19 03:34:11 2015] INFO: [Shinken] Retention information updated in Redis [Tue May 19 03:34:11 2015] INFO: [Shinken] Waiting for initial configuration [Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration received [Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :| [Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :| [Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :| [Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :| [Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :| [Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :| [Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :| [Tue May 19 03:34:11 2015] INFO: [Shinken] Trying to init module: RedisRetention [Tue May 19 03:34:11 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention] [Tue May 19 03:34:11 2015] INFO: [Shinken] Loading configuration. [Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration loaded [Tue May 19 03:34:13 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully. [Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched [Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done [Tue May 19 03:34:13 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld [Tue May 19 03:34:14 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld [Tue May 19 04:50:43 2015] INFO: [Shinken] Retention information updated in Redis [Tue May 19 04:50:43 2015] WARNING: [Shinken] 15 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling [Tue May 19 04:50:43 2015] WARNING: [Shinken] A system time change of 944 has been detected. Compensating... [Tue May 19 04:51:36 2015] INFO: [Shinken] Retention information updated in Redis [Tue May 19 04:51:36 2015] INFO: [Shinken] Waiting for initial configuration [Tue May 19 04:51:36 2015] INFO: [Shinken] New configuration received [Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :| [Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :| [Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :| [Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :| [Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :| [Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :| [Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :| [Tue May 19 04:51:37 2015] INFO: [Shinken] Trying to init module: RedisRetention [Tue May 19 04:51:37 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention] [Tue May 19 04:51:37 2015] INFO: [Shinken] Loading configuration. [Tue May 19 04:51:37 2015] INFO: [Shinken] New configuration loaded [Tue May 19 04:51:39 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully. [Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched [Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done [Tue May 19 04:51:39 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld [Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld

We definitely don't reload the configuration on a regular basis so it looks like this is Shinken's own doing.

Any ideas ? I can provide more details if need be, this is a critical issue for us.

— Reply to this email directly or view it on GitHub https://github.com/naparuba/shinken/issues/1636.

dmsimard commented 9 years ago

@olivierHa Part of the things we tried was to switch to Redis, we previously used Memcached retention.

olivierHa commented 9 years ago

Have you tried to launch a daemon using the cli to see more "output/debug" messages ? And did you try without retention module at all ?

2015-05-19 20:41 GMT+02:00 dmsimard notifications@github.com:

@olivierHa https://github.com/olivierHa Part of the things we tried was to switch to Redis, we previously used Memcached retention.

— Reply to this email directly or view it on GitHub https://github.com/naparuba/shinken/issues/1636#issuecomment-103627365.

dmsimard commented 9 years ago

There's no obvious errors in the logs, even in debug output. Using retention to file (pickle) for scheduler inhibits this problem (shinken doesn't stop anything every hour or so) but reloading the arbiter flushes all the retention.

We don't want to lose the retention (acknowlegements, service statuses, etc.) across shinken reloads.

olivierHa commented 9 years ago

In debug mode you should see when redis retention is called ('asking ...') Maybe saving to redis is taking too much time. And there is something wrong with the time. Le 20 mai 2015 12:19 AM, "dmsimard" notifications@github.com a écrit :

There's no obvious errors in the logs, even in debug output. Using retention to file (pickle) for scheduler inhibits this problem (shinken doesn't stop anything every hour or so) but reloading the arbiter flushes all the retention.

We don't want to lose the retention (acknowlegements, service statuses, etc.) across shinken reloads.

— Reply to this email directly or view it on GitHub https://github.com/naparuba/shinken/issues/1636#issuecomment-103682954.

dmsimard commented 9 years ago

So further troubleshooting this got me thinking the issue might have been related to the fact that there are two active schedulers in one of my realms (the bulk of the hosts and services are there).

My understanding when you have multiple schedulers in one realm is that the arbiter will distribute evenly your highest level of hosts/services (parent hosts) across your schedulers. It looks like this distribution is random and non-determinist and might be the root cause of the problem where hosts and services are shuffled around, causing retention-related issues.

Yesterday, I've removed one of the two schedulers in this specific realm and even with the redis retention, I am no longer having the issue where shinken stops doing anything for 10 minutes every hour.

fpeterschmitt commented 8 years ago

Might be related to #1807 too. Shinken seems not to like many schedulers...