shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0

[shinken 2.4] pollers eating CPU doing nothing at all #1798

fpeterschmitt closed this issue 8 years ago

fpeterschmitt commented 8 years ago

Hi,

I have a big problem with Shinken 2.4 eating CPU for nothing. I manage the configuration with Ansible, so I was able to do a brand-new install from scratch on Debian Jessie, with Shinken 2.4, in LXC containers running kernel 4.3.3.

I had the same problem with Debian Jessie virtual machines under VMware.

The problem is that all active pollers eat 100% CPU, leading to a load of 15 on a 2-core/4-thread Intel i3, even though there is not a single host or service in any realm. I was about to add some host/service configuration when I noticed this, and I think daemons eating CPU while they have nothing to do is likely a bug.

Let me share my full configuration and setup with Shinken 2.4.

Here is part of the architecture diagram:

[architecture diagram]

It shows two realms spread over six machines; one machine is still missing for the third and last realm.

The idea is to always have a scheduler/poller doing its job for the DE and BE realms, and to have a broker/reactionner/receiver and a master arbiter available on both DE and BE, with BE as the "preferred" master.

The other realm, missing from the diagram, is mostly standalone: it has its own dedicated poller/scheduler with no spare, and its own broker/reactionner/receiver in case none is available in DE or BE.
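The realm definitions themselves (All, BE, DE and FR) are not included in the dump below; they look roughly like this (a minimal sketch, with All as the default top-level realm containing the three others):

define realm {
    realm_name      All
    realm_members   BE, DE, FR
    default         1
    }

define realm {
    realm_name      BE      ; DE and FR are defined the same way
    }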

If I'm not clear, tell me and I'll do my best to explain.

Here is the full configuration:

# file managed by Ansible

define arbiter {
    arbiter_name    arbiter_be-01
    address         10.2.3.1
    port            7770
    spare           0
    host_name       be-01

    modules             
    use_ssl             0
    hard_ssl_name_check 0
    }

define broker {
    broker_name     be-01
    address         10.2.3.1
    port            7772
    spare           0

    manage_arbiters     1
    manage_sub_realms   1

    timeout             3
    data_timeout        120
    max_check_attempts  3
    check_interval      60

    modules         livestatus

    use_ssl             0
    hard_ssl_name_check 0

    realm           All

    }

define reactionner {
    reactionner_name    be-01
    address             10.2.3.1
    port                7769
    spare               0

    ## Optional
    manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
    min_workers         1   ; Starts with N processes (0 = 1 per CPU)
    max_workers         15  ; No more than N processes (0 = 1 per CPU)
    polling_interval    1   ; Get jobs from schedulers every second
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    realm   All

    }

define receiver {
    receiver_name   be-01
    address         10.2.3.1
    port            7773
    spare           0

    modules         

    use_ssl                 0
    hard_ssl_name_check     0

    ## Advanced Feature
    direct_routing      0   ; If enabled, it will directly send commands to the
                            ; schedulers if it knows about the hostname in the
                            ; command.
    realm   All

    }
define arbiter {
    arbiter_name    arbiter_de-05
    address         10.2.3.5
    port            7770
    spare           1
    host_name       de-05

    modules             
    use_ssl             0
    hard_ssl_name_check 0
    }

define broker {
    broker_name     de-05
    address         10.2.3.5
    port            7772
    spare           1

    manage_arbiters     1
    manage_sub_realms   1

    timeout             3
    data_timeout        120
    max_check_attempts  3
    check_interval      60

    modules         livestatus

    use_ssl             0
    hard_ssl_name_check 0

    realm           All

    }

define reactionner {
    reactionner_name    de-05
    address             10.2.3.5
    port                7769
    spare               1

    ## Optional
    manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
    min_workers         1   ; Starts with N processes (0 = 1 per CPU)
    max_workers         15  ; No more than N processes (0 = 1 per CPU)
    polling_interval    1   ; Get jobs from schedulers every second
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    realm   All

    }

define receiver {
    receiver_name   de-05
    address         10.2.3.5
    port            7773
    spare           1

    modules         

    use_ssl                 0
    hard_ssl_name_check     0

    ## Advanced Feature
    direct_routing      0   ; If enabled, it will directly send commands to the
                            ; schedulers if it knows about the hostname in the
                            ; command.
    realm   All

    }

define broker {
    broker_name     fr-08
    address         10.2.3.8
    port            7772
    spare           1

    manage_arbiters     1
    manage_sub_realms   1

    timeout             3
    data_timeout        120
    max_check_attempts  3
    check_interval      60

    modules         livestatus

    use_ssl             0
    hard_ssl_name_check 0

    realm           FR

    }

define reactionner {
    reactionner_name    fr-08
    address             10.2.3.8
    port                7769
    spare               1

    ## Optional
    manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
    min_workers         1   ; Starts with N processes (0 = 1 per CPU)
    max_workers         15  ; No more than N processes (0 = 1 per CPU)
    polling_interval    1   ; Get jobs from schedulers every second
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    realm   FR

    }

define receiver {
    receiver_name   fr-08
    address         10.2.3.8
    port            7773
    spare           1

    modules         

    use_ssl                 0
    hard_ssl_name_check     0

    ## Advanced Feature
    direct_routing      0   ; If enabled, it will directly send commands to the
                            ; schedulers if it knows about the hostname in the
                            ; command.
    realm   FR

    }

define scheduler {
    scheduler_name      be-02
    address             10.2.3.2
    port                7768
    spare               0

    modules             pickle-retention-file
    realm   BE

    # Skip initial broks creation. Boot fast, but some broker modules won't
    # work with it!
    skip_initial_broks  0

    # Enable https or not
    use_ssl           0
    # enable certificate/hostname check to avoid man-in-the-middle attacks
    hard_ssl_name_check   0

    }

define poller {
    poller_name     be-02
    address         10.2.3.2
    port            7771
    spare           0

    modules         booster-nrpe

    use_ssl           0
    hard_ssl_name_check   0

    realm   BE

    }

define scheduler {
    scheduler_name      de-06
    address             10.2.3.6
    port                7768
    spare               0

    modules             pickle-retention-file
    realm   DE

    # Skip initial broks creation. Boot fast, but some broker modules won't
    # work with it!
    skip_initial_broks  0

    # Enable https or not
    use_ssl           0
    # enable certificate/hostname check to avoid man-in-the-middle attacks
    hard_ssl_name_check   0

    }

define poller {
    poller_name     de-06
    address         10.2.3.6
    port            7771
    spare           0

    modules         booster-nrpe

    use_ssl           0
    hard_ssl_name_check   0

    realm   DE

    }

define scheduler {
    scheduler_name      fr-08
    address             10.2.3.8
    port                7768
    spare               0

    modules             pickle-retention-file
    realm   FR

    # Skip initial broks creation. Boot fast, but some broker modules won't
    # work with it!
    skip_initial_broks  0

    # Enable https or not
    use_ssl           0
    # enable certificate/hostname check to avoid man-in-the-middle attacks
    hard_ssl_name_check   0

    }

define poller {
    poller_name     fr-08
    address         10.2.3.8
    port            7771
    spare           0

    modules         booster-nrpe

    use_ssl           0
    hard_ssl_name_check   0

    realm   FR

    }

define scheduler {
    scheduler_name      be-04
    address             10.2.3.4
    port                7768
    spare               1

    modules             pickle-retention-file
    realm   DE

    # Skip initial broks creation. Boot fast, but some broker modules won't
    # work with it!
    skip_initial_broks  0

    # Enable https or not
    use_ssl           0
    # enable certificate/hostname check to avoid man-in-the-middle attacks
    hard_ssl_name_check   0

    }

define poller {
    poller_name     be-04
    address         10.2.3.4
    port            7771
    spare           1

    modules         booster-nrpe

    use_ssl           0
    hard_ssl_name_check   0

    realm   DE

    }

define scheduler {
    scheduler_name      de-07
    address             10.2.3.7
    port                7768
    spare               1

    modules             pickle-retention-file
    realm   BE

    # Skip initial broks creation. Boot fast, but some broker modules won't
    # work with it!
    skip_initial_broks  0

    # Enable https or not
    use_ssl           0
    # enable certificate/hostname check to avoid man-in-the-middle attacks
    hard_ssl_name_check   0

    }

define poller {
    poller_name     de-07
    address         10.2.3.7
    port            7771
    spare           1

    modules         booster-nrpe

    use_ssl           0
    hard_ssl_name_check   0

    realm   BE

    }
geektophe commented 8 years ago

What kind of load is it? Is it user/system CPU, or I/O wait? I had problems under LXC where the master daemon processes, which should only fork a single child each, forked so many child processes that they overloaded the underlying host.

Is it a similar symptom?
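One quick way to check where the cycles go, as a rough sketch using the third-party psutil library (not part of Shinken; the iowait field is Linux-only):

import psutil  # third-party: pip install psutil

# Sample the system-wide CPU time shares over one second.
times = psutil.cpu_times_percent(interval=1)
print("user=%.1f%% system=%.1f%% iowait=%.1f%%"
      % (times.user, times.system, times.iowait))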

fpeterschmitt commented 8 years ago

It is CPU load, and as I said, I had the same problem with VMware virtual machines. Also, there is no such forking here, since I have zero checks configured: there are no hosts and no services.

So no, it's not the same symptom.

jfpik commented 8 years ago

Same issue observed on our side. The problem is not really in the poller itself, but in the NRPE booster module.

Looking at the source code, we saw that all the temporisation (sleep/timeout) logic has been removed, so that a poller with nothing to do loops forever trying to get something to do. We wrote a small patch for booster_nrpe; the poller now uses 2% CPU. It can use even less if you increase the timeout value: we currently use 0.1 s, which means an idle process wakes up 10 times per second to check whether something has arrived.
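The pattern of the fix, as a minimal standalone sketch (modeling the module's input channel as a standard queue; the names and the 0.1 s value mirror the patch below, the rest is illustrative):

import Queue  # Python 2 stdlib, as used by Shinken 2.x

def get_new_checks(s, has_running):
    # Block only when nothing is in flight, so finished checks are
    # still reaped promptly; a 0.1 s timeout means an idle worker
    # wakes at most 10 times per second instead of spinning.
    block = not has_running
    timeout = 0.1 if block else 0
    try:
        msg = s.get(block, timeout)
    except Queue.Empty:
        return 0  # nothing arrived within the timeout
    # ... handle msg (in the real module: append the check to self.checks) ...
    return 1

q = Queue.Queue()
# Idle worker: this call now sleeps up to 0.1 s instead of busy-polling.
get_new_checks(q, has_running=False)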

Here is the change in diff format:

# diff booster_nrpe.py booster_nrpe.py.new
224a225
>                     logger.debug("... host=\'{0}' port={1}".format(self.nrpe.host, self.nrpe.port) )
383c384,393
<     def get_new_checks(self):
---
>     def get_new_checks(self, has_running=False):
> 
>                     ## If the main loop is not waiting for anything, block on incoming actions.
>         block = False
>         timeout = 0
>         inserted = 0
> 
>         if not has_running:
>             block = True
>             timeout = 0.1
386c396
<                 msg = self.s.get(block=False)
---
>                 msg = self.s.get(block, timeout)
388c398,399
<                 return
---
>                 return inserted
> 
389a401
>                 inserted = 1
392a405,406
>         return inserted
> 
395a410,411
> 
>         launched = 0
422c438
<                                       'is not correct.', 8012)
---
>                                       'or command are not correct.', 8012)
426,427d441
<                 # if no command is specified, check_nrpe
<                 # sends _NRPE_CHECK as default command.
429c443
<                     command = '_NRPE_CHECK'
---
>                      command='_NRPE_CHECK'
436a451,452
>                 launched += 1
>         return launched
486a503,505
>         # count the number of check still in list
>         return len(self.checks)
> 
511a531,535
>         has_running = False
>         received = 0
>         launched = 0
>         waiting = 0
> 
523c547,548
<                 self.get_new_checks()
---
>                 received = self.get_new_checks(has_running)
> 
525c550
<                 self.launch_new_checks()
---
>                 launched = self.launch_new_checks()
528c553,559
<             self.manage_finished_checks()
---
>             waiting = self.manage_finished_checks()
> 
>             if received > 0 or launched > 0 or waiting > 0:
>                 has_running = True
>                 logger.debug("[nrpebooster][do_work] received={0} launched={1} waiting={2}".format(received, launched, waiting) )
>             else:
>                 has_running = False
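
Note the design choice: has_running is recomputed on every loop iteration, so the worker only blocks in get_new_checks when nothing was received, launched or still waiting in the previous pass. In-flight checks are therefore reaped with no added latency, and a fully idle worker wakes at most ten times per second.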
olivierHa commented 8 years ago

So this is an nrpe-booster issue? Could you provide a PR here: https://github.com/shinken-monitoring/mod-booster-nrpe ?

naparuba commented 8 years ago

I'm closing this, as it's a booster-nrpe module issue (I should never have written that module; it's an endless source of bug entropy...)