fpeterschmitt closed this issue 8 years ago.
What kind of load is it? Is it user/system CPU, or I/O wait? I had problems under LXC where the master daemon processes, which should only fork a single child, forked so many child processes that they overloaded the underlying host.
Is it a similar symptom?
It is CPU load, and as I said, I had the same problem under VMware virtual machines. Also, there are no such forks, since I have 0 checks configured: there are no hosts and no services configured at all.
So, no, it's not the same symptom.
Same issue observed for us. The problem is not really with the poller, but with the NRPE booster.
Having a look at the source code, we saw that all the throttling (temporisation) calls have been removed, so when the poller has nothing to do, it loops forever asking for something to do. We made a small patch for booster_nrpe; the poller now uses 2% CPU. It can use even less if you increase the timeout value. We currently set it to 0.1 s, meaning idle processes wake up 10 times per second to check whether something has arrived.
Here is the code in diff mode:
# diff booster_nrpe.py booster_nrpe.py.new
224a225
> logger.debug("... host=\'{0}' port={1}".format(self.nrpe.host, self.nrpe.port) )
383c384,393
< def get_new_checks(self):
---
> def get_new_checks(self, has_running=False):
>
>     ## if the main loop is not waiting for something, block on incoming actions.
> block = False
> timeout = 0
> inserted = 0
>
> if not has_running:
> block = True
> timeout = 0.1
386c396
< msg = self.s.get(block=False)
---
> msg = self.s.get(block, timeout)
388c398,399
< return
---
> return inserted
>
389a401
> inserted = 1
392a405,406
> return inserted
>
395a410,411
>
> launched = 0
422c438
< 'is not correct.', 8012)
---
> 'or command are not correct.', 8012)
426,427d441
< # if no command is specified, check_nrpe
< # sends _NRPE_CHECK as default command.
429c443
< command = '_NRPE_CHECK'
---
> command='_NRPE_CHECK'
436a451,452
> launched += 1
> return launched
486a503,505
> # count the number of check still in list
> return len(self.checks)
>
511a531,535
> has_running = False
> received = 0
> launched = 0
> waiting = 0
>
523c547,548
< self.get_new_checks()
---
> received = self.get_new_checks(has_running)
>
525c550
< self.launch_new_checks()
---
> launched = self.launch_new_checks()
528c553,559
< self.manage_finished_checks()
---
> waiting = self.manage_finished_checks()
>
> if received > 0 or launched > 0 or waiting > 0:
> has_running = True
> logger.debug("[nrpebooster][do_work] received={0} launched={1} waiting={2}".format(received, launched, waiting) )
> else:
> has_running = False
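For readers who prefer plain code to diff hunks, here is a minimal, self-contained sketch of the idea behind the patch (the class and names below are illustrative, not the actual booster_nrpe code): when the previous iteration had nothing running, the worker blocks on its input queue with a 0.1 s timeout instead of polling it in a tight loop.

    import Queue  # Python 2 stdlib module, matching Shinken 2.x

    class IdleFriendlyWorker(object):
        def __init__(self, in_queue):
            self.s = in_queue   # queue of incoming check messages
            self.checks = []    # checks waiting to be launched

        def get_new_checks(self, has_running=False):
            # Block (with a small timeout) only when nothing is running,
            # so an idle worker sleeps instead of spinning at 100% CPU.
            block = not has_running
            timeout = 0.1 if block else 0
            inserted = 0
            while True:
                try:
                    msg = self.s.get(block, timeout)
                except Queue.Empty:
                    return inserted   # nothing (more) to read for now
                if msg is not None:
                    inserted = 1
                    self.checks.append(msg)

    # An idle worker now sleeps up to 0.1 s per empty poll instead of spinning:
    worker = IdleFriendlyWorker(Queue.Queue())
    worker.get_new_checks(has_running=False)  # returns 0 after roughly 0.1 s

In do_work, the counters returned by get_new_checks, launch_new_checks and manage_finished_checks are then used to decide whether the next iteration should block, as the last hunk of the diff shows.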
So this is a nrpe booster issue? Could you provide a PR here: https://github.com/shinken-monitoring/mod-booster-nrpe ?
I'm closing this, as it's a booster-nrpe module issue (I should never have written this module, an endless bug entropy source...)
Hi,
I have a big problem with Shinken 2.4 eating CPU for nothing. I manage the configuration with Ansible, so I was able to do a brand new install from scratch, on Debian Jessie, with Shinken 2.4, in LXC containers with kernel 4.3.3.
I had the same problem with Debian Jessie virtual machines under VMware.
The problem is that all active pollers are eating 100% CPU, leading to a load of 15 on a 2-core/4-thread Intel i3. But there are no hosts and no services at all in any realm. I was about to add some host/service configuration, but I saw this first, and I think that programs eating CPU while they have nothing to do could be a bug.
Let me submit my full configuration & setup with Shinken 2.4: three realms, FR, BE and DE, members of the All realm. Here is part of the architecture diagram:
Here are two realms, with 6 machines. One machine is still missing for the third and last realm:
The idea here is to always have a scheduler/poller doing its job for the DE and BE realms, and to have a Broker/Reactionner/Receiver and Master available on DE and BE, with BE as the "preferred" master.
The other realm, missing from the diagram, is somewhat standalone: it has a dedicated poller/scheduler, no spare, and its own B/R/R if none is available in DE or BE.
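To make this concrete, a stripped-down sketch of the kind of realm and daemon definitions this layout implies could look like the following (hostnames, ports and names are placeholders, not my actual configuration; the DE and FR definitions would follow the same pattern):

    define realm {
        realm_name      All
        realm_members   FR,BE,DE
        default         1
    }

    define realm {
        realm_name      BE
    }

    define scheduler {
        scheduler_name  scheduler-BE
        address         be-node1.example.org
        port            7768
        realm           BE
    }

    define poller {
        poller_name     poller-BE
        address         be-node1.example.org
        port            7771
        realm           BE
    }

    define arbiter {
        arbiter_name    arbiter-BE       ; "preferred" master
        host_name       be-node1
        address         be-node1.example.org
        port            7770
        spare           0
    }

    define arbiter {
        arbiter_name    arbiter-DE       ; spare master on the DE side
        host_name       de-node1
        address         de-node1.example.org
        port            7770
        spare           1
    }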
If I'm not clear, tell me, I'll do my best.
Here is the full configuration: