Hello everyone,
Good afternoon.
I've been trying to solve this issue for a long time now, but troubleshooting it is rather complicated since I'm a bit of a novice with this setup. Here in our company we have a Nagios XI (Core 4.2.4) instance being offloaded by a Gearman server, which is accessed through port forwarding by workers at remote sites, each processing a single hostgroup queue in its own local network.
The problem starts after around 3 to 4 hours of processing: watching gearman_top, we noticed that "Jobs Waiting" begins to pile up in the check_results queue, and this value keeps getting larger by the minute without ever going down. Meanwhile, the Nagios XI services and hosts stop being processed completely and indefinitely until we restart the gearmand and nagios services. We tried installing the latest gearmand-server version provided by https://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf as well as the one from the Consol Labs repositories, but nothing seems to change this behaviour.
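In case it helps anyone reproduce the observation, here is a minimal sketch (not something we run in production) of reading the same counters that gearman_top shows, by sending gearmand's plain-text `status` admin command to port 4730. It assumes the usual reply format of one tab-separated line per queue (name, total jobs, running jobs, available workers) ending with a single `.` line, and it assumes that "Jobs Waiting" corresponds to total minus running. The function names and the default host are only illustrative.

```python
import socket


def parse_status(text):
    """Parse the reply to gearmand's text 'status' command.

    Assumed line format: NAME<TAB>TOTAL<TAB>RUNNING<TAB>WORKERS,
    with a lone '.' marking the end of the reply.
    """
    queues = {}
    for line in text.splitlines():
        if line == ".":
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that does not look like a queue line
        name, total, running, workers = parts
        queues[name] = {
            # assumption: waiting jobs = total jobs - running jobs
            "waiting": int(total) - int(running),
            "running": int(running),
            "workers": int(workers),
        }
    return queues


def fetch_status(host="127.0.0.1", port=4730, timeout=5.0):
    """Ask a gearmand instance for its queue status and parse it."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # read until the terminating '.' line or until the server closes
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_status(data.decode("ascii", "replace"))
```

Calling `fetch_status()` every minute and logging the `check_results` entry would show when the backlog starts growing, and whether the "workers" count for that queue has dropped to zero at that moment.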
While the jobs are stuck, "netstat -anp | grep 4730" shows a large number of CLOSE_WAIT connections for each of the workers. Our setup consists of around 1050 services, of which 500 or so are handled by a total of 15 workers. Please, would you be able to shed some light on what's going on?
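To put numbers on the CLOSE_WAIT observation, here is a minimal sketch that tallies connection states per remote address from saved `netstat -an` (or `netstat -anp`) output. It assumes the Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name `count_states` is just illustrative.

```python
from collections import Counter

GEARMAN_PORT = "4730"


def count_states(netstat_output, port=GEARMAN_PORT):
    """Count TCP connection states per remote host for one port.

    Returns a Counter keyed by (remote_host, state).
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # only TCP lines carry a state column; skip headers and UDP/unix lines
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        local_port = local.rsplit(":", 1)[-1]
        remote_host, _, remote_port = remote.rpartition(":")
        if port not in (local_port, remote_port):
            continue
        counts[(remote_host, state)] += 1
    return counts
```

Feeding it the text of `netstat -an` taken a few minutes apart, and comparing the `CLOSE_WAIT` counts per worker address, should show whether the half-closed connections keep accumulating before the queue stalls.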
Thank you very much for your attention and time!
Best regards, Ramiro Fróes Ferrão
PS: The gearmand-server package currently installed is 0.33-2, as provided by the Nagios XI documentation.