sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Nagios4+ModGearman3 hanging #109

Closed Rfferrao87 closed 5 years ago

Rfferrao87 commented 7 years ago

Hello everyone,

Good afternoon.

I've been trying to solve this issue for a long time now, but troubleshooting it is rather complicated since I'm a bit of a novice with this setup. Here in our company we have a Nagios XI (Core 4.2.4) instance whose checks are offloaded to a Gearman server. Workers at remote sites reach that server through port forwarding, and each worker processes a single hostgroup queue in its own local network.

The problem starts after around 3 to 4 hours of processing: watching gearman_top, we noticed that "Jobs Waiting" begins to pile up in the check_results queue, and the value keeps growing by the minute without ever going down. Meanwhile, Nagios XI stops processing services and hosts entirely, and stays that way until we restart the gearmand and nagios services. We tried installing the latest gearmand-server version provided by https://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf as well as the one from the Consol Labs repositories, but nothing seems to change this behaviour.
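For anyone who wants to track the backlog outside gearman_top, here is a minimal Python sketch that queries gearmand's plain-text admin interface with the `status` command. It assumes gearmand listens on 127.0.0.1:4730 and that "waiting" can be derived as total minus running jobs; the function names `parse_status` and `fetch_status` are made up for this example.

```python
import socket


def parse_status(text):
    """Parse the reply to the gearmand admin command `status`.

    Each line holds four tab-separated fields:
    function name, total jobs, running jobs, available workers.
    The reply ends with a line containing a single dot.
    """
    queues = {}
    for line in text.splitlines():
        if line == ".":
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a status row
        name, total, running, workers = parts
        total, running, workers = int(total), int(running), int(workers)
        queues[name] = {
            # Assumption: "total" includes running jobs, so the
            # difference is the number of jobs still waiting.
            "waiting": total - running,
            "running": running,
            "workers": workers,
        }
    return queues


def fetch_status(host="127.0.0.1", port=4730, timeout=5.0):
    """Send `status` to gearmand and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # Read until the terminating ".\n" line arrives.
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")
```

Calling `parse_status(fetch_status())["check_results"]["waiting"]` at intervals and logging the result would show whether the check_results queue ever drains.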

While the jobs are stuck, "netstat -anp | grep 4730" shows a large number of CLOSE_WAIT connections for each of the workers. Our setup consists of around 1050 services, of which 500 or so are handled by a total of 15 workers. Could you please shed some light on what's going on?
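To put a number on the CLOSE_WAIT observation, a rough Python sketch like the following can count those connections per remote address from saved `netstat -anp` output. It assumes the usual Linux column layout (protocol, Recv-Q, Send-Q, local address, foreign address, state); `close_wait_by_peer` is a hypothetical helper name.

```python
from collections import Counter


def close_wait_by_peer(netstat_output, port=4730):
    """Count CLOSE_WAIT TCP connections on `port`, grouped by remote host."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect: proto, recv-q, send-q, local, foreign, state, ...
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "CLOSE_WAIT":
            continue
        # rpartition keeps IPv6 addresses intact and splits off the port.
        _, _, local_port = local.rpartition(":")
        remote_host, _, remote_port = remote.rpartition(":")
        if str(port) not in (local_port, remote_port):
            continue
        counts[remote_host] += 1
    return counts
```

Feeding it the text captured from `netstat -anp` on the gearmand host gives one count per peer, which makes it easier to see whether the number grows over the hours before the hang.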

Thank you very much for your attention and time!

Best regards, Ramiro Fróes Ferrão

PS: The gearmand-server package currently installed is 0.33-2, as provided by the Nagios XI documentation.

p-alik commented 7 years ago

The issue is related to #105. Here is the topic in the gearmand group

Rfferrao87 commented 5 years ago

This is an old issue (which I had forgotten about); it was solved by raising the number of threads for the gearmand service.
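The comment does not say which value was used or where it was set. gearmand takes its I/O thread count from the `-t` / `--threads` command-line option; where that option lives (init script, sysconfig file, service unit) depends on the package. As a small sketch for checking what a running instance was started with, the hypothetical helper below extracts the value from an argument list, for example one read from the process command line.

```python
def threads_from_args(args):
    """Return the thread count passed via -t / --threads, or None if absent."""
    for i, arg in enumerate(args):
        if arg.startswith("--threads="):
            return int(arg.split("=", 1)[1])
        if arg in ("--threads", "-t") and i + 1 < len(args):
            return int(args[i + 1])
        # Short option with the value attached, e.g. "-t8".
        if arg.startswith("-t") and arg[2:].isdigit():
            return int(arg[2:])
    return None
```

A result of `None` means the option was not given, so gearmand is running with whatever default its version ships with.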