sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Error if one of several gearmand servers is down #75

Closed adrianlzt closed 9 years ago

adrianlzt commented 9 years ago

Environment:

RedHat 6.5
gearmand-server-0.33-2.x86_64
gearmand-0.33-2.x86_64
mod_gearman-1.5.2-1.el6.x86_64

mod_gearman_neb.conf

server=localhost:4730
server=172.16.1.31:4730

If the host 172.16.1.31 is down, mod_gearman_neb writes this to the log file:

[2015-02-27 10:43:51][2769][ERROR] sending job to gearmand failed: connect_poll(No route to host) getsockopt() failed -> libgearman/connection.cc:104
[2015-02-27 10:43:51][2769][ERROR] worker error: connect_poll(No route to host) getsockopt() failed -> libgearman/connection.cc:104
[2015-02-27 10:43:54][2769][ERROR] worker error: connect_poll(No route to host) getsockopt() failed -> libgearman/connection.cc:104

However, if the host 172.16.1.31 is up and only the gearmand server on it is down, everything works properly.
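To tell the two failure modes apart outside of mod_gearman, a plain TCP connect to each configured server is enough. The following is a minimal sketch, not part of mod_gearman: the function name `probe_gearmand` and the result labels are made up for illustration, and it assumes that a host that is up with gearmand stopped refuses the connection, while a host that is down yields "No route to host" or a timeout.

```python
import errno
import socket


def probe_gearmand(host, port=4730, timeout=2.0):
    """Classify a TCP connect attempt to a gearmand server.

    Hypothetical helper for illustration; returns one of
    "ok", "refused", "timeout", "unreachable" or "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"            # something is listening on the port
    except ConnectionRefusedError:
        return "refused"           # host is up, nothing listening (gearmand down)
    except socket.timeout:
        return "timeout"           # no answer at all within the timeout
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"   # e.g. "No route to host" (host down)
        return "error"             # anything else, e.g. name resolution failure
```

Calling `probe_gearmand("localhost")` and `probe_gearmand("172.16.1.31")` with the two servers from the config above should show whether the second server is refusing connections or is unreachable.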

I have compiled mod_gearman 1.5.2 against gearmand 1.0.6 and it works correctly. Packages:

https://drive.google.com/open?id=0B84NO1oyhUXva3A5NHdnMThReGM&authuser=0
https://drive.google.com/open?id=0B84NO1oyhUXvWHNPMzhCcTE4UDQ&authuser=0
https://drive.google.com/open?id=0B84NO1oyhUXvMlNCN0IzQW9rOEU&authuser=0
https://drive.google.com/open?id=0B84NO1oyhUXvbUZtd1RnaDBXN0U&authuser=0

adrianlzt commented 9 years ago

The cure is worse than the disease. With gearmand 1.0.6, if no gearmand is available, mod_gearman_neb keeps creating sockets until it exhausts its file descriptors. Also, even with a gearmand running, Icinga gets stuck when it tries to execute an active check.
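One way to watch for this kind of socket leak is to count the open file descriptors of the affected process. This is a rough, Linux-only sketch that reads `/proc/<pid>/fd`. The helper name `count_socket_fds` is hypothetical, and it assumes you have permission to read the target process's fd directory (same user or root).

```python
import os


def count_socket_fds(pid):
    """Return (total_fds, socket_fds) for a process, read from /proc (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    total = 0
    sockets = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listing and reading it
        total += 1
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets
```

Running this repeatedly against the PID of the Icinga/Naemon process that loads mod_gearman_neb should show the socket count climbing steadily if the leak is happening.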

I am going back to gearmand 0.33 and using Pacemaker to provide HA.

sni commented 9 years ago

I haven't had good results with any gearman version > 0.33. Here is a list of working versions: http://labs.consol.de/nagios/mod-gearman/#_supported_dependencies

sni commented 9 years ago

Could you give the latest release a try? I fixed an issue with segfaults on connection errors.