sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Host Checks Hanging #133

Closed infraweavers closed 6 years ago

infraweavers commented 6 years ago

Hello,

We have a problem similar to #111, but in a well-defined scenario: we're using mod-gearman to create a form of HA between our naemon servers. When we restart our 'slave' omd instance, the host checks on the master behave as in #111. In the logs we see the host check execute, but for a hanging host we never see "host job completed" in the neb log. If we go into the GUI and force a check of the host, the problem goes away. So it looks like is_executing on the host is never being set back to FALSE.

We run OMD-Labs-2.70 with naemon core, configured like this:

    omd config set GEARMAND on
    omd config set GEARMAND_PORT 0.0.0.0:4730
    omd config set GEARMAN_WORKER on
    omd config set LIVESTATUS_TCP on
    omd config set LIVESTATUS_TCP_PORT 6557
    omd config set MOD_GEARMAN on
    omd config set PNP4NAGIOS gearman
    omd config set THRUK_COOKIE_AUTH off

We have 2 servers (omd1, omd2), configured using keepalived so that only one of the two executes active checks at any point in time:

on omd1:

    OMD[default]:~/etc/naemon/naemon.d$ cat z_keepalived.cfg
    use_retained_program_state=1
    enable_notifications=1
    execute_service_checks=1
    execute_host_checks=1

on omd2:

    OMD[default]:~/etc/naemon/naemon.d$ cat z_keepalived.cfg
    use_retained_program_state=0
    enable_notifications=0
    execute_service_checks=0
    execute_host_checks=0

We have gearman-worker configured to duplicate the results, so that both servers get an up-to-date view and perfdata (and therefore pnp4nagios graphs).

so on omd1:

    OMD[default]:~/etc/mod-gearman$ cat worker.cfg  | grep dupserver
    dupserver=omd2:4730
    # Use dup_results_are_passive to set if the duplicate result send to the dupserver

and on omd2:

    OMD[default]:~/etc/mod-gearman$ cat worker.cfg  | grep dupserver
    dupserver=omd1:4730
    # Use dup_results_are_passive to set if the duplicate result send to the dupserver

We've created 100 hosts that only ping localhost to test this, and basically every time we run omd restart on omd2, some or all of the checks on omd1 hang.

If we set max-jobs=1 in worker.cfg, the problem appears to go away, but then we lose most of the benefit of gearman.

We have reproduced this on multiple production instances and on test VMs. If there is any more information to gather or tests to run, I'll be happy to do so, but I'm at a bit of a loss as to where to go from here to debug it. The results just vanish!

infraweavers commented 6 years ago

I've attached some snippets of the logs. You can see that at 2018-05-02 10:07:41 the worker has executed the check and put the result on the queue:

    [2018-05-02 10:07:41][17105][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)

However, in the neb log we don't see any mention of localhost58 at any time after 2018-05-02 10:07:37.
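For anyone wanting to repeat this kind of check on their own logs, here is a rough Python sketch that finds the last timestamp at which a host name appears. It assumes the neb log uses the same `[timestamp][pid][LEVEL] message` prefix as the worker line above; the function name `last_mention` and the sample lines are made up for illustration and are not real mod_gearman output.

```python
import re
from datetime import datetime

# Prefix format taken from the worker log line quoted above:
# [YYYY-MM-DD HH:MM:SS][pid][LEVEL] message
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\]\s*(.*)$"
)


def last_mention(lines, host):
    """Return the latest timestamp whose message mentions `host`, or None."""
    # Word boundaries keep "localhost58" from matching "localhost580".
    host_re = re.compile(r"\b" + re.escape(host) + r"\b")
    latest = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match or not host_re.search(match.group(4)):
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if latest is None or stamp > latest:
            latest = stamp
    return latest


# Hypothetical sample lines; the message text is invented for this example.
sample = [
    "[2018-05-02 10:07:37][17105][TRACE] made-up message about localhost58",
    "[2018-05-02 10:07:41][17105][TRACE] made-up message about localhost580",
    "[2018-05-02 10:07:43][17105][TRACE] made-up message about localhost12",
]
print(last_mention(sample, "localhost58"))
```

To run it against a real file, pass an open file object instead of `sample`, since the function only needs an iterable of lines.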

worker-snippet.log neb-snippet.log

Edit: I've also added the full neb log for 10:07:40->43 in case you can spot something related to the results queue that I can't see.

neb-log-at_2018-05-02_10-07-41.log

infraweavers commented 6 years ago

We've managed to track down what is happening.

When the dupserver connection fails, or there is a problem adding an item to its queue, the code resets the client and creates a new one (https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L207) using the server_list from the dupserver client. However, create_client explicitly overwrites current_client with the client it has just created. So after a failure on the dupserver, the dupserver becomes the server that receives all the active checks, and the original server never receives results again.
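To make the failure mode easier to follow, here is a deliberately simplified Python model of the state handling described above. It is not the actual C code from gearman_utils.c: only the names `create_client`, `current_client` and `server_list` come from the description, while the `Client` class, the `set_current` flag and the `demo` function are assumptions made for this sketch. The "fixed" variant shows one possible way to avoid the overwrite, not necessarily what the PR will do.

```python
from dataclasses import dataclass

# Module-level state standing in for the shared current_client described above.
current_client = None


@dataclass
class Client:
    """Stand-in for a gearman client handle; it only remembers its servers."""
    server_list: tuple


def create_client(server_list, set_current=True):
    """Create a client; by default also overwrite the shared current_client."""
    global current_client
    client = Client(tuple(server_list))
    if set_current:  # hypothetical flag, only used to contrast the two behaviours
        current_client = client
    return client


def demo(fixed):
    """Return the servers current_client points at after a dupserver reconnect."""
    global current_client
    main = Client(("omd1:4730",))
    dup = Client(("omd2:4730",))
    current_client = main  # results normally go to the main server

    # Simulated dupserver failure: the dup client is rebuilt from its own
    # server_list. In the buggy variant this also replaces current_client.
    dup = create_client(dup.server_list, set_current=not fixed)
    return current_client.server_list


print("buggy:", demo(fixed=False))
print("fixed:", demo(fixed=True))
```

In the buggy variant the shared client ends up pointing at omd2, which matches the symptom of omd1 never seeing results again until something forces a new check.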

We'll submit a PR fixing this tomorrow @sni