Naemon Gearman Issue - Job Server unable to send to Correct clients

ojomch commented 7 years ago

Hello, we’re attempting to use Gearman to manage the timely runs of our Naemon checks efficiently. However, we’re running into an issue where the Job Server does not appear to send the results received from the workers back to the appropriate client (the clients in this case being our Naemon servers). Here’s a brief description of how our test setup is laid out:

2 Naemon Servers (based on the Gearman documentation, these are also referred to as clients)
1 Job Server
2 Workers

All our servers are running Red Hat 7 (el7).

On our Naemon Servers, we have the following Gearman packages:

gearmand.x86_64 1:0.33-6 @labs_consol_stable
mod_gearman.x86_64 3.0.1-1.el7.centos @labs_consol_stable

we also tried -

gearmand-0.33-6.x86_64 @labs_consol_stable

On our Job Server, we have the following Gearman package:

latest-gearmand.x86_64 1.1.15-1 @/latest-gearmand-1.1.15_x86_64

we also tried -

gearmand-0.33-6.x86_64
gearmand-server-0.33-6.x86_64

On our Worker nodes, we have the following Gearman packages installed:

gearmand.x86_64 1:0.33-6 @labs_consol_stable
mod_gearman.x86_64 3.0.1-1.el7.centos

also tried -

gearmand-0.33-6.x86_64 @labs_consol_stable

We verified that there is connectivity between the Clients, Job Server and the Workers.

However, we noticed a problem: whenever we reschedule a batch of checks on either of our Naemon servers, only a few of those checks run successfully. Most of them do not run. The “Last Check Time” on the checks that don’t run stays the same as it was before rescheduling. From the logs, we see that the jobs from the clients reach the workers successfully, and the workers run them without any issues. The issue appears to be in the flow after that: when the workers finish running the jobs and pass their results back to the Job Server, the Job Server seems unable to link each job to the Naemon server (client) it originated from. The diagram below depicts how we have things set up in our test environment:

[diagram: current-test-setup]

Once we iron out the issue described above, we intend to have 2 or more Job Servers, all behind a VIP. The VIP name/address will be referenced in the Gearman configuration on the clients (Naemon servers). For example, in our prod environment we will have 2 clients (Naemon servers), 2 Job Servers and 6 workers. The 2 Job Servers will sit behind a VIP, and that VIP will be referenced on each of the 2 clients.

After doing a search online, we came upon a post that appears to be the closest to what we're trying to do:

https://groups.google.com/forum/#!topic/mod_gearman/clOc7w_s7e0

However, this post was made almost 4 years ago and we're not sure whether the issue described in it has been revisited since. We’re hoping that, after reviewing our problem and the setup described above, you can point us in the right direction so we can move to prod. Thank you,

sni commented 7 years ago

A Mod-Gearman worker does not send the result back to the requesting Naemon core. Instead, it just puts the result in the result queue (named check_results by default) and the core fetches the result from there. So if two cores are working on that queue, it's totally random which one will receive the result.
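To illustrate the behaviour described above, here is a toy simulation in Python. It is not Mod-Gearman code: the function names, the core names and the use of a random pick to stand in for "whichever core polls the queue first" are all assumptions made for the sketch.

```python
import random
from collections import deque


def run_shared_queue(jobs, cores, seed=0):
    """Simulate several cores fetching from one shared result queue.

    jobs: list of (origin_core, check_name) tuples (hypothetical names).
    Returns a list of (origin_core, receiving_core, check_name) tuples.
    """
    rng = random.Random(seed)
    shared = deque(jobs)  # every worker result lands in the same queue
    delivered = []
    while shared:
        origin, check = shared.popleft()
        # Assumption: any core may grab the next result, regardless of origin.
        receiver = rng.choice(cores)
        delivered.append((origin, receiver, check))
    return delivered


def misrouted(delivered):
    """Return the results that were fetched by a core that did not request them."""
    return [d for d in delivered if d[0] != d[1]]


cores = ["naemon1", "naemon2"]
jobs = [(core, f"check_{i}") for i in range(10) for core in cores]
delivered = run_shared_queue(jobs, cores)
print(f"{len(misrouted(delivered))} of {len(delivered)} results reached the wrong core")
```

In this model a result fetched by the wrong core never reaches the core that scheduled the check, which would match the unchanged "Last Check Time" reported above.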

But you can set the result queue in each of your NEB module configs to a unique queue, and then that one will be used. Just set result_queue=check_results_naemon1 in one core and result_queue=check_results_naemon2 in the other one.
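As a quick sanity check for the setting above, a small Python sketch like the following could flag cores that still share a result queue. The helper names, the sample config text and the host name jobserver.example are hypothetical; only result_queue and its default check_results come from this thread, and the parser assumes plain key=value lines with # comments.

```python
def parse_result_queue(conf_text, default="check_results"):
    """Return the result_queue value from key=value config text."""
    queue = default
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "result_queue":
            queue = value.strip()
    return queue


def find_shared_queues(configs):
    """Map each result queue to the cores using it, keeping only clashes."""
    users = {}
    for core, text in configs.items():
        users.setdefault(parse_result_queue(text), []).append(core)
    return {queue: cores for queue, cores in users.items() if len(cores) > 1}


# Hypothetical config snippets, one per Naemon core.
configs = {
    "naemon1": "server=jobserver.example:4730\nresult_queue=check_results_naemon1\n",
    "naemon2": "server=jobserver.example:4730\nresult_queue=check_results_naemon2\n",
}
clashes = find_shared_queues(configs)
print(clashes or "every core has its own result queue")
```

If a core leaves result_queue unset, the sketch treats it as using check_results, so two unconfigured cores would show up as a clash.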

deepak-kosaraju commented 7 years ago

Awesome @sni, you really made our day, and we're very happy to see your suggestion. BTW, can we use the latest version of gearman, which is 1.1.15, or should we stick to 0.33, which is in the Consol Labs repo? Any reason why gearmand/server are still at 0.33 in Consol Labs?

sni commented 7 years ago

I'd stick with 0.33. I had issues with newer releases a few years ago and have not had much time to look into recent releases.

deepak-kosaraju commented 7 years ago

Totally missed this part in the documentation: https://labs.consol.de/nagios/mod-gearman/#_configuration

'''
result_queue
sets the result queue. Necessary when putting jobs from several Naemon instances onto the same gearman queues. Default: check_results

result_queue=check_results_naemon1
'''

sni / mod_gearman

Naemon Gearman Issue - Job Server unable to send to Correct clients #116