Closed by ojomch 7 years ago
A Mod-Gearman worker does not send the result back to the requesting Naemon core. Instead, it just puts the result in the result queue (named check_results by default), and the core fetches the result from there. So if two cores are working on that queue, it's totally random which one will receive the result.
But you can set the result queue in each of your NEB module configs to a unique queue, and then that one will be used. Just set result_queue=check_results_naemon1 in one core and result_queue=check_results_naemon2 in the other one.
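To make the effect concrete, here is a minimal sketch in Python that models the behaviour described above. It is a toy simulation, not Mod-Gearman code: the function `simulate`, the core names, and the use of a random pick to stand in for "whichever core fetches first" are all assumptions made for illustration.

```python
import random
from collections import defaultdict, deque


def simulate(result_queue_of, jobs_per_core=100, seed=0):
    """Count, per core, how many results it received that belonged to another core.

    result_queue_of maps a (hypothetical) core name to its result_queue setting.
    """
    rng = random.Random(seed)
    queues = defaultdict(deque)

    # Workers: each finished check is pushed into the result queue
    # configured for the core that submitted the job.
    for core, queue_name in result_queue_of.items():
        for job_id in range(jobs_per_core):
            queues[queue_name].append((core, job_id))

    misrouted = {core: 0 for core in result_queue_of}

    # Cores: every core listening on a queue competes for its results.
    # A random pick models "whoever fetches first" (an assumption).
    for queue_name, queue in queues.items():
        consumers = [c for c, q in result_queue_of.items() if q == queue_name]
        while queue:
            origin, _ = queue.popleft()
            receiver = rng.choice(consumers)
            if receiver != origin:
                misrouted[receiver] += 1
    return misrouted


shared = simulate({"naemon1": "check_results", "naemon2": "check_results"})
unique = simulate({"naemon1": "check_results_naemon1",
                   "naemon2": "check_results_naemon2"})
print("shared queue, misrouted results:", shared)
print("unique queues, misrouted results:", unique)
```

With a shared queue, a share of the results ends up at the wrong core, which would match the symptom of checks whose last check time never updates. With one queue per core, nothing is misrouted in this model.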
Awesome @sni, you really made our day, and we are very happy to see your suggestion. BTW, can we use the latest version of gearman, which is 1.1.15, or should we stick to 0.33, which is in the Consol Labs repo? Is there any reason why gearmand/server are still at 0.33 in Consol Labs?
I'd stick with 0.33. I had issues with newer releases a few years ago and have not had much time to look into recent releases.
I totally missed this part in the documentation: https://labs.consol.de/nagios/mod-gearman/#_configuration
'''
result_queue sets the result queue. Necessary when putting jobs from several Naemon instances onto the same gearman queues. Default: check_results
result_queue=check_results_naemon1
'''
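As a quick sanity check before putting several Naemon instances on the same job server, something like the following Python sketch could flag cores that would end up sharing a result queue. It assumes the module config is plain key=value text with # comments; the helper names and the sample config texts are hypothetical, and only the result_queue option and its check_results default come from the documentation quoted above.

```python
def result_queue(config_text, default="check_results"):
    """Return the result_queue value from key=value config text, or the default."""
    value = default
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "result_queue":
            value = val.strip()  # the last occurrence wins in this sketch
    return value


def shared_result_queues(configs):
    """Map each queue name used by more than one core to the cores using it."""
    by_queue = {}
    for core, text in configs.items():
        by_queue.setdefault(result_queue(text), []).append(core)
    return {q: cores for q, cores in by_queue.items() if len(cores) > 1}


# Hypothetical config texts for two cores; naemon2 forgot to set result_queue.
configs = {
    "naemon1": "# core 1\nresult_queue=check_results_naemon1\n",
    "naemon2": "# core 2\n",
}
print(shared_result_queues(configs))  # no conflict: naemon2 uses the default alone

configs["naemon1"] = "# core 1\n"
print(shared_result_queues(configs))  # both cores now fall back to check_results
```

An empty result means every core has its own queue. Any entry in the returned dict names a queue that two or more cores would compete on.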
Hello, We’re attempting to use Gearman to efficiently manage the timely runs of our Naemon checks. However, we’re running into an issue where it appears the Job Server is not sending the results received from the workers back to the appropriate client (the clients in this case being our Naemon Servers). Here’s a brief description of how our test setup is laid out:
All our servers are running Red Hat 7 (el7).
On our Naemon Servers, we have the following Gearman packages:
we also tried -
On our Job Server, we have the following Gearman package:
we also tried -
On our Worker nodes, we have the following Gearman packages installed:
also tried -
We verified that there is connectivity between the Clients, Job Server and the Workers.
However, we noticed a problem: whenever we reschedule a slew of checks on either of our Naemon servers, only a few of those checks run successfully. Most of the checks do not run. The “Last Check Time” on the checks that don’t run remains the same as it was before rescheduling. From the logs, we see that the jobs from the clients are reaching the workers successfully, and the workers are running them without any issues. The issue appears to be in the flow after that. That is, when the workers finish running the jobs and pass their results back to the Job Server, the Job Server seems unable to link each job to the Naemon server (client) it originated from. The diagram below depicts how we have things set up in our Test environment:
Once we iron out the issue described in the previous paragraphs, we intend to have 2 or more Job Servers, all linked to a VIP. The VIP name/address will be referenced in the necessary gearman configurations on the Clients (Naemon Servers). For example, in our prod environment, we will have 2 Clients (Naemon Servers), 2 Job Servers and 6 workers. The 2 Job Servers will sit behind a VIP, and that VIP will be referenced on each of the 2 Clients (Naemon Servers).
After doing a search online, we came upon a post that appears to be the closest to what we're trying to do:
However, this post was made almost 4 years ago, and we're not sure if the issue described in it has been revisited since. We’re hoping that, after reviewing the explanation of our problem and the setup we have described, you can point us in the right direction so we can go ahead and move to prod. Thank you,