Closed deepak-kosaraju closed 8 years ago
What is the value of `$?` after running the plugin command? Run the command as the same user that is configured for mod_gearman, followed by `echo $?`.
@ricardomaraschini Thanks for the comment. The question is why exit code 255 should be treated as CRITICAL when the code says these should be considered UNKNOWN: https://github.com/sni/mod_gearman/blob/master/t/03-exec_checks.c#L220-L268
Here is our problem. The following error has nothing to do with an application issue, but the application team still gets a ticket (we generate tickets for CRITICAL status only), and it's unnecessary noise.
-bash-4.1$ id
uid=389(nagios) gid=995(nagios) groups=995(nagios),6000(psp),10005(def)
-bash-4.1$ /usr/lib64/nagios/plugins/check_nrpe -n -H 1.2.3.4 -c check_app_queue;echo -e "\nexit code: $?"
connect to address 1.2.3.4 port 5666: No route to host
connect to host 1.2.3.4 port 5666: No route to host
exit code: 255
try enabling the option `workaround_rc_25` in the worker configuration and see what happens.
https://github.com/sni/mod_gearman/blob/master/common/check_utils.c#L380-L382
i cannot really remember why, but i guess vanilla nagios behaves the same way, and i try to stay as close to it as possible.
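To make the behaviour under discussion concrete, here is a minimal sketch of how a plugin exit code ends up as a state. It assumes the standard plugin convention (0 to 3 map to OK, WARNING, CRITICAL, UNKNOWN) and the "out of bounds" handling reported in this thread. The function name `classify_rc` is made up for illustration and is not taken from mod_gearman or Nagios.

```shell
# Sketch only: mirrors the behaviour described in this thread,
# not the actual mod_gearman or Nagios source.

# classify_rc RC -> prints the state text the check result ends up with.
classify_rc() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        # Anything else is "out of bounds" and is reported as CRITICAL,
        # which is why exit code 255 ends up opening a ticket.
        *) echo "CRITICAL: Return code of $1 is out of bounds." ;;
    esac
}
```

For example, `classify_rc 255` prints the same "out of bounds" text that shows up in the worker output, while `classify_rc 3` prints `UNKNOWN`.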
@ricardomaraschini Thanks for the tip. I gave it a shot, but it didn't help.
I found this in the official mod_gearman documentation:
workaround_rc_25
Duplicate jobs from gearmand sometimes result in exit code 25 of plugins because they are executed twice and get killed because of using the same resource. Sending results (when exit code is 25) will be skipped with this enabled. Only needed if you experience problems with plugins exiting with exit code 25 randomly. Default is off.
workaround_rc_25=off
DEBUG Output:
[2015-12-21 11:03:55][8436][TRACE] do_exec_job()
[2015-12-21 11:03:55][8436][DEBUG] got service job: field1.example.com - ISS Process
[2015-12-21 11:03:55][8436][TRACE] timeout: 120, core latency: 0
[2015-12-21 11:03:55][8436][TRACE] command: /usr/lib64/nagios/plugins/check_nrpe -n -u -t 20 -H 1.2.3.4 -c check_proc_test
[2015-12-21 11:03:55][8436][TRACE] execute_safe_command()
[2015-12-21 11:03:55][8436][TRACE] using execvp, no shell characters found
[2015-12-21 11:03:55][8436][DEBUG] check exited with exit code > 3. Exit: 255
[2015-12-21 11:03:55][8436][DEBUG] stdout:
[2015-12-21 11:03:55][8436][TRACE] send_result_back()
[2015-12-21 11:03:55][8436][TRACE] queue: check_results
[2015-12-21 11:03:55][8436][TRACE] data:
host_name=field1.example.com
core_start_time=1450721035.0
start_time=1450721035.146718
finish_time=1450721035.151759
return_code=255
exited_ok=1
service_description=ISS Process
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 1.2.3.4 port 5666: Connection refused\nconnect to host 1.2.3.4 port 5666: Connection refused]
[2015-12-21 11:03:55][8436][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
[2015-12-21 11:03:55][8436][TRACE] 434 --->host_name=field1.example.com
core_start_time=1450721035.0
start_time=1450721035.146718
finish_time=1450721035.151759
return_code=255
exited_ok=1
service_description=ISS Process
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 1.2.3.4 port 5666: Connection refused\nconnect to host 1.2.3.4 port 5666: Connection refused]
[2015-12-21 11:03:30][14682][TRACE] do_exec_job()
[2015-12-21 11:03:30][14682][DEBUG] got service job: field3.example.com - SSH
[2015-12-21 11:03:30][14682][TRACE] timeout: 120, core latency: 0
[2015-12-21 11:03:30][14682][TRACE] command: /usr/lib64/nagios/plugins/check_ssh 2.2.3.4
[2015-12-21 11:03:30][14682][TRACE] execute_safe_command()
[2015-12-21 11:03:30][14682][TRACE] using execvp, no shell characters found
[2015-12-21 11:03:30][8436][DEBUG] check exited with exit code > 3. Exit: 255
[2015-12-21 11:03:30][8436][DEBUG] stdout:
[2015-12-21 11:03:30][8436][TRACE] send_result_back()
[2015-12-21 11:03:30][8436][TRACE] queue: check_results
[2015-12-21 11:03:30][8436][TRACE] data:
host_name=field2.example.com
core_start_time=1450721006.0
start_time=1450721006.554201
finish_time=1450721010.455504
return_code=255
exited_ok=1
service_description=Load Average
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 5.6.7.8 port 5666: No route to host\nconnect to host 5.6.7.8 port 5666: No route to host]
[2015-12-21 11:03:30][8436][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
[2015-12-21 11:03:30][8436][TRACE] 431 --->host_name=field2.example.com
core_start_time=1450721006.0
start_time=1450721006.554201
finish_time=1450721010.455504
return_code=255
exited_ok=1
service_description=Load Average
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 5.6.7.8 port 5666: No route to host\nconnect to host 5.6.7.8 port 5666: No route to host]
@sni
I hope you noticed the problem we are having. Can you point us in the right direction for patching the code so that exit code 255 is reported as UNKNOWN?
As far as I can see in your logs, the return code (255) is being sent to Nagios (at least to the check_results queue), exactly as expected.
I went further and simulated the behavior with and without mod_gearman enabled, and the result is the same either way.
So the problem is definitely not in mod_gearman itself; it comes straight from the core's code. Please look here:
https://github.com/NagiosEnterprises/nagioscore/blob/master/base/checks.c#L365-L378
@ricardomaraschini Thanks for the quick response. I will look at the Nagios code above and see what can best be done for our scenario. I really appreciate your time and tips.
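One option that avoids patching either code base is a small wrapper around the plugin that remaps out-of-range exit codes to UNKNOWN (3) before the worker sees them. Nobody in this thread proposed this, so treat the following as a sketch under that assumption. The function name `run_as_unknown` and the message text are hypothetical.

```shell
# Hypothetical wrapper: run_as_unknown CMD [ARGS...]
# Runs the plugin and passes exit codes 0-3 through unchanged;
# any other exit code (e.g. 255 from a failed connect) becomes 3 (UNKNOWN).
run_as_unknown() {
    "$@"
    plugin_rc=$?
    case "$plugin_rc" in
        0|1|2|3)
            return "$plugin_rc"
            ;;
        *)
            # Printed after whatever the plugin itself wrote to stdout.
            echo "UNKNOWN: plugin exited with out-of-range code $plugin_rc"
            return 3
            ;;
    esac
}
```

Saved in a script that ends with `run_as_unknown "$@"; exit $?`, this could be placed in front of the existing check command, for example in front of `/usr/lib64/nagios/plugins/check_nrpe -n -H 1.2.3.4 -c check_app_queue`. The trade-off is that genuine plugin crashes would also show up as UNKNOWN rather than CRITICAL.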
@sni Any idea why the following UNKNOWN cases exit as CRITICAL? https://github.com/sni/mod_gearman/blob/master/t/03-exec_checks.c#L220-L268