sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

UNKNOWN exit code as CRITICAL? #81

Closed deepak-kosaraju closed 8 years ago

deepak-kosaraju commented 8 years ago

@sni Any idea why the following results, which look like UNKNOWN, are reported as CRITICAL? https://github.com/sni/mod_gearman/blob/master/t/03-exec_checks.c#L220-L268

Example1:

Info:\n\nCRITICAL: Return code of 255 is out of bounds. (worker: gearman002.example.com)\n\n  Long Service Output: \\n[connect to address 1.2.3.4 port 5666: Connection refused\\nconnect to host 1.2.3.4 port 5666: Connection refused] \n\n  

Example2:

I, [2015-12-15T07:11:38.519553 #21661]  INFO -- : Nagios-Jira - Field Incident: 2234748 - Free Space All Disks on host1.example.com is CRITICAL, Request is: 
....
Additional Info:\n\nCRITICAL: Return code of 255 is out of bounds. (worker: gearman001.example.com)\n\n  
Long Service Output: \\n[connect to address 1.2.3.4 port 5666: No route to host\\nconnect to host 1.2.3.4 port 5666: No route to host] \n\n  
ricardomaraschini commented 8 years ago

What is the value of $? after running the plugin command? Run the command as the same user that mod_gearman is configured to use, followed by `echo $?`.

deepak-kosaraju commented 8 years ago

@ricardomaraschini Thanks for the comment. The question is why exit code 255 should be reported as CRITICAL when the code says these results should be considered UNKNOWN: https://github.com/sni/mod_gearman/blob/master/t/03-exec_checks.c#L220-L268

`/*****

Here is our problem. The following error has nothing to do with an application issue, but the application team still gets a ticket (we generate tickets for CRITICAL status only), and it is unnecessary noise.

-bash-4.1$ id
uid=389(nagios) gid=995(nagios) groups=995(nagios),6000(psp),10005(def)
-bash-4.1$ /usr/lib64/nagios/plugins/check_nrpe -n -H 1.2.3.4 -c check_app_queue;echo -e "\nexit code: $?"
connect to address 1.2.3.4 port 5666: No route to host
connect to host 1.2.3.4 port 5666: No route to host
exit code: 255
ricardomaraschini commented 8 years ago

Try enabling the option:

workaround_rc_25

in the worker configuration and see what happens.

https://github.com/sni/mod_gearman/blob/master/common/check_utils.c#L380-L382

sni commented 8 years ago

i cannot really remember why, but i guess it is the same way in vanilla nagios and i try to stay as close as possible.

deepak-kosaraju commented 8 years ago

@ricardomaraschini Thanks for the tip. I gave it a shot, but it didn't help.

I found this in the official mod_gearman documentation:

workaround_rc_25
Duplicate jobs from gearmand result sometimes in exit code 25 of plugins because they are executed twice and get killed because of using the same resource. Sending results (when exit code is 25) will be skipped with this enabled. Only needed if you experience problems with plugins exiting with exit code 25 randomly. Default is off.

workaround_rc_25=off
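
Going only by the documentation text quoted above, the option affects results with exit code 25 and nothing else, which would explain why it changes nothing for exit code 255. A minimal sketch of that documented behaviour follows; `should_send_result` is a hypothetical helper written for illustration, not a function from mod_gearman.

```python
def should_send_result(return_code: int, workaround_rc_25: bool) -> bool:
    """Return False when a result would be skipped instead of sent back.

    Hypothetical helper: it models only the sentence in the documentation
    ("Sending results (when exit code is 25) will be skipped with this
    enabled"), not the actual worker code.
    """
    if workaround_rc_25 and return_code == 25:
        return False
    return True


# Exit code 255 is still sent whether or not the option is enabled.
for rc in (0, 25, 255):
    print(rc, should_send_result(rc, workaround_rc_25=True))
```

With the option on, only the result for exit code 25 is dropped; the result for 255 goes to the check_results queue as before.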

DEBUG Output:

[2015-12-21 11:03:55][8436][TRACE] do_exec_job()
[2015-12-21 11:03:55][8436][DEBUG] got service job: field1.example.com - ISS Process
[2015-12-21 11:03:55][8436][TRACE] timeout: 120, core latency: 0
[2015-12-21 11:03:55][8436][TRACE] command: /usr/lib64/nagios/plugins/check_nrpe -n -u -t 20 -H 1.2.3.4 -c check_proc_test
[2015-12-21 11:03:55][8436][TRACE] execute_safe_command()
[2015-12-21 11:03:55][8436][TRACE] using execvp, no shell characters found
[2015-12-21 11:03:55][8436][DEBUG] check exited with exit code > 3. Exit: 255
[2015-12-21 11:03:55][8436][DEBUG] stdout:
[2015-12-21 11:03:55][8436][TRACE] send_result_back()
[2015-12-21 11:03:55][8436][TRACE] queue: check_results
[2015-12-21 11:03:55][8436][TRACE] data:
host_name=field1.example.com
core_start_time=1450721035.0
start_time=1450721035.146718
finish_time=1450721035.151759
return_code=255
exited_ok=1
service_description=ISS Process
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 1.2.3.4 port 5666: Connection refused\nconnect to host 1.2.3.4 port 5666: Connection refused]

[2015-12-21 11:03:55][8436][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
[2015-12-21 11:03:55][8436][TRACE] 434 --->host_name=field1.example.com
core_start_time=1450721035.0
start_time=1450721035.146718
finish_time=1450721035.151759
return_code=255
exited_ok=1
service_description=ISS Process
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 1.2.3.4 port 5666: Connection refused\nconnect to host 1.2.3.4 port 5666: Connection refused]

[2015-12-21 11:03:30][14682][TRACE] do_exec_job()
[2015-12-21 11:03:30][14682][DEBUG] got service job: field3.example.com - SSH
[2015-12-21 11:03:30][14682][TRACE] timeout: 120, core latency: 0
[2015-12-21 11:03:30][14682][TRACE] command: /usr/lib64/nagios/plugins/check_ssh 2.2.3.4
[2015-12-21 11:03:30][14682][TRACE] execute_safe_command()
[2015-12-21 11:03:30][14682][TRACE] using execvp, no shell characters found
[2015-12-21 11:03:30][8436][DEBUG] check exited with exit code > 3. Exit: 255
[2015-12-21 11:03:30][8436][DEBUG] stdout:
[2015-12-21 11:03:30][8436][TRACE] send_result_back()
[2015-12-21 11:03:30][8436][TRACE] queue: check_results
[2015-12-21 11:03:30][8436][TRACE] data: 
host_name=field2.example.com
core_start_time=1450721006.0
start_time=1450721006.554201
finish_time=1450721010.455504
return_code=255
exited_ok=1
service_description=Load Average
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 5.6.7.8 port 5666: No route to host\nconnect to host 5.6.7.8 port 5666: No route to host]

[2015-12-21 11:03:30][8436][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
[2015-12-21 11:03:30][8436][TRACE] 431 --->host_name=field2.example.com
core_start_time=1450721006.0
start_time=1450721006.554201
finish_time=1450721010.455504
return_code=255
exited_ok=1
service_description=Load Average
output=(gearman001.example.com) - CRITICAL: Return code of 255 is out of bounds. (worker: nagios4.gearman001.example)\n\n[connect to address 5.6.7.8 port 5666: No route to host\nconnect to host 5.6.7.8 port 5666: No route to host]
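
The `data:` section in the trace above is a list of `key=value` lines. When digging through many such traces, a small parser makes it easier to spot results whose return code falls outside the usual plugin range of 0 to 3. This is only a sketch based on the fields visible in the log; `parse_result_payload` is a hypothetical name, and the sample is shortened from the trace.

```python
def parse_result_payload(payload: str) -> dict:
    """Split a key=value payload, as shown in the trace, into a dict."""
    fields = {}
    for line in payload.splitlines():
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value
    return fields


# Shortened sample taken from the debug output above.
sample = "\n".join([
    "host_name=field1.example.com",
    "return_code=255",
    "exited_ok=1",
    "service_description=ISS Process",
])

result = parse_result_payload(sample)
return_code = int(result["return_code"])
in_plugin_range = 0 <= return_code <= 3
print(result["host_name"], return_code, in_plugin_range)
```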

@sni I hope this shows the problem we are having today. Can you point us in the right direction to patch the code so that exit code 255 is reported as UNKNOWN?

Here is our problem. The following error has nothing to do with an application issue, but the application team still gets a ticket (we generate tickets for CRITICAL status only), and it is unnecessary noise.

ricardomaraschini commented 8 years ago

As far as I can see in your logs, the return code (255) is being sent to Nagios (at least to the check_results queue), exactly as expected.

I went further and simulated the behavior with and without mod_gearman enabled, and the result is the same either way.

So the problem is definitely not in mod_gearman itself; it comes straight from the core's code. Please look here:

https://github.com/NagiosEnterprises/nagioscore/blob/master/base/checks.c#L365-L378

deepak-kosaraju commented 8 years ago

@ricardomaraschini Thanks for the quick response. I will look at the Nagios code above and see what can best be done for our scenario. Really appreciate your time and tips.
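
One option that avoids patching either mod_gearman or the core is to wrap the plugin so that any exit code outside 0 to 3 is turned into 3 (UNKNOWN) before the worker ever sees it. The sketch below assumes that approach; `normalize_return_code` and `run_plugin` are hypothetical names, and none of this is code from mod_gearman or Nagios.

```python
import subprocess
import sys

STATE_UNKNOWN = 3  # conventional plugin exit code for UNKNOWN


def normalize_return_code(return_code: int) -> int:
    """Pass 0..3 through unchanged; map everything else to UNKNOWN."""
    if 0 <= return_code <= 3:
        return return_code
    return STATE_UNKNOWN


def run_plugin(argv):
    """Run the plugin command and return its normalized exit code.

    The plugin's stdout and stderr are inherited, so its output still
    reaches whatever invoked the wrapper.
    """
    completed = subprocess.run(argv)
    return normalize_return_code(completed.returncode)


# Demonstration with a child process that exits 255, as check_nrpe did above.
demo_command = [sys.executable, "-c", "import sys; sys.exit(255)"]
print(run_plugin(demo_command))
```

To use this as a real wrapper, the script would end with `sys.exit(run_plugin(sys.argv[1:]))` and the check command would be prefixed with the wrapper's path. Note that this hides every out-of-range exit code, including genuine plugin crashes, behind UNKNOWN.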