Closed wleese closed 7 years ago
I also had some crashes (mainly with 4.3.2) when using mod_gearman. Compiling my own mod_gearman-3.0.5 after having replaced the include-files in includes/nagios4 and include/nagios4/lib with the latest from the Nagios 4.3.2 sources seems to have solved the issue. As of today it has been running for about two days without issue (before I had a restart about every second minute).
@dan-m-joh
[1501066933] mod_gearman: initialized version 3.0.5 (libgearman 1.1.12)
[1501066933] Event broker module '/usr/lib64/mod_gearman/mod_gearman_nagios4.o' initialized successfully.
snip
[1501067007] Caught SIGTERM, shutting down...
Ran the submit passive result again and it's broken again.
New backtrace with more debug symbols:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5088521700 (LWP 16684)]
0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
1145 gettimeofday(&core_time,NULL);
(gdb) bt
#0 0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
#1 0x00007f5088703aff in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7f5088520a00) at nebmods.c:529
#2 0x00007f5088701ee0 in broker_service_check (type=type@entry=701, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x7f508a1d56a0, check_type=1,
start_time=..., end_time=..., cmd=cmd@entry=0x0, latency=0.79036099999999998, exectime=<optimized out>, timeout=60, early_timeout=0, retcode=0,
cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x7f5088520c20) at broker.c:326
#3 0x00007f508870dbd3 in handle_async_service_check_result (temp_service=temp_service@entry=0x7f508a1d56a0,
queued_check_result=queued_check_result@entry=0x7f5088520c20) at checks.c:1024
#4 0x00007f508871606e in process_passive_service_check (check_time=check_time@entry=1501067781,
host_name=host_name@entry=0x7f5068001a80 "xxxxxhostname",
svc_description=svc_description@entry=0x7f5068001bd0 "xxxxx",
return_code=return_code@entry=0, output=output@entry=0x7f50680019e0 "JL: Due to flexclone|") at commands.c:2272
#5 0x00007f5088716193 in cmd_process_service_check_result (cmd=cmd@entry=30, check_time=check_time@entry=1501067781,
args=args@entry=0x7f5068001ae0 "xxxhostname;xxxxservice;0;JL: Due to flexclone|")
at commands.c:2207
#6 0x00007f508871aa3c in process_external_command2 (cmd=cmd@entry=30, entry_time=entry_time@entry=1501067781,
args=args@entry=0x7f5068001ae0 "xxxxhostname;xxxxservice;0;JL: Due to flexclone|")
at commands.c:1215
#7 0x00007f508871bc15 in process_external_command1 (cmd=<optimized out>) at commands.c:902
#8 0x00007f5086493cd4 in Store::answerCommandRequest(char const*) () from /usr/lib64/check_mk/livestatus.o
#9 0x00007f50864942bd in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/check_mk/livestatus.o
#10 0x00007f50864939e6 in store_answer_request () from /usr/lib64/check_mk/livestatus.o
#11 0x00007f50864d2540 in client_thread () from /usr/lib64/check_mk/livestatus.o
#12 0x00007f508748bdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f5087cf976d in clone () from /lib64/libc.so.6
(gdb)
I used the same procedure (also had to add that "extra" header-file). I see one difference though - I am using an older version of libgearman:
[1500890242] mod_gearman: initialized version 3.0.5 (libgearman 0.33)
[1500890242] Event broker module '/usr/lib64/mod_gearman/mod_gearman_nagios4.o' initialized successfully.
D/\N
I see you are using Thruk. Are you using the Business-Process feature of Thruk? If yes, are you sure that you are using "spool_dir =" and not "result_backend =" in the Thruk configuration. Using "result_backend =" here crashes Nagios as well (looks like a problem with mk_livestatus when you "push" data back to Nagios, could be that mk_livestatus is compiled with the "wrong" Nagios-headers as well). D/\N
Are you using the Business-Process feature of Thruk
I honestly do not know. We use Thruk as a simple interface for Nagios to schedule downtime, enable/disable notifications and sometimes submit passive results.
are you sure that you are using "spool_dir =" and not "result_backend ="
<Component Thruk::Backend>
<peer>
name = Core
type = livestatus
<options>
peer = /var/spool/nagios/cmd/live
resource_file = /etc/nagios/private/resource.cfg
</options>
<configtool>
core_conf = /etc/nagios/nagios.cfg
obj_check_cmd = /usr/sbin/nagios -v /etc/nagios/nagios.cfg
obj_reload_cmd = systemctl reload nagios
</configtool>
</peer>
</Component>
#####################################
# Business Process
<Component Thruk::Plugin::BP>
# Results will be send back by using the spool folder.
# This folder should point to the 'check_result_path' of your core.
spool_dir = /srv/nagios/checkresults
result_backend is not configured:
# grep result_backend thruk*.conf
thruk.conf: #result_backend = sitename
thruk.conf: # and result_backend is used, the restart will be triggered as external command.
It's curious how the backtrace points to gettimeofday(). Inspecting the parameters shows no issue (for someone who doesn't know C or much debugging) :
(gdb) frame 0
#0 0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
1145 gettimeofday(&core_time,NULL);
(gdb) print &core_time
$1 = (struct timeval *) 0x7f5088520950
(gdb) p /s core_time
$2 = {tv_sec = 4294967296, tv_usec = 8255572783}
I'm probably just not aware of gdb's quirks in this area.
If I interpret this correctly:
$2 = {tv_sec = 4294967296, tv_usec = 8255572783}
$ date -d @4294967296
Sun Feb 7 07:28:16 CET 2106
That is some time in the future ;-) And tv_usec > 999999 also does not look right.
This looks like some memory overrun somewhere. I can not say if this is from mod_gearman or Thruk.
D/\N
What's the next troubleshooting step? Would a core dump be useful? Some output from valgrind?
Sorry, I can not help you any further. I am "only" a "normal" user like you.
D/\N
I can confirm that up to broker_service_check() there is no corruption in the ds->timestamp field.
(gdb) frame 2
#2 0x00007f5088701ee0 in broker_service_check (type=type@entry=701, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x7f508a1d56a0, check_type=1, start_time=..., end_time=..., cmd=cmd@entry=0x0, latency=0.79036099999999998, exectime=<optimized out>, timeout=60, early_timeout=0, retcode=0,
cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x7f5088520c20) at broker.c:326
326 return_code = neb_make_callbacks(NEBCALLBACK_SERVICE_CHECK_DATA, (void *)&ds);
(gdb) bt
#0 0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
#1 0x00007f5088703aff in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7f5088520a00) at nebmods.c:529
#2 0x00007f5088701ee0 in broker_service_check (type=type@entry=701, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x7f508a1d56a0, check_type=1, start_time=..., end_time=..., cmd=cmd@entry=0x0, latency=0.79036099999999998, exectime=<optimized out>, timeout=60, early_timeout=0, retcode=0,
cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x7f5088520c20) at broker.c:326
#3 0x00007f508870dbd3 in handle_async_service_check_result (temp_service=temp_service@entry=0x7f508a1d56a0, queued_check_result=queued_check_result@entry=0x7f5088520c20) at checks.c:1024
#4 0x00007f508871606e in process_passive_service_check (check_time=check_time@entry=1501067781, host_name=host_name@entry=0x7f5068001a80 "xxxx", svc_description=svc_description@entry=0x7f5068001bd0 "xxxx",
return_code=return_code@entry=0, output=output@entry=0x7f50680019e0 "JL: Due to flexclone|") at commands.c:2272
#5 0x00007f5088716193 in cmd_process_service_check_result (cmd=cmd@entry=30, check_time=check_time@entry=1501067781, args=args@entry=0x7f5068001ae0 "xxx;xxx;0;JL: Due to flexclone|") at commands.c:2207
#6 0x00007f508871aa3c in process_external_command2 (cmd=cmd@entry=30, entry_time=entry_time@entry=1501067781, args=args@entry=0x7f5068001ae0 "xxx;xxx;0;JL: Due to flexclone|") at commands.c:1215
#7 0x00007f508871bc15 in process_external_command1 (cmd=<optimized out>) at commands.c:902
#8 0x00007f5086493cd4 in Store::answerCommandRequest(char const*) () from /usr/lib64/check_mk/livestatus.o
#9 0x00007f50864942bd in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/check_mk/livestatus.o
#10 0x00007f50864939e6 in store_answer_request () from /usr/lib64/check_mk/livestatus.o
#11 0x00007f50864d2540 in client_thread () from /usr/lib64/check_mk/livestatus.o
#12 0x00007f508748bdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f5087cf976d in clone () from /lib64/libc.so.6
(gdb) p /s ds
$5 = {type = 701, flags = 0, attr = 0, timestamp = {tv_sec = 1501067781, tv_usec = 790416}, host_name = 0x7f508b13dd70 "xxxx", service_description = 0x7f508a1d59f0 "xxxx", check_type = 1, current_attempt = 1, max_attempts = 3,
state_type = 1, state = 0, timeout = 60, command_name = 0x0, command_args = 0x0, command_line = 0x0, start_time = {tv_sec = 1501067781, tv_usec = 0}, end_time = {tv_sec = 1501067781, tv_usec = 0}, early_timeout = 0, execution_time = 0, latency = 0.79036099999999998, return_code = 0,
output = 0x7f5068001c60 "JL: Due to flexclone", long_output = 0x0, perf_data = 0x0, check_result_ptr = 0x7f5088520c20, object_ptr = 0x7f508a1d56a0}
(gdb)
I'll add some debug logging tomorrow to see if I can catch where it gets corrupted.
This sounds familiar, we fixed something like that a couple of years ago in naemons livestatus module: https://github.com/naemon/naemon-livestatus/commit/a72a5b59fe7ab1de5b9d18fc89489c6e0b27a5a4 There is nothing mod-gearman or thruk can do about it, this should be fixed in the livestatus module.
@sni so neb module calling a neb module causes corruption somewhere? Has an issue been created elsewhere that tracks the underlying issue?
The problem here is that livestatus, calls process_external_command1
from a separate thread which then corrupts the memory. The Naemon/Nagios core is not threadsafe. The workaround was
to send the command to the queryhandler, so Naemon can read/process the command from inside the main thread.
In other words Thruk (requires livestatus) + Nagios + mod_gearman should be an unsupported setup?
It should work mostly, just don't submit passive check results via livestatus until this is fixed.
haha, well, easier said than done. This problem didn't exist in Nagios3, so we have hundreds of users who could at any time submit a passive result :)
I have just checked the latest mk_livestatus sources (1.2.8p25 and 1.4.0p9) and in both are still using "process_external_command1((char *)command);". (:-(
Yeah, I'm monkey patching now.
Closing this as it's clearly not mod_gearman to blame. Here's my monkey patch:
/mk-livestatus-1.2.8p25/src git:(master) ✗ diff -u Store.cc.org Store.cc
--- Store.cc.org 2017-08-01 10:55:37.555389499 +0200
+++ Store.cc 2017-08-01 12:35:21.541926533 +0200
@@ -21,6 +21,10 @@
// License along with GNU Make; see the file COPYING. If not, write
// to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
// Boston, MA 02110-1301 USA.
+extern "C" {
+ #include "nagios4/libnagios.h"
+}
+
#include "Store.h"
#include <string.h>
@@ -34,6 +38,7 @@
#include "global_counters.h"
#include "logger.h"
#include "strutil.h"
+#include <fstream>
// TODO(sp): Remove this hack.
#ifdef EXTERN
@@ -176,10 +181,46 @@
return output->doKeepalive();
}
+/* define a fake iobroker_register function for the libnagios call */
+static void fake_iobreg(int fdout, int fderr, void *arg) { }
+
void Store::answerCommandRequest(const char *command) {
lock_guard<mutex> lg(_command_mutex);
#ifdef NAGIOS4
- process_external_command1((char *)command);
+ const char *nagioscmd_file = "/var/spool/nagios/cmd/nagios.cmd";
+
+ if (std::ifstream(nagioscmd_file)) {
+ int BUFFER = 128;
+ char *cmd;
+ int pfd[2] = {-1, -1};
+ int pfderr[2] = {-1, -1};
+ int fake_iobregarg = 0;
+ int fd;
+ char *out = (char*)calloc(1, BUFFER);
+
+ asprintf(&cmd, "echo \"%s\" > %s", command, nagioscmd_file);
+ fd = runcmd_open(cmd, pfd, pfderr, NULL, fake_iobreg, &fake_iobregarg);
+
+ /* get the output from the stdout file descriptor into the out var */
+ read(pfd[0], out, BUFFER);
+
+ runcmd_close(fd);
+
+ if (g_debug_level > 0) {
+ logger(LG_INFO,
+ "External Command redirected to %s: %s",
+ nagioscmd_file, cmd);
+ }
+
+ /* house-keeping */
+ free(cmd);
+ free(out);
+ close(pfd[0]);
+ close(pfderr[0]);
+ close(fd);
+ } else {
+ logger(LG_INFO, "External Command file missing");
+ }
#else
int buffer_items = -1;
/* int ret = */
I've been trying to track down an issue where a user submits a passive result through Thruk with a mod_gearman enabled Nagios 4.2.4 setup. I logged a Thruk issue a few moments back before I was able to reproduce the issue somewhat: https://github.com/sni/Thruk/issues/741
Here's where I went first however, which contains a fair bit of information https://github.com/NagiosEnterprises/nagioscore/issues/391
Versions:
The backtrace:
Far from being knowledgeable on the topic, but it seems that the actual crash is happening in mod_gearman.