Nagios 4.2.4 Crash with mod_gearman-3.0.4 when submitting passive result

wleese commented 7 years ago

I've been trying to track down an issue where a user submits a passive result through Thruk with a mod_gearman enabled Nagios 4.2.4 setup. I logged a Thruk issue a few moments back before I was able to reproduce the issue somewhat: https://github.com/sni/Thruk/issues/741

Here's where I went first however, which contains a fair bit of information https://github.com/NagiosEnterprises/nagioscore/issues/391

Versions:

gearmand-0.33-6.x86_64
mod_gearman-3.0.4-1.el7.centos.x86_64
libgearman-1.1.12-18.el7.x86_64
gearmand-server-0.33-6.x86_64
check-mk-livestatus-1.2.8p24-1.el7.x86_64

The backtrace:

(gdb) catch signal SIGSEGV
Catchpoint 1 (signal SIGSEGV)
(gdb) continue
Continuing.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f463ede8700 (LWP 29888)]
0x00007f463e4f7cad in write () from /lib64/libc.so.6
(gdb) continue
Continuing.

Program received signal SIGPIPE, Broken pipe.
0x00007f463e4f7cad in write () from /lib64/libc.so.6
(gdb) continue
Continuing.
^[[A

Catchpoint 1 (signal SIGSEGV), 0x00007f463debabf3 in handle_svc_check (event_type=6, data=0x7f463ede7a00)
    at neb_module_nagios4/../neb_module/mod_gearman.c:1141
1141    neb_module_nagios4/../neb_module/mod_gearman.c: No such file or directory.
(gdb) continue
Continuing.

Catchpoint 1 (signal SIGSEGV), 0x00007f463debabf3 in handle_svc_check (event_type=6, data=0x7f463ede7a00)
    at neb_module_nagios4/../neb_module/mod_gearman.c:1141
1141    in neb_module_nagios4/../neb_module/mod_gearman.c

(gdb) bt
#0  0x00007f463debabf3 in handle_svc_check (event_type=6, data=0x7f463ede7a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1141
#1  0x00007f463ef10aff in neb_make_callbacks ()
#2  0x00007f463ef0eee0 in broker_service_check ()
#3  0x00007f463ef1abd3 in handle_async_service_check_result ()
#4  0x00007f463ef2306e in process_passive_service_check ()
#5  0x00007f463ef23193 in cmd_process_service_check_result ()
#6  0x00007f463ef28c15 in process_external_command1 ()
#7  0x00007f463cca9cd4 in Store::answerCommandRequest(char const*) () from /usr/lib64/check_mk/livestatus.o
#8  0x00007f463ccaa2bd in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/check_mk/livestatus.o
#9  0x00007f463cca99e6 in store_answer_request () from /usr/lib64/check_mk/livestatus.o
#10 0x00007f463cce8540 in client_thread () from /usr/lib64/check_mk/livestatus.o
#11 0x00007f463dc96dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f463e50676d in clone () from /lib64/libc.so.6

[1501059260] Caught SIGTERM, shutting down...

Far from being knowledgeable on the topic, but it seems that the actual crash is happening in mod_gearman.

dan-m-joh commented 7 years ago

I also had some crashes (mainly with 4.3.2) when using mod_gearman. Compiling my own mod_gearman-3.0.5 after having replaced the include-files in includes/nagios4 and include/nagios4/lib with the latest from the Nagios 4.3.2 sources seems to have solved the issue. As of today it has been running for about two days without issue (before I had a restart about every second minute).

See: https://github.com/sni/mod_gearman/issues/110

wleese commented 7 years ago

@dan-m-joh

downloaded mod_gearman 3.0.5 and nagios 4.2.4 sources.
went through all the includes in mod_gearman and synced them with nagios 4.2.4
required an additional file from the nagios includes "nwrite.h" - added that too
compiled, installed, restarted nagios

[1501066933] mod_gearman: initialized version 3.0.5 (libgearman 1.1.12)
[1501066933] Event broker module '/usr/lib64/mod_gearman/mod_gearman_nagios4.o' initialized successfully.

snip

[1501067007] Caught SIGTERM, shutting down...

Ran the submit passive result again and it's broken again.

New backtrace with more debug symbols:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5088521700 (LWP 16684)]
0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
1145        gettimeofday(&core_time,NULL);
(gdb) bt
#0  0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
#1  0x00007f5088703aff in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7f5088520a00) at nebmods.c:529
#2  0x00007f5088701ee0 in broker_service_check (type=type@entry=701, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x7f508a1d56a0, check_type=1, 
    start_time=..., end_time=..., cmd=cmd@entry=0x0, latency=0.79036099999999998, exectime=<optimized out>, timeout=60, early_timeout=0, retcode=0, 
    cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x7f5088520c20) at broker.c:326
#3  0x00007f508870dbd3 in handle_async_service_check_result (temp_service=temp_service@entry=0x7f508a1d56a0, 
    queued_check_result=queued_check_result@entry=0x7f5088520c20) at checks.c:1024
#4  0x00007f508871606e in process_passive_service_check (check_time=check_time@entry=1501067781, 
    host_name=host_name@entry=0x7f5068001a80 "xxxxxhostname", 
    svc_description=svc_description@entry=0x7f5068001bd0 "xxxxx", 
    return_code=return_code@entry=0, output=output@entry=0x7f50680019e0 "JL: Due to flexclone|") at commands.c:2272
#5  0x00007f5088716193 in cmd_process_service_check_result (cmd=cmd@entry=30, check_time=check_time@entry=1501067781, 
    args=args@entry=0x7f5068001ae0 "xxxhostname;xxxxservice;0;JL: Due to flexclone|")
    at commands.c:2207
#6  0x00007f508871aa3c in process_external_command2 (cmd=cmd@entry=30, entry_time=entry_time@entry=1501067781, 
    args=args@entry=0x7f5068001ae0 "xxxxhostname;xxxxservice;0;JL: Due to flexclone|")
    at commands.c:1215
#7  0x00007f508871bc15 in process_external_command1 (cmd=<optimized out>) at commands.c:902
#8  0x00007f5086493cd4 in Store::answerCommandRequest(char const*) () from /usr/lib64/check_mk/livestatus.o
#9  0x00007f50864942bd in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/check_mk/livestatus.o
#10 0x00007f50864939e6 in store_answer_request () from /usr/lib64/check_mk/livestatus.o
#11 0x00007f50864d2540 in client_thread () from /usr/lib64/check_mk/livestatus.o
#12 0x00007f508748bdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f5087cf976d in clone () from /lib64/libc.so.6
(gdb)

dan-m-joh commented 7 years ago

I used the same procedure (also had to add that "extra" header-file). I see one difference though - I am using an older version of libgearman:

[1500890242] mod_gearman: initialized version 3.0.5 (libgearman 0.33)
[1500890242] Event broker module '/usr/lib64/mod_gearman/mod_gearman_nagios4.o' initialized successfully.

D/\N

dan-m-joh commented 7 years ago

I see you are using Thruk. Are you using the Business-Process feature of Thruk? If yes, are you sure that you are using "spool_dir =" and not "result_backend =" in the Thruk configuration. Using "result_backend =" here crashes Nagios as well (looks like a problem with mk_livestatus when you "push" data back to Nagios, could be that mk_livestatus is compiled with the "wrong" Nagios-headers as well). D/\N

wleese commented 7 years ago

Are you using the Business-Process feature of Thruk

I honestly do not know. We use Thruk as a simple interface for Nagios to schedule downtime, enable/disable notifications and sometimes submit passive results.

are you sure that you are using "spool_dir =" and not "result_backend ="

<Component Thruk::Backend>
    <peer>
        name   = Core
        type   = livestatus
        <options>
            peer          = /var/spool/nagios/cmd/live
            resource_file = /etc/nagios/private/resource.cfg
       </options>
       <configtool>
            core_conf      = /etc/nagios/nagios.cfg
            obj_check_cmd  = /usr/sbin/nagios -v /etc/nagios/nagios.cfg
            obj_reload_cmd = systemctl reload nagios
       </configtool>
    </peer>
</Component>

#####################################
# Business Process
<Component Thruk::Plugin::BP>
    # Results will be send back by using the spool folder.
    # This folder should point to the 'check_result_path' of your core.
    spool_dir = /srv/nagios/checkresults

result_backend is not configured:

# grep result_backend thruk*.conf
thruk.conf:    #result_backend = sitename
thruk.conf:    # and result_backend is used, the restart will be triggered as external command.

It's curious how the backtrace points to gettimeofday(). Inspecting the parameters shows no issue (for someone who doesn't know C or much debugging) :

(gdb) frame 0
#0  0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
1145        gettimeofday(&core_time,NULL);
(gdb) print &core_time
$1 = (struct timeval *) 0x7f5088520950
(gdb) p /s core_time
$2 = {tv_sec = 4294967296, tv_usec = 8255572783}

I'm probably just not aware of gdb's quirks in this area.

dan-m-joh commented 7 years ago

If I interpret this correctly:

$2 = {tv_sec = 4294967296, tv_usec = 8255572783}

$ date -d @4294967296
Sun Feb  7 07:28:16 CET 2106

That is some time in the future ;-) And tv_usec > 999999 also does not look right.

This looks like some memory overrun somewhere. I can not say if this is from mod_gearman or Thruk.

D/\N

wleese commented 7 years ago

What's the next troubleshooting step? Would a core dump be useful? Some output from valgrind?

dan-m-joh commented 7 years ago

Sorry, I can not help you any further. I am "only" a "normal" user like you.

D/\N

wleese commented 7 years ago

I can confirm that up to broker_service_check() there is no corruption in the ds->timestamp field.

(gdb) frame 2
#2  0x00007f5088701ee0 in broker_service_check (type=type@entry=701, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x7f508a1d56a0, check_type=1, start_time=..., end_time=..., cmd=cmd@entry=0x0, latency=0.79036099999999998, exectime=<optimized out>, timeout=60, early_timeout=0, retcode=0, 
    cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x7f5088520c20) at broker.c:326
326     return_code = neb_make_callbacks(NEBCALLBACK_SERVICE_CHECK_DATA, (void *)&ds);
(gdb) bt
#0  0x00007f50879b29d3 in handle_svc_check (event_type=6, data=0x7f5088520a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1145
#1  0x00007f5088703aff in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7f5088520a00) at nebmods.c:529
#2  0x00007f5088701ee0 in broker_service_check (type=type@entry=701, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x7f508a1d56a0, check_type=1, start_time=..., end_time=..., cmd=cmd@entry=0x0, latency=0.79036099999999998, exectime=<optimized out>, timeout=60, early_timeout=0, retcode=0, 
    cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x7f5088520c20) at broker.c:326
#3  0x00007f508870dbd3 in handle_async_service_check_result (temp_service=temp_service@entry=0x7f508a1d56a0, queued_check_result=queued_check_result@entry=0x7f5088520c20) at checks.c:1024
#4  0x00007f508871606e in process_passive_service_check (check_time=check_time@entry=1501067781, host_name=host_name@entry=0x7f5068001a80 "xxxx", svc_description=svc_description@entry=0x7f5068001bd0 "xxxx", 
    return_code=return_code@entry=0, output=output@entry=0x7f50680019e0 "JL: Due to flexclone|") at commands.c:2272
#5  0x00007f5088716193 in cmd_process_service_check_result (cmd=cmd@entry=30, check_time=check_time@entry=1501067781, args=args@entry=0x7f5068001ae0 "xxx;xxx;0;JL: Due to flexclone|") at commands.c:2207
#6  0x00007f508871aa3c in process_external_command2 (cmd=cmd@entry=30, entry_time=entry_time@entry=1501067781, args=args@entry=0x7f5068001ae0 "xxx;xxx;0;JL: Due to flexclone|") at commands.c:1215
#7  0x00007f508871bc15 in process_external_command1 (cmd=<optimized out>) at commands.c:902
#8  0x00007f5086493cd4 in Store::answerCommandRequest(char const*) () from /usr/lib64/check_mk/livestatus.o
#9  0x00007f50864942bd in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/check_mk/livestatus.o
#10 0x00007f50864939e6 in store_answer_request () from /usr/lib64/check_mk/livestatus.o
#11 0x00007f50864d2540 in client_thread () from /usr/lib64/check_mk/livestatus.o
#12 0x00007f508748bdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f5087cf976d in clone () from /lib64/libc.so.6
(gdb) p /s ds
$5 = {type = 701, flags = 0, attr = 0, timestamp = {tv_sec = 1501067781, tv_usec = 790416}, host_name = 0x7f508b13dd70 "xxxx", service_description = 0x7f508a1d59f0 "xxxx", check_type = 1, current_attempt = 1, max_attempts = 3, 
  state_type = 1, state = 0, timeout = 60, command_name = 0x0, command_args = 0x0, command_line = 0x0, start_time = {tv_sec = 1501067781, tv_usec = 0}, end_time = {tv_sec = 1501067781, tv_usec = 0}, early_timeout = 0, execution_time = 0, latency = 0.79036099999999998, return_code = 0, 
  output = 0x7f5068001c60 "JL: Due to flexclone", long_output = 0x0, perf_data = 0x0, check_result_ptr = 0x7f5088520c20, object_ptr = 0x7f508a1d56a0}
(gdb)

I'll add some debug logging tomorrow to see if I can catch where it gets corrupted.

sni commented 7 years ago

This sounds familiar, we fixed something like that a couple of years ago in naemons livestatus module: https://github.com/naemon/naemon-livestatus/commit/a72a5b59fe7ab1de5b9d18fc89489c6e0b27a5a4 There is nothing mod-gearman or thruk can do about it, this should be fixed in the livestatus module.

wleese commented 7 years ago

@sni so neb module calling a neb module causes corruption somewhere? Has an issue been created elsewhere that tracks the underlying issue?

sni commented 7 years ago

The problem here is that livestatus, calls process_external_command1 from a separate thread which then corrupts the memory. The Naemon/Nagios core is not threadsafe. The workaround was to send the command to the queryhandler, so Naemon can read/process the command from inside the main thread.

wleese commented 7 years ago

In other words Thruk (requires livestatus) + Nagios + mod_gearman should be an unsupported setup?

sni commented 7 years ago

It should work mostly, just don't submit passive check results via livestatus until this is fixed.

wleese commented 7 years ago

haha, well, easier said than done. This problem didn't exist in Nagios3, so we have hundreds of users who could at any time submit a passive result :)

dan-m-joh commented 7 years ago

I have just checked the latest mk_livestatus sources (1.2.8p25 and 1.4.0p9) and in both are still using "process_external_command1((char *)command);". (:-(

wleese commented 7 years ago

Yeah, I'm monkey patching now.

wleese commented 7 years ago

Closing this as it's clearly not mod_gearman to blame. Here's my monkey patch:

/mk-livestatus-1.2.8p25/src git:(master) ✗ diff -u Store.cc.org Store.cc
--- Store.cc.org    2017-08-01 10:55:37.555389499 +0200
+++ Store.cc    2017-08-01 12:35:21.541926533 +0200
@@ -21,6 +21,10 @@
 // License along with GNU Make; see the file  COPYING.  If  not,  write
 // to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
 // Boston, MA 02110-1301 USA.
+extern "C" {
+   #include "nagios4/libnagios.h"
+}
+

 #include "Store.h"
 #include <string.h>
@@ -34,6 +38,7 @@
 #include "global_counters.h"
 #include "logger.h"
 #include "strutil.h"
+#include <fstream>

 // TODO(sp): Remove this hack.
 #ifdef EXTERN
@@ -176,10 +181,46 @@
     return output->doKeepalive();
 }

+/* define a fake iobroker_register function for the libnagios call */
+static void fake_iobreg(int fdout, int fderr, void *arg) { }
+
 void Store::answerCommandRequest(const char *command) {
     lock_guard<mutex> lg(_command_mutex);
 #ifdef NAGIOS4
-    process_external_command1((char *)command);
+    const char *nagioscmd_file = "/var/spool/nagios/cmd/nagios.cmd";
+
+    if (std::ifstream(nagioscmd_file)) {
+       int BUFFER = 128;
+       char *cmd;
+       int pfd[2] = {-1, -1};
+       int pfderr[2] = {-1, -1};
+       int fake_iobregarg = 0;
+       int fd;
+       char *out = (char*)calloc(1, BUFFER);
+
+       asprintf(&cmd, "echo \"%s\" > %s", command, nagioscmd_file);
+       fd = runcmd_open(cmd, pfd, pfderr, NULL, fake_iobreg, &fake_iobregarg);
+
+       /* get the output from the stdout file descriptor into the out var */
+       read(pfd[0], out, BUFFER);
+
+       runcmd_close(fd);
+
+       if (g_debug_level > 0) {
+          logger(LG_INFO,
+                "External Command redirected to %s: %s",
+                nagioscmd_file, cmd);
+       }
+
+       /* house-keeping */
+       free(cmd);
+       free(out);
+       close(pfd[0]);
+       close(pfderr[0]);
+       close(fd);
+    } else {
+       logger(LG_INFO, "External Command file missing");
+    }
 #else
     int buffer_items = -1;
     /* int ret = */

sni / mod_gearman

Nagios 4.2.4 Crash with mod_gearman-3.0.4 when submitting passive result #122