sni / Thruk

Thruk is a multibackend monitoring webinterface for Naemon, Nagios, Icinga and Shinken using the Livestatus API.
http://www.thruk.org

Nagios 4 + Livestatus results in Crashes after submitting passive check #741

Closed wleese closed 7 years ago

wleese commented 7 years ago

Similar to https://github.com/sni/Thruk/issues/559

Nagios: 4.2.4 Livestatus: check-mk-livestatus-1.2.8p24 Thruk: thruk-2.06

[2017/07/26 09:52:28][xxx][INFO][Thruk] [username][Core] cmd: COMMAND [1501055548] PROCESS_SERVICE_CHECK_RESULT;hostname;servicename;0;JL: Due to flexclone|

This sometimes causes a segfault in Nagios. It's unclear to me which component in the stack is to blame here. I assume Thruk sends this command using livestatus?
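
For reference, the logged line follows the standard Nagios external command layout `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output`, prefixed with the livestatus keyword `COMMAND`. A minimal sketch of how such a line could be assembled is shown here; the helper name `passive_service_result` is hypothetical and is not taken from Thruk or livestatus.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper (not part of Thruk or livestatus): builds one
// livestatus COMMAND line that submits a passive service check result.
std::string passive_service_result(std::time_t when,
                                   const std::string &host,
                                   const std::string &service,
                                   int return_code,
                                   const std::string &plugin_output) {
    // Layout: COMMAND [epoch] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    return "COMMAND [" + std::to_string(static_cast<long long>(when)) + "] " +
           "PROCESS_SERVICE_CHECK_RESULT;" + host + ";" + service + ";" +
           std::to_string(return_code) + ";" + plugin_output + "\n";
}
```

Calling it with the values from the log entry above (timestamp 1501055548, return code 0) reproduces the command text that was logged, followed by a newline.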

My config:

<Component Thruk::Backend>
    <peer>
        name   = Core
        type   = livestatus
        <options>
            peer          = /var/spool/nagios/cmd/live
            resource_file = /etc/nagios/private/resource.cfg
       </options>
       <configtool>
            core_conf      = /etc/nagios/nagios.cfg
            obj_check_cmd  = /usr/sbin/nagios -v /etc/nagios/nagios.cfg
            obj_reload_cmd = systemctl reload nagios
       </configtool>
    </peer>
</Component>

#####################################
# Business Process
<Component Thruk::Plugin::BP>
    # Results will be send back by using the spool folder.
    # This folder should point to the 'check_result_path' of your core.
    spool_dir = /srv/nagios/checkresults

    # Save objects to this file. Content will be overwritten.
    objects_save_file = /etc/nagios/conf.d/thruk_bp_generated.cfg

    # User maintained file containing templates used for business process services.
    objects_templates_file = /etc/nagios/conf.d/thruk_bp_templates.cfg

    # Command to apply changes to the objects_save_file
    objects_reload_cmd = systemctl reload nagios

    # hooks which will be executed before or after saving.
    #pre_save_cmd   =
    #post_save_cmd  =

    # Refresh interval defines how often business processes
    # will be recalculated and refreshed. (in minutes)
    #refresh_interval = 1
</Component>

Nagios integration:

# Event broker
event_broker_options=-1
broker_module=/usr/lib64/check_mk/livestatus.o /var/spool/nagios/cmd/live max_cached_messages=10000000

Been discussing this issue with the nagios devs here: https://github.com/NagiosEnterprises/nagioscore/issues/391

wleese commented 7 years ago

Far from being knowledgeable on the topic, but it seems that the actual crash is happening in mod_gearman, so I've created yet another issue (sorry!) there.

dan-m-joh commented 7 years ago

I also had some crashes (mainly with 4.3.2) when using mod_gearman. Compiling my own mod_gearman-3.0.5 after having replaced the include-files in includes/nagios4 and include/nagios4/lib with the latest from the Nagios 4.3.2 sources seems to have solved the issue. As of today it has been running for about two days without issue (before I had a restart about every second minute). -- D/\N

wleese commented 7 years ago

@dan-m-joh

  1. downloaded mod_gearman 3.0.5 and nagios 4.2.4 sources.
  2. went through all the includes in mod_gearman and synced them with nagios 4.2.4
  3. required an additional file from the nagios includes "nwrite.h" - added that too
  4. compiled, installed, restarted nagios
[1501066933] mod_gearman: initialized version 3.0.5 (libgearman 1.1.12)
[1501066933] Event broker module '/usr/lib64/mod_gearman/mod_gearman_nagios4.o' initialized successfully.

snip

[1501067007] Caught SIGTERM, shutting down...

Submitted a passive result again and Nagios crashed again. I'm adding the new backtrace to https://github.com/sni/mod_gearman/issues/122

wleese commented 7 years ago

https://github.com/sni/mod_gearman/issues/122#issuecomment-318308844

The problem here is that livestatus calls process_external_command1 from a separate thread, which then corrupts the memory. The Naemon/Nagios core is not thread-safe. The workaround was to send the command to the query handler, so Naemon can read and process the command from inside the main thread.

wleese commented 7 years ago

A nice workaround would be to allow Thruk to use the queryhandler for submitting passive results (and maybe more). Is this already possible?
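
The general idea behind that workaround (worker threads only hand commands over, and a single main thread executes them) can be sketched as follows. This is an illustration under stated assumptions, not the actual Naemon or livestatus code: `CommandQueue` and `process_on_main_thread` are hypothetical names, and the processing step merely counts commands instead of calling into a real core.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Nagios, Naemon or livestatus.
class CommandQueue {
public:
    // Safe to call from any thread: stores the command, never executes it.
    void submit(std::string command) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push_back(std::move(command));
    }

    // Intended for the main thread only: takes everything queued so far.
    std::vector<std::string> drain() {
        std::lock_guard<std::mutex> guard(mutex_);
        std::vector<std::string> batch;
        batch.swap(pending_);
        return batch;
    }

private:
    std::mutex mutex_;
    std::vector<std::string> pending_;
};

// Stand-in for the non-thread-safe core call. A real core would parse and
// apply each command here; this sketch only counts them.
inline int process_on_main_thread(CommandQueue &queue) {
    int handled = 0;
    for (const std::string &command : queue.drain()) {
        (void)command;
        ++handled;
    }
    return handled;
}
```

The design point is that the lock protects only the hand-over, so the core's own data structures are touched by one thread and need no locking of their own.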

sni commented 7 years ago

That's not possible right now and I am not sure if it's a good idea at all. Right now you have a single connection via livestatus to do everything. Using a different connection for each task makes things more complicated and error-prone. And it would work only with local instances anyway. The only advantage would be that the query handler gives you an actual result (or a helpful error) for the command instead of just fire-and-forget as in livestatus.

As a simple workaround, you could disable the command via: https://thruk.org/documentation/configuration.html#command_disabled

wleese commented 7 years ago

The only advantage would be

Well, that and a Thruk + livestatus + mod_gearman + nagios4 stack would not crash when submitting a passive result through Thruk ;)

As a simple workaround, you could disable the command via:

Thanks for the tip, but we have hundreds of users, > 100.000 services and have written lots and lots of glue around these tools. Disabling functionality because it became unstable due to a new version of Nagios (3->4) isn't an option.

sni commented 7 years ago

You can still patch livestatus yourself of course

wleese commented 7 years ago

That's another option.

wleese commented 7 years ago

Having issues with the patch - or rather getting this:

Error: Could not load module '/usr/lib64/check_mk/livestatus.o' -> /usr/lib64/check_mk/livestatus.o: undefined symbol: _Z16nsock_printf_nuliPKcz

Somehow simply applying the patch isn't sufficient. I've also tried syncing the headers from nagios4, but that didn't help either. I could confirm that the installed libnagios.a library has the nsock_printf_nul symbol, but do not know how to make the code make use of it. Tried things like:

but no go.

Nor this:

g++-5  -L/usr/lib64/nagios -lnagios -s -fPIC -shared livestatus_so-AndingFilter.o livestatus_so-ClientQueue.o livestatus_so-Column.o livestatus_so-ColumnsColumn.o livestatus_so-CustomVarsExplicitColumn.o livestatus_so-ContactsColumn.o livestatus_so-CustomVarsColumn.o livestatus_so-CustomVarsFilter.o livestatus_so-DoubleColumn.o livestatus_so-DoubleColumnFilter.o livestatus_so-DowntimeOrComment.o livestatus_so-DownCommColumn.o livestatus_so-DynamicColumn.o livestatus_so-EmptyColumn.o livestatus_so-NullColumn.o livestatus_so-Filter.o livestatus_so-GlobalCountersColumn.o livestatus_so-HostContactsColumn.o livestatus_so-HostgroupsColumn.o livestatus_so-HostlistColumn.o livestatus_so-HostlistColumnFilter.o livestatus_so-HostlistStateColumn.o livestatus_so-MetricsColumn.o livestatus_so-HostSpecialIntColumn.o livestatus_so-ServiceSpecialIntColumn.o livestatus_so-InputBuffer.o livestatus_so-IntColumn.o livestatus_so-IntColumnFilter.o livestatus_so-ListColumn.o livestatus_so-ListColumnFilter.o livestatus_so-OffsetDoubleColumn.o livestatus_so-OffsetIntColumn.o livestatus_so-OffsetStringColumn.o livestatus_so-OffsetTimeperiodColumn.o livestatus_so-OringFilter.o livestatus_so-OutputBuffer.o livestatus_so-OffsetTimeColumn.o livestatus_so-TimePointerColumn.o livestatus_so-TimeColumnFilter.o livestatus_so-PerfdataAggregator.o livestatus_so-Query.o livestatus_so-ServiceContactsColumn.o livestatus_so-ServicegroupsColumn.o livestatus_so-ServicelistColumn.o livestatus_so-ServicelistColumnFilter.o livestatus_so-ServicelistStateColumn.o livestatus_so-store_c.o livestatus_so-Store.o livestatus_so-StringColumn.o livestatus_so-StringColumnFilter.o livestatus_so-strutil.o livestatus_so-Table.o livestatus_so-TableColumns.o livestatus_so-StatusSpecialIntColumn.o livestatus_so-HostSpecialDoubleColumn.o livestatus_so-TableCommands.o livestatus_so-TableContacts.o livestatus_so-TableDownComm.o livestatus_so-TableHostgroups.o livestatus_so-ServiceSpecialDoubleColumn.o livestatus_so-TableHosts.o 
livestatus_so-TableServicegroups.o livestatus_so-TableServices.o livestatus_so-TableStatus.o livestatus_so-LogEntry.o livestatus_so-LogCache.o livestatus_so-Logfile.o livestatus_so-TableStateHistory.o livestatus_so-TableLog.o livestatus_so-TableTimeperiods.o livestatus_so-TableContactgroups.o livestatus_so-ContactgroupsMemberColumn.o livestatus_so-OffsetStringMacroColumn.o livestatus_so-OffsetStringServiceMacroColumn.o livestatus_so-OffsetStringHostMacroColumn.o livestatus_so-StatsColumn.o livestatus_so-IntAggregator.o livestatus_so-CountAggregator.o livestatus_so-DoubleAggregator.o livestatus_so-AttributelistColumn.o livestatus_so-AttributelistFilter.o livestatus_so-BlobColumn.o livestatus_so-HostFileColumn.o livestatus_so-global_counters.o livestatus_so-module.o livestatus_so-logger.o livestatus_so-waittriggers.o livestatus_so-TimeperiodsCache.o livestatus_so-pnp4nagios.o livestatus_so-mk_inventory.o livestatus_so-ContactgroupsColumn.o livestatus_so-CustomTimeperiodColumn.o livestatus_so-HostServiceState.o livestatus_so-opids.o livestatus_so-auth.o -o livestatus.o -lpthread -static-libstdc++

..with the explicit "-L/usr/lib64/nagios -lnagios", even though the library does export the symbol:

# nm /usr/lib64/nagios/libnagios.a | grep nsock_printf_nul
00000000000003f0 T nsock_printf_nul

..yet the undefined symbol error remains.

More importantly however, this is what I got back from my post to the check_mk mailinglist:

(ugly top posting)

Naemon and Nagios4 are nearly the same. As Naemon is the real Nagios4 :)
For Thruk the preferred connection type is livestatus. 
The livestatus for Nagios4 is not actively maintained I think. The used header files for compilation are from Nagios 4.0.2 and that's over 3 years old.

br
Andreas

William Leese <wleese@bol.com> wrote on Fri, 28 July 2017 at 07:17:
Because we're heavily invested in nagios and have written lots of glue around it.

But I take your response as: Thruk should stop using mk livestatus to talk to Nagios4?

Seems nagios4 support isn't a priority, assuming Andreas is authoritative on the matter.

dan-m-joh commented 7 years ago

I got exactly the same error when I tried the patch on mk_livestatus. (:-( It is a loooooong time ago that I did any C programming and I cannot understand why "nsock_printf_nul" gets renamed to "_Z16nsock_printf_nuliPKcz" when other functions do not get renamed.
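
On the renaming: `_Z16nsock_printf_nuliPKcz` has the shape of a C++ mangled name, which a C++ compiler emits when it sees a function declaration that is not wrapped in `extern "C"`. The C library `libnagios.a` exports the plain name `nsock_printf_nul`, so the linker finds no match. That is a likely explanation rather than a confirmed diagnosis. A minimal sketch to decode such a symbol, assuming GCC or Clang (which provide `abi::__cxa_demangle` in `<cxxabi.h>`); the wrapper name `demangle` is hypothetical:

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle, available with GCC and Clang
#include <cstdlib>
#include <string>

// Returns the readable form of an Itanium-ABI mangled symbol, or the
// input unchanged when the demangler rejects it.
std::string demangle(const char *symbol) {
    int status = 0;
    char *readable = abi::__cxa_demangle(symbol, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return symbol;
    }
    std::string result(readable);
    std::free(readable);  // the demangler allocates the buffer with malloc
    return result;
}
```

Passing `"_Z16nsock_printf_nuliPKcz"` to this helper should yield a C++ signature for `nsock_printf_nul` taking an `int`, a `const char *` and variadic arguments, which shows that the header declaring it was compiled as C++ without C linkage.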

wleese commented 7 years ago

Monkey patch:

/mk-livestatus-1.2.8p25/src git:(master) ✗ diff -u Store.cc.org Store.cc
--- Store.cc.org    2017-08-01 10:55:37.555389499 +0200
+++ Store.cc    2017-08-01 12:35:21.541926533 +0200
@@ -21,6 +21,10 @@
 // License along with GNU Make; see the file  COPYING.  If  not,  write
 // to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
 // Boston, MA 02110-1301 USA.
+extern "C" {
+   #include "nagios4/libnagios.h"
+}
+

 #include "Store.h"
 #include <string.h>
@@ -34,6 +38,7 @@
 #include "global_counters.h"
 #include "logger.h"
 #include "strutil.h"
+#include <fstream>

 // TODO(sp): Remove this hack.
 #ifdef EXTERN
@@ -176,10 +181,46 @@
     return output->doKeepalive();
 }

+/* define a fake iobroker_register function for the libnagios call */
+static void fake_iobreg(int fdout, int fderr, void *arg) { }
+
 void Store::answerCommandRequest(const char *command) {
     lock_guard<mutex> lg(_command_mutex);
 #ifdef NAGIOS4
-    process_external_command1((char *)command);
+    const char *nagioscmd_file = "/var/spool/nagios/cmd/nagios.cmd";
+
+    if (std::ifstream(nagioscmd_file)) {
+       int BUFFER = 128;
+       char *cmd;
+       int pfd[2] = {-1, -1};
+       int pfderr[2] = {-1, -1};
+       int fake_iobregarg = 0;
+       int fd;
+       char *out = (char*)calloc(1, BUFFER);
+
+       asprintf(&cmd, "echo \"%s\" > %s", command, nagioscmd_file);
+       fd = runcmd_open(cmd, pfd, pfderr, NULL, fake_iobreg, &fake_iobregarg);
+
+       /* get the output from the stdout file descriptor into the out var */
+       read(pfd[0], out, BUFFER);
+
+       runcmd_close(fd);
+
+       if (g_debug_level > 0) {
+          logger(LG_INFO,
+                "External Command redirected to %s: %s",
+                nagioscmd_file, cmd);
+       }
+
+       /* house-keeping */
+       free(cmd);
+       free(out);
+       close(pfd[0]);
+       close(pfderr[0]);
+       close(fd);
+    } else {
+       logger(LG_INFO, "External Command file missing");
+    }
 #else
     int buffer_items = -1;
     /* int ret = */