Closed by wleese 7 years ago
I'm far from knowledgeable on the topic, but it seems that the actual crash is happening in mod_gearman, so I've created yet another issue (sorry!) there.
I also had some crashes (mainly with 4.3.2) when using mod_gearman. Compiling my own mod_gearman-3.0.5, after replacing the include files in includes/nagios4 and include/nagios4/lib with the latest ones from the Nagios 4.3.2 sources, seems to have solved the issue. As of today it has been running for about two days without issue (before that, it restarted about every two minutes). -- D/\N
@dan-m-joh
[1501066933] mod_gearman: initialized version 3.0.5 (libgearman 1.1.12)
[1501066933] Event broker module '/usr/lib64/mod_gearman/mod_gearman_nagios4.o' initialized successfully.
snip
[1501067007] Caught SIGTERM, shutting down...
I submitted a passive result again and it broke again. I'm adding the new backtrace to https://github.com/sni/mod_gearman/issues/122
https://github.com/sni/mod_gearman/issues/122#issuecomment-318308844
The problem here is that livestatus calls process_external_command1 from a separate thread, which then corrupts memory. The Naemon/Nagios core is not thread-safe. The workaround was to send the command to the query handler, so Naemon can read and process the command from inside the main thread.
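To illustrate the idea behind that workaround, here is a minimal sketch of handing commands from worker threads to the main thread. `CommandQueue`, `submit` and `drain` are hypothetical names for illustration only; this is not how livestatus or the Naemon query handler is actually implemented.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical helper, not part of livestatus or Naemon: worker threads
// only enqueue commands, and the main thread is the only one that runs them.
class CommandQueue {
public:
    // Safe to call from any thread.
    void submit(std::string command) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(std::move(command));
    }

    // Intended to be called from the main thread only. Moves all pending
    // commands out under the lock, then runs the handler without holding it.
    // Returns the number of commands handled.
    template <typename Handler>
    std::size_t drain(Handler handle) {
        std::queue<std::string> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(batch, pending_);
        }
        std::size_t count = 0;
        while (!batch.empty()) {
            handle(batch.front());
            batch.pop();
            ++count;
        }
        return count;
    }

private:
    std::mutex mutex_;
    std::queue<std::string> pending_;
};
```

In this sketch the non-thread-safe call (for example process_external_command1) would only ever appear inside the handler passed to `drain`, so it always runs on the main thread regardless of which thread submitted the command.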
A nice workaround would be to allow Thruk to use the queryhandler for submitting passive results (and maybe more). Is this already possible?
That's not possible right now, and I'm not sure it's a good idea at all. Right now you have a single livestatus connection to do everything. Using a different connection for each task makes things more complicated and error-prone, and it would only work with local instances anyway. The only advantage would be that the query handler gives you an actual result (or a helpful error) for the command, instead of the fire-and-forget behaviour of livestatus.
As a simple workaround, you could disable the command via: https://thruk.org/documentation/configuration.html#command_disabled
The only advantage would be
well, that and a Thruk + livestatus + mod_gearman + nagios4 stack would not crash when submitting a passive result through Thruk ;)
As a simple workaround, you could disable the command via:
Thanks for the tip, but we have hundreds of users and more than 100,000 services, and have written lots and lots of glue around these tools. Disabling functionality because it became unstable with a new version of Nagios (3->4) isn't an option.
You can still patch livestatus yourself of course
That's another option.
Having issues with the patch - or rather getting this:
Error: Could not load module '/usr/lib64/check_mk/livestatus.o' -> /usr/lib64/check_mk/livestatus.o: undefined symbol: _Z16nsock_printf_nuliPKcz
Somehow simply applying the patch isn't sufficient. I've also tried syncing the headers from nagios4, but that didn't help either. I could confirm that the installed libnagios.a library has the nsock_printf_nul symbol, but I don't know how to make the build actually use it. Tried things like:
but no go.
Nor this:
g++-5 -L/usr/lib64/nagios -lnagios -s -fPIC -shared livestatus_so-AndingFilter.o livestatus_so-ClientQueue.o livestatus_so-Column.o livestatus_so-ColumnsColumn.o livestatus_so-CustomVarsExplicitColumn.o livestatus_so-ContactsColumn.o livestatus_so-CustomVarsColumn.o livestatus_so-CustomVarsFilter.o livestatus_so-DoubleColumn.o livestatus_so-DoubleColumnFilter.o livestatus_so-DowntimeOrComment.o livestatus_so-DownCommColumn.o livestatus_so-DynamicColumn.o livestatus_so-EmptyColumn.o livestatus_so-NullColumn.o livestatus_so-Filter.o livestatus_so-GlobalCountersColumn.o livestatus_so-HostContactsColumn.o livestatus_so-HostgroupsColumn.o livestatus_so-HostlistColumn.o livestatus_so-HostlistColumnFilter.o livestatus_so-HostlistStateColumn.o livestatus_so-MetricsColumn.o livestatus_so-HostSpecialIntColumn.o livestatus_so-ServiceSpecialIntColumn.o livestatus_so-InputBuffer.o livestatus_so-IntColumn.o livestatus_so-IntColumnFilter.o livestatus_so-ListColumn.o livestatus_so-ListColumnFilter.o livestatus_so-OffsetDoubleColumn.o livestatus_so-OffsetIntColumn.o livestatus_so-OffsetStringColumn.o livestatus_so-OffsetTimeperiodColumn.o livestatus_so-OringFilter.o livestatus_so-OutputBuffer.o livestatus_so-OffsetTimeColumn.o livestatus_so-TimePointerColumn.o livestatus_so-TimeColumnFilter.o livestatus_so-PerfdataAggregator.o livestatus_so-Query.o livestatus_so-ServiceContactsColumn.o livestatus_so-ServicegroupsColumn.o livestatus_so-ServicelistColumn.o livestatus_so-ServicelistColumnFilter.o livestatus_so-ServicelistStateColumn.o livestatus_so-store_c.o livestatus_so-Store.o livestatus_so-StringColumn.o livestatus_so-StringColumnFilter.o livestatus_so-strutil.o livestatus_so-Table.o livestatus_so-TableColumns.o livestatus_so-StatusSpecialIntColumn.o livestatus_so-HostSpecialDoubleColumn.o livestatus_so-TableCommands.o livestatus_so-TableContacts.o livestatus_so-TableDownComm.o livestatus_so-TableHostgroups.o livestatus_so-ServiceSpecialDoubleColumn.o livestatus_so-TableHosts.o 
livestatus_so-TableServicegroups.o livestatus_so-TableServices.o livestatus_so-TableStatus.o livestatus_so-LogEntry.o livestatus_so-LogCache.o livestatus_so-Logfile.o livestatus_so-TableStateHistory.o livestatus_so-TableLog.o livestatus_so-TableTimeperiods.o livestatus_so-TableContactgroups.o livestatus_so-ContactgroupsMemberColumn.o livestatus_so-OffsetStringMacroColumn.o livestatus_so-OffsetStringServiceMacroColumn.o livestatus_so-OffsetStringHostMacroColumn.o livestatus_so-StatsColumn.o livestatus_so-IntAggregator.o livestatus_so-CountAggregator.o livestatus_so-DoubleAggregator.o livestatus_so-AttributelistColumn.o livestatus_so-AttributelistFilter.o livestatus_so-BlobColumn.o livestatus_so-HostFileColumn.o livestatus_so-global_counters.o livestatus_so-module.o livestatus_so-logger.o livestatus_so-waittriggers.o livestatus_so-TimeperiodsCache.o livestatus_so-pnp4nagios.o livestatus_so-mk_inventory.o livestatus_so-ContactgroupsColumn.o livestatus_so-CustomTimeperiodColumn.o livestatus_so-HostServiceState.o livestatus_so-opids.o livestatus_so-auth.o -o livestatus.o -lpthread -static-libstdc++
..with the explicit "-L/usr/lib64/nagios -lnagios", even though:
# nm /usr/lib64/nagios/libnagios.a | grep nsock_printf_nul
00000000000003f0 T nsock_printf_nul
..yet the undefined symbol error remains.
More importantly however, this is what I got back from my post to the check_mk mailinglist:
(ugly top posting)
Naemon and Nagios4 are nearly the same, as Naemon is the real Nagios4 :)
For Thruk the preferred connection type is livestatus.
The livestatus for Nagios4 is not actively maintained I think. The used header files for compilation are from Nagios 4.0.2 and that's over 3 years old.
br
Andreas
William Leese <wleese@bol.com> wrote on Fri, 28 July 2017 at 07:17:
Because we're heavily invested in nagios and have written lots of glue around it.
But I take your response as: Thruk should stop using mk livestatus to talk to Nagios4?
Seems nagios4 support isn't a priority, assuming Andreas is authoritative on the matter.
I got exactly the same error when I tried the patch on mk_livestatus. :-( It is a long time since I did any C programming, and I cannot understand why "nsock_printf_nul" gets renamed to "_Z16nsock_printf_nuliPKcz" when other functions do not get renamed.
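The renaming is C++ name mangling. When a C header is included in a C++ source file without an `extern "C"` wrapper, the compiler treats its declarations as C++ functions and encodes the parameter types into the symbol name, while libnagios.a exports the plain C name. Declarations that are already inside an `extern "C"` block keep their C names. As a small sketch, the mangled name from the error can be decoded with `abi::__cxa_demangle`, which is specific to GCC and Clang; `demangle` is a hypothetical helper name.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang specific)
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the readable form of a mangled C++ symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const char *mangled) {
    int status = 0;
    char *readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return mangled;
    }
    std::string result(readable);
    std::free(readable);  // the demangler allocates the buffer with malloc
    return result;
}
```

Calling `demangle("_Z16nsock_printf_nuliPKcz")` yields `nsock_printf_nul(int, char const*, ...)`, which shows that the linker is looking for a C++ function with that signature rather than the C symbol `nsock_printf_nul` that `nm` reports in the library. Running `c++filt` on the symbol gives the same answer from the command line.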
Monkey patch:
/mk-livestatus-1.2.8p25/src git:(master) ✗ diff -u Store.cc.org Store.cc
--- Store.cc.org 2017-08-01 10:55:37.555389499 +0200
+++ Store.cc 2017-08-01 12:35:21.541926533 +0200
@@ -21,6 +21,10 @@
// License along with GNU Make; see the file COPYING. If not, write
// to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
// Boston, MA 02110-1301 USA.
+extern "C" {
+ #include "nagios4/libnagios.h"
+}
+
#include "Store.h"
#include <string.h>
@@ -34,6 +38,7 @@
#include "global_counters.h"
#include "logger.h"
#include "strutil.h"
+#include <fstream>
// TODO(sp): Remove this hack.
#ifdef EXTERN
@@ -176,10 +181,46 @@
return output->doKeepalive();
}
+/* define a fake iobroker_register function for the libnagios call */
+static void fake_iobreg(int fdout, int fderr, void *arg) { }
+
void Store::answerCommandRequest(const char *command) {
lock_guard<mutex> lg(_command_mutex);
#ifdef NAGIOS4
- process_external_command1((char *)command);
+ const char *nagioscmd_file = "/var/spool/nagios/cmd/nagios.cmd";
+
+ if (std::ifstream(nagioscmd_file)) {
+ int BUFFER = 128;
+ char *cmd;
+ int pfd[2] = {-1, -1};
+ int pfderr[2] = {-1, -1};
+ int fake_iobregarg = 0;
+ int fd;
+ char *out = (char*)calloc(1, BUFFER);
+
+ asprintf(&cmd, "echo \"%s\" > %s", command, nagioscmd_file);
+ fd = runcmd_open(cmd, pfd, pfderr, NULL, fake_iobreg, &fake_iobregarg);
+
+ /* get the output from the stdout file descriptor into the out var */
+ read(pfd[0], out, BUFFER);
+
+ runcmd_close(fd);
+
+ if (g_debug_level > 0) {
+ logger(LG_INFO,
+ "External Command redirected to %s: %s",
+ nagioscmd_file, cmd);
+ }
+
+ /* house-keeping */
+ free(cmd);
+ free(out);
+ close(pfd[0]);
+ close(pfderr[0]);
+ close(fd);
+ } else {
+ logger(LG_INFO, "External Command file missing");
+ }
#else
int buffer_items = -1;
/* int ret = */
Similar to https://github.com/sni/Thruk/issues/559
Nagios: 4.2.4 Livestatus: check-mk-livestatus-1.2.8p24 Thruk: thruk-2.06
This sometimes causes a segfault in Nagios. It's unclear to me which component in the stack is to blame here. I assume Thruk sends this command using livestatus?
My config:
Nagios integration:
Been discussing this issue with the nagios devs here: https://github.com/NagiosEnterprises/nagioscore/issues/391