dan-m-joh closed this issue 7 years ago.
@hedenface do you want to have a look?
I have also diffed the Nagios headers you ship for nagios4 against the "real" ones from nagios-4.3.1. Here is the result:
diff -r nagios4/macros.h nagios-4.3.1/include/macros.h
41c41
< #define MACRO_X_COUNT 156 /* size of macro_x[] array */
---
> #define MACRO_X_COUNT 157 /* size of macro_x[] array */
219a220
> #define MACRO_HOSTGROUPMEMBERADDRESSES 156
diff -r nagios4/nagios.h nagios-4.3.1/include/nagios.h
533c534
< void clear_service_flap(service *, double, double, double); /* handles a service that has stopped flapping */
---
> void clear_service_flap(service *, double, double, double, int); /* handles a service that has stopped flapping */
535c536
< void clear_host_flap(host *, double, double, double); /* handles a host that has stopped flapping */
---
> void clear_host_flap(host *, double, double, double, int); /* handles a host that has stopped flapping */
diff -r nagios4/nebstructs.h nagios-4.3.1/include/nebstructs.h
521a521
> char *longoutput;
diff -r nagios4/objects.h nagios-4.3.1/include/objects.h
34c34
< #define CURRENT_OBJECT_STRUCTURE_VERSION 402 /* increment when changes are made to data structures... */
---
> #define CURRENT_OBJECT_STRUCTURE_VERSION 403 /* increment when changes are made to data structures... */
diff -r nagios4/lib/libnagios.h nagios-4.3.1/lib/libnagios.h
24a25
> #include "nwrite.h"
diff -r nagios4/lib/runcmd.h nagios-4.3.1/lib/runcmd.h
105a106,113
>
> /**
> * If you're using libnagios to execute a remote command, the
> * static pid_t pids is not freed after runcmd_open
> * You can call this function when you're sure pids is no longer
> * in use, to keep down memory leaks
> */
> extern void runcmd_free_pids(void);
D/\N
I'll take a look today. @dan-m-joh Can I see your contact definitions, please?
Of course you can... (email redacted)
###############################################################################
###############################################################################
#
# CONTACTS
#
###############################################################################
###############################################################################
define contact{
name generic-contact
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r,f,s
host_notification_options d,u,r,f,s
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
register 0
}
define contact{
contact_name nagiosadmin
use generic-contact
alias Nagios Admin
email my.email@comp.org
}
###############################################################################
###############################################################################
#
# CONTACT GROUPS
#
###############################################################################
###############################################################################
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagiosadmin
}
This looks like it may be a Core bug. I was able to replicate it with both pre-built and source-compiled Mod-Gearman modules. Keep this issue open if you want, and I'll post the relevant fix when/if I find it.
Great to hear; not that we have a bug, but that you could replicate it. Now at least I know that it is not just my environment. OK, I'll keep this open and wait for feedback.
D/\N
We've had this issue for months. We're testing moving to Naemon, but sure wish this would work with Nagios 4 Core.
@dan-m-joh Did you by chance compile mod-gearman with the proper Nagios headers? I'll get it set up on Wednesday and try to get this fixed.
No, sorry, I haven't had a chance to test with the "new" Nagios headers. Is it as simple as copying the "new" Nagios headers into the nagios4 header directory?
FYI: compiling mod_gearman with the Nagios 4.3.2 headers (replacing all headers in include/ and include/lib/, except epn_utils.h, with the ones from the Nagios sources) seems to fix the issue for me. I will let it run on my test rig for a few days, then I will update my production rig.
D/\N
Was this ever fixed? I know it is closed, but there was no comment on the closing. I'm getting the same behavior with the following:
CentOS 6.9, Nagios 4.3.4 (EPEL RPMs), mod_gearman 3.0.6.20170929 (ConSol Labs RPMs), gearmand 0.33-6 (ConSol Labs RPMs)
It also happened with mod_gearman 3.0.6 from the stable repo; I moved to the testing repo to see if it was fixed. Everything works fine until I enable active checks, then Nagios dies with SIGSEGV.
@rcgreenw, I believe the problem is the headers used to compile the binaries in the package you mention. What happens if you compile using the Nagios 4.3.4 headers? I suspect the issue will go away.
I haven't had a chance to try that yet, the machine really isn't set up for development. I was hoping for updated packages so I wouldn't have to build my own. I'll see if I can get everything needed to build it installed. Thanks.
We have a similar setup to rcgreenw, in terms of RPM package sources. What's the recommended solution here given we want to upgrade easily with RPMs? Can mod_gearman be enhanced to deal with nagios 4.3.x automatically? It sounds like one of the best options in order to maintain automatic RPM patching is to move to naemon, unless mod_gearman can be patched.
I was able to get an RPM built with minor modifications. I pulled from git, then removed the include/nagios4 directory and replaced it with a symlink to /usr/include/nagios (from the nagios-devel rpm). Then, I did an rpmbuild using the spec file in the support directory. There is a copy of the rpm here, but don't count on updates in the future.
http://mirror.tausd.org/tausd/RHEL/6/tausd/x86_64/mod_gearman-3.0.5-9.1.el6.x86_64.rpm
How about changing the configure script to detect /usr/include/nagios and only use the shipped nagios4 folder as a fallback? I am also open to pull requests that update the nagios4 folder.
It sounds like mod_gearman no longer supports Nagios Core now that Nagios Core has changed its interface. I see a few options:
I'd prefer option 1, because I tend to avoid compiling software and encourage sysadmins to use supported binary repositories whenever possible (e.g. ConSol Labs' yum repo).
A cursory look at the folders in the repo suggests you already have some structure to support different NEB module versions; perhaps this could be extended to support the new Nagios interface?
I have upgraded Nagios from 4.2.4 to 4.3.1 (luckily only on my development box), and now it repeatedly crashes with SIGSEGV / SIGTERM (about once a minute). To me it looks like a problem when a broker module sends data "back" to Nagios.
I base this on the following facts: 1) If I disable mod_gearman in nagios.cfg, everything works OK. 2) If I enable mod_gearman in nagios.cfg but do not use it for host/service checks, everything works OK. 3) If I enable mod_gearman and use it for host/service checks, Nagios starts crashing.
Sadly, the only thing I can see in the Nagios log is:
Caught SIGSEGV, shutting down...
Caught SIGTERM, shutting down...
I do not see anything strange in the debug log. Here are my software releases: OS: RHEL 7.3, Nagios 4.3.1 (built from source), mod_gearman 3.0.1-1 (labs.consol.de), gearmand 0.33-5 (labs.consol.de).
Running nagios under gdb I see the following when it crashes:
Program received signal SIGSEGV, Segmentation fault.
clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
2851 my_free(this_customvariablesmember->variable_name);
Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-26.el7.x86_64 gearmand-0.33-5.x86_64 glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 sssd-client-1.14.0-43.el7_3.11.x86_64
(gdb) bt
0 clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
1 0x00005555555916bc in clear_contact_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:3001
2 0x00005555555918b7 in clear_volatile_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:2870
3 0x00007ffff64aaa9e in handle_svc_check (event_type=, data=0x7fffffffda30) at neb_module_nagios4/../neb_module/mod_gearman.c:851
4 0x000055555556bb2f in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7fffffffda30) at nebmods.c:529
5 0x0000555555569f10 in broker_service_check (type=type@entry=704, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x555555e97310, check_type=check_type@entry=0,
6 0x000055555557172f in run_async_service_check (svc=svc@entry=0x555555e97310, check_options=check_options@entry=0, latency=latency@entry=0.0008800000068731606,
7 0x0000555555571cb1 in run_scheduled_service_check (svc=svc@entry=0x555555e97310, check_options=0, latency=latency@entry=0.0008800000068731606) at checks.c:90
8 0x0000555555587adb in handle_timed_event (event=event@entry=0x555555e8fc20) at events.c:1171
9 0x0000555555588623 in event_execution_loop () at events.c:1110
10 0x0000555555568a56 in main (argc=, argv=) at nagios.c:814
I hope you see something there to help you find the issue. If you need more debugging info, I would be glad to help.
Regards, D/\N