sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Nagios 4.3.1 crashes when using mod_gearman #110

Closed dan-m-joh closed 7 years ago

dan-m-joh commented 7 years ago

I have upgraded Nagios from 4.2.4 to 4.3.1 (luckily only on my development box) and now it crashes repeatedly with a SIGSEGV / SIGTERM (about once a minute). To me it looks like a problem that occurs when a broker module sends data "back" to Nagios.

I base this on the following facts:
1) If I disable mod_gearman in nagios.cfg, everything works OK.
2) If I enable mod_gearman in nagios.cfg but do not use it for host/service checks, everything works OK.
3) If I enable mod_gearman and use it for host/service checks, it starts crashing.

Sadly, the only things I can see in the Nagios log are:

Caught SIGSEGV, shutting down...
Caught SIGTERM, shutting down...

In the debug log I do not see anything strange. Here are my software releases:

OS: RHEL 7.3
Nagios 4.3.1 (built from source)
mod_gearman 3.0.1-1 (labs.consol.de)
gearmand 0.33-5 (labs.consol.de)

Running nagios under gdb I see the following when it crashes:

Program received signal SIGSEGV, Segmentation fault.
clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
2851    my_free(this_customvariablesmember->variable_name);
Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-26.el7.x86_64 gearmand-0.33-5.x86_64 glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 sssd-client-1.14.0-43.el7_3.11.x86_64
(gdb) bt
#0  clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
#1  0x00005555555916bc in clear_contact_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:3001
#2  0x00005555555918b7 in clear_volatile_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:2870
#3  0x00007ffff64aaa9e in handle_svc_check (event_type=<optimized out>, data=0x7fffffffda30) at neb_module_nagios4/../neb_module/mod_gearman.c:851
#4  0x000055555556bb2f in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7fffffffda30) at nebmods.c:529
#5  0x0000555555569f10 in broker_service_check (type=type@entry=704, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x555555e97310, check_type=check_type@entry=0,
    start_time=..., end_time=..., cmd=<optimized out>, latency=0, exectime=exectime@entry=0, timeout=timeout@entry=0, early_timeout=early_timeout@entry=0,
    retcode=retcode@entry=0, cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x0) at broker.c:326
#6  0x000055555557172f in run_async_service_check (svc=svc@entry=0x555555e97310, check_options=check_options@entry=0, latency=latency@entry=0.0008800000068731606,
    scheduled_check=scheduled_check@entry=1, reschedule_check=reschedule_check@entry=1, time_is_valid=time_is_valid@entry=0x7fffffffe29c,
    preferred_time=preferred_time@entry=0x7fffffffe2a8) at checks.c:199
#7  0x0000555555571cb1 in run_scheduled_service_check (svc=svc@entry=0x555555e97310, check_options=0, latency=latency@entry=0.0008800000068731606) at checks.c:90
#8  0x0000555555587adb in handle_timed_event (event=event@entry=0x555555e8fc20) at events.c:1171
#9  0x0000555555588623 in event_execution_loop () at events.c:1110
#10 0x0000555555568a56 in main (argc=<optimized out>, argv=<optimized out>) at nagios.c:814

I hope you see something there to help you find the issue. If you need more debugging info, I would be glad to help.

Regards, D/\N

sni commented 7 years ago

@hedenface do you want to have a look?

dan-m-joh commented 7 years ago

I have also done a diff between the Nagios headers that you ship for nagios4 and the "real" ones from nagios-4.3.1. Here is the result:

diff -r nagios4/macros.h nagios-4.3.1/include/macros.h
41c41
< #define MACRO_X_COUNT                         156     /* size of macro_x[] array */
---
> #define MACRO_X_COUNT                         157     /* size of macro_x[] array */
219a220
> #define MACRO_HOSTGROUPMEMBERADDRESSES          156
diff -r nagios4/nagios.h nagios-4.3.1/include/nagios.h
533c534
< void clear_service_flap(service *, double, double, double);   /* handles a service that has stopped flapping */
---
> void clear_service_flap(service *, double, double, double, int);      /* handles a service that has stopped flapping */
535c536
< void clear_host_flap(host *, double, double, double);         /* handles a host that has stopped flapping */
---
> void clear_host_flap(host *, double, double, double, int);            /* handles a host that has stopped flapping */
diff -r nagios4/nebstructs.h nagios-4.3.1/include/nebstructs.h
521a521
>       char            *longoutput;
diff -r nagios4/objects.h nagios-4.3.1/include/objects.h
34c34
< #define CURRENT_OBJECT_STRUCTURE_VERSION        402     /* increment when changes are made to data structures... */
---
> #define CURRENT_OBJECT_STRUCTURE_VERSION        403     /* increment when changes are made to data structures... */
diff -r nagios4/lib/libnagios.h nagios-4.3.1/lib/libnagios.h
24a25
> #include "nwrite.h"
diff -r nagios4/lib/runcmd.h nagios-4.3.1/lib/runcmd.h
105a106,113
>
> /**
>  * If you're using libnagios to execute a remote command, the
>  * static pid_t pids is not freed after runcmd_open
>  * You can call this function when you're sure pids is no longer
>  * in use, to keep down memory leaks
>  */
> extern void runcmd_free_pids(void);

D/\N
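
The MACRO_X_COUNT change in the diff above is one plausible link to the backtrace: if a module built against the older headers allocates a structure containing an array sized by MACRO_X_COUNT, and Nagios 4.3.1 then reads that structure using its own layout, every field after the array sits one pointer further along than the module expects. The Python sketch below illustrates only that offset shift, using ctypes. The structure and its field names (`x`, `trailing_ptr`) are simplified, hypothetical stand-ins, not the real Nagios definitions.

```python
import ctypes


def make_macros_struct(macro_x_count):
    """Return a hypothetical struct: an array of pointers sized by
    macro_x_count, followed by a single pointer field."""

    class Macros(ctypes.Structure):
        _fields_ = [
            ("x", ctypes.c_void_p * macro_x_count),
            ("trailing_ptr", ctypes.c_void_p),
        ]

    return Macros


# Layout as seen by a module built against the bundled headers (156)
# and by a core built from the 4.3.1 sources (157), per the diff above.
ModuleView = make_macros_struct(156)
CoreView = make_macros_struct(157)

pointer_size = ctypes.sizeof(ctypes.c_void_p)
offset_shift = CoreView.trailing_ptr.offset - ModuleView.trailing_ptr.offset
size_growth = ctypes.sizeof(CoreView) - ctypes.sizeof(ModuleView)

print("pointer size:", pointer_size)
print("offset shift of trailing_ptr:", offset_shift)
print("struct size growth:", size_growth)
```

In this simplified layout the shift equals one pointer width, so code using the larger layout would look for `trailing_ptr` exactly where the smaller structure ends, and would read whatever happens to be in memory there.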

hedenface commented 7 years ago

I'll take a look today. @dan-m-joh Can I see your contact definitions, please?

dan-m-joh commented 7 years ago

Of course you can... (email redacted)

###############################################################################
###############################################################################
#
# CONTACTS
#
###############################################################################
###############################################################################

define contact{
        name                            generic-contact
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   notify-service-by-email 
        host_notification_commands      notify-host-by-email    
        register                        0
        }

define contact{
        contact_name                    nagiosadmin
        use                             generic-contact
        alias                           Nagios Admin
        email                           my.email@comp.org
        }

###############################################################################
###############################################################################
#
# CONTACT GROUPS
#
###############################################################################
###############################################################################

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }

hedenface commented 7 years ago

This looks like it may be a Core bug. I was able to replicate it with both the pre-built and the compiled-from-source ModGearman modules. Keep this issue open if you want, and I'll post the relevant fix when/if discovered.

dan-m-joh commented 7 years ago

Great to hear, not that we have a bug, but that you could replicate it. Now I at least know that it is not just in my environment. OK, I'll keep this open and wait for feedback.

D/\N

BoomerET commented 7 years ago

We've had this issue for months. We're testing moving to Naemon, but sure wish this would work with Nagios 4 Core.

hedenface commented 7 years ago

@dan-m-joh Did you by chance happen to compile mod-gearman with the proper Nagios header? I'll get it set up on Wednesday and try and get this thing fixed.

dan-m-joh commented 7 years ago

No, sorry, I have not had a chance to test with the "new" Nagios headers. Is it as simple as copying the "new" Nagios headers into the nagios4 header directory?

dan-m-joh commented 7 years ago

F.Y.I. Compiling mod_gearman with the Nagios-4.3.2 headers seems to fix the issue for me. I replaced all headers in include/ and include/lib/ (except epn_utils.h) with the ones from the Nagios sources. I will let it run on my test rig for a few days, then I will update my production rig.

D/\N
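
A quick way to spot this kind of header drift before building is to compare a few integer `#define` values between the bundled headers and the installed ones. The Python sketch below does that for header text passed in as strings. The function names are hypothetical, the sample lines are copied from the diff earlier in this thread, and matching constants would not by themselves prove that the ABI is identical.

```python
import re

# Matches lines such as: #define MACRO_X_COUNT   156   /* comment */
DEFINE_RE = re.compile(
    r"^[ \t]*#[ \t]*define[ \t]+(\w+)[ \t]+(\d+)\b", re.MULTILINE
)


def read_int_defines(header_text):
    """Collect NAME -> integer value for simple decimal #define lines."""
    return {name: int(value) for name, value in DEFINE_RE.findall(header_text)}


def differing_defines(bundled_text, installed_text, names):
    """Return {name: (bundled, installed)} for each name whose value differs.
    A name missing on one side is reported with None on that side."""
    bundled = read_int_defines(bundled_text)
    installed = read_int_defines(installed_text)
    return {
        name: (bundled.get(name), installed.get(name))
        for name in names
        if bundled.get(name) != installed.get(name)
    }


# Sample lines taken from the header diff posted in this thread.
BUNDLED = """
#define MACRO_X_COUNT                         156     /* size of macro_x[] array */
#define CURRENT_OBJECT_STRUCTURE_VERSION        402     /* increment when changes are made to data structures... */
"""
INSTALLED = """
#define MACRO_X_COUNT                         157     /* size of macro_x[] array */
#define CURRENT_OBJECT_STRUCTURE_VERSION        403     /* increment when changes are made to data structures... */
"""

WATCHED = ("MACRO_X_COUNT", "CURRENT_OBJECT_STRUCTURE_VERSION")
mismatches = differing_defines(BUNDLED, INSTALLED, WATCHED)
for name, (old, new) in mismatches.items():
    print(f"{name}: bundled={old} installed={new}")
```

In practice the two strings would be read from files such as the bundled `macros.h` and `objects.h` and their counterparts in the Nagios source tree; any reported mismatch is a hint to rebuild against the installed headers.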

rcgreenw commented 6 years ago

Was this ever fixed? I know it is closed, but there was no comment on the closing. I'm getting the same behavior with the following:

OS: CentOS 6.9
Nagios 4.3.4 (EPEL RPMs)
mod_gearman 3.0.6.20170929 (ConSol Labs RPMs)
gearmand 0.33-6 (ConSol Labs RPMs)

It happened with mod_gearman 3.0.6 from the stable repo too, so I moved to the testing repo to see if it was fixed there. Everything works fine until I enable active checks, then it dies with SIGSEGV.

hedenface commented 6 years ago

@rcgreenw I believe the problem is the headers that were used to compile the binaries in the package you mention. What happens if you compile using the Nagios 4.3.4 headers? I suspect the issue will go away.

rcgreenw commented 6 years ago

I haven't had a chance to try that yet; the machine really isn't set up for development. I was hoping for updated packages so I wouldn't have to build my own. I'll see if I can get everything needed to build it installed. Thanks.

smallsam commented 6 years ago

We have a similar setup to rcgreenw, in terms of RPM package sources. What's the recommended solution here given we want to upgrade easily with RPMs? Can mod_gearman be enhanced to deal with nagios 4.3.x automatically? It sounds like one of the best options in order to maintain automatic RPM patching is to move to naemon, unless mod_gearman can be patched.

rcgreenw commented 6 years ago

I was able to get an RPM built with minor modifications. I pulled from git, then removed the include/nagios4 directory and replaced it with a symlink to /usr/include/nagios (from the nagios-devel rpm). Then, I did an rpmbuild using the spec file in the support directory. There is a copy of the rpm here, but don't count on updates in the future.

http://mirror.tausd.org/tausd/RHEL/6/tausd/x86_64/mod_gearman-3.0.5-9.1.el6.x86_64.rpm

sni commented 6 years ago

How about changing the configure script to detect /usr/include/nagios and only use the shipped nagios4 folder as a fallback? And I am open to pull requests to update the nagios4 folder as well.
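
The fallback logic suggested here is small. The real change would live in the configure script itself; the Python sketch below only illustrates the decision, with a hypothetical function name and the assumption that the presence of `nagios.h` is a good enough marker for a usable header directory.

```python
import tempfile
from pathlib import Path


def pick_header_dir(system_dir, bundled_dir, marker="nagios.h"):
    """Prefer system_dir when it contains the marker header,
    otherwise fall back to the bundled copy."""
    system_dir = Path(system_dir)
    if (system_dir / marker).is_file():
        return system_dir
    return Path(bundled_dir)


# Demonstration in a scratch directory, so no real system paths are touched.
with tempfile.TemporaryDirectory() as tmp:
    system = Path(tmp) / "usr" / "include" / "nagios"
    bundled = Path(tmp) / "include" / "nagios4"
    bundled.mkdir(parents=True)

    # No system headers yet: the bundled directory is chosen.
    before = pick_header_dir(system, bundled)

    # Once a devel package has "installed" nagios.h, the system directory wins.
    system.mkdir(parents=True)
    (system / "nagios.h").write_text("/* placeholder */\n")
    after = pick_header_dir(system, bundled)

    print(before.name, "->", after.name)
```

Preferring the installed headers means a locally built module follows whatever Nagios version is on the machine, while the bundled copy still lets the build succeed where no devel package exists.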

smallsam commented 6 years ago

It sounds like mod_gearman no longer supports Nagios Core now that Nagios Core has changed its interface. I see a few options:

  1. Build a different module for Naemon and Nagios 4.x, as statusengine has done with its module: https://github.com/statusengine/module/tree/master/src. The binary releases for mod_gearman could then package and distribute differently named binaries for Naemon, Nagios, etc.
  2. Drop support for nagios core.
  3. @sni's suggestion, user can compile mod_gearman against headers of their choice.

I'd prefer 1, because I tend to avoid compiling software and encourage sysadmins to use supported binary repositories whenever possible (e.g. ConSol Labs' yum repo).

A cursory look at the folders in the repo suggests you already have some structure to support different NEB module versions. Perhaps this could be an extension of that structure to support the new Nagios interface?

sni commented 6 years ago

  1. That's the case already. We already build 3 NEB modules, for Nagios 3, Nagios 4 and Naemon. Nagios 3 does not change anymore, so that's easy: we just ship the headers and build against them. Naemon is easy as well: there is a naemon-devel package containing the headers and it just works. Nagios 4 is difficult and error-prone because there is no nagios4-devel package available for all supported systems. So we need to ship headers again, but this breaks as soon as the ABI changes. Right now, the only way for Nagios 4 is to compile the module yourself with the headers from your setup. Well, or switch to Naemon.