naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

disabled host_checks and mod_gearman: seg_fault in 'is_host_member_of_hostgroup()' #131

Closed anfoe1111 closed 8 years ago

anfoe1111 commented 8 years ago

When the broker_module for mod_gearman (3.0b1) is enabled, naemon fails with a segfault in 'is_host_member_of_hostgroup()' when one of the configured hosts has its host checks disabled (check_interval 0). Without the mod_gearman broker enabled, naemon runs fine.

gdb Backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b71094 in is_host_member_of_hostgroup () from /usr/lib64/naemon/libnaemon.so.0
(gdb) bt
#0  0x00007ffff7b71094 in is_host_member_of_hostgroup () from /usr/lib64/naemon/libnaemon.so.0
#1  0x00007ffff66af459 in set_target_queue (hst=hst@entry=0x642bd0, svc=svc@entry=0x0) at neb_module_naemon/../neb_module/mod_gearman.c:1106
#2  0x00007ffff66aff14 in handle_host_check (event_type=<optimized out>, data=0x7fffffffd9b0) at neb_module_naemon/../neb_module/mod_gearman.c:607
#3  0x00007ffff7b6807f in neb_make_callbacks () from /usr/lib64/naemon/libnaemon.so.0
#4  0x00007ffff7b485cb in broker_host_check () from /usr/lib64/naemon/libnaemon.so.0
#5  0x00007ffff7b4ccc6 in ?? () from /usr/lib64/naemon/libnaemon.so.0
#6  0x00007ffff7b4d1b7 in ?? () from /usr/lib64/naemon/libnaemon.so.0
#7  0x00007ffff7b609b7 in ?? () from /usr/lib64/naemon/libnaemon.so.0
#8  0x00007ffff7b60ece in event_poll () from /usr/lib64/naemon/libnaemon.so.0
#9  0x0000000000403345 in main ()

catharsis commented 8 years ago

@anfoe1111 Which version of naemon are you running? If you're running the latest release (1.0.3), I'd recommend you try to reproduce the issue with the latest git version, since we've swapped out a bunch of the internal data structures since then.

anfoe1111 commented 8 years ago

@catharsis I use version 1.04, installed from the consol_labs testing repository. Do you recommend using the latest git version instead?

catharsis commented 8 years ago

That should be recent enough, I think - but like the testing name implies, it's not stable, and I'm not sure if mod_gearman has been updated to deal with the API changes since 1.0.3. The API/ABI stability levels are not very well defined right now, since we're still making a lot of changes to both, so it wouldn't surprise me if that is the cause of the error you're experiencing.

Maybe @sni has some insights on this?
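
To illustrate the API/ABI concern raised above: if a broker module is compiled against one struct layout and the core is later built with another, the module reads fields at stale offsets, which is one way a membership lookup could end up dereferencing garbage. The following is only a sketch. `host_v1`, `host_v2` and `layout_mismatch` are hypothetical names invented for this example and are not the real naemon structures.

```c
#include <stddef.h>

/* Hypothetical layout a module was compiled against (NOT the real naemon struct). */
struct host_v1 {
    char *name;
    void *hostgroups;       /* the module expects the pointer at this offset */
};

/* Hypothetical newer layout in the core after a data-structure change. */
struct host_v2 {
    char *name;
    double check_interval;  /* an inserted field shifts everything after it */
    void *hostgroups;
};

/* Returns 1 when the offset the module assumes no longer matches the core. */
int layout_mismatch(void)
{
    return offsetof(struct host_v1, hostgroups)
        != offsetof(struct host_v2, hostgroups);
}
```

If something like this were the cause, rebuilding the module against the same headers as the running core would be expected to make the offsets agree again.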

sni commented 8 years ago

Mod-Gearman uses the latest git version of naemon right now, so there should be no remaining changes. I was wondering why the mod-gearman callback is run at all if check_interval is zero.

anfoe1111 commented 8 years ago

In the meantime I did further testing: the segfault occurs even if check_interval is set to a non-zero value when using mod_gearman. If check_interval is set to zero, the error occurs immediately and naemon does not start up. If check_interval is set to something > 0, it takes some minutes before naemon crashes. The backtrace shows the same function every time (see above). I will now try to reduce our config as much as possible to find the pain point.
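
Since the backtrace above always ends in the membership lookup, one defensive pattern while narrowing down the config is a NULL-tolerant membership test. This is a minimal sketch under stated assumptions: `demo_host`, `demo_hostgroup` and `demo_is_member` are hypothetical stand-ins made up for illustration, and the real naemon types and the real `is_host_member_of_hostgroup()` may look quite different.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real naemon host and hostgroup structs differ. */
struct demo_host {
    const char *name;
};

struct demo_hostgroup {
    const char *group_name;
    struct demo_host **members;  /* may be NULL if never populated */
    size_t member_count;
};

/* NULL-tolerant membership test: returns 1 if hst is in grp, otherwise 0. */
int demo_is_member(const struct demo_hostgroup *grp, const struct demo_host *hst)
{
    size_t i;

    /* Refuse to dereference anything that was never set up. */
    if (grp == NULL || hst == NULL || grp->members == NULL)
        return 0;

    for (i = 0; i < grp->member_count; i++) {
        if (grp->members[i] == hst)
            return 1;
    }
    return 0;
}
```

A guard like this would only hide the symptom if the underlying problem is a module built against mismatched headers, so it is a debugging aid rather than a fix.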

catharsis commented 8 years ago

@anfoe1111 Have you made any progress in finding the culprit of this?

anfoe1111 commented 8 years ago

I'll continue testing this week and will give you an update asap.

anfoe1111 commented 8 years ago

Hi all, after reinstalling naemon, thruk and mod_gearman from the Consol-labs test-repo, and after some weeks of testing, I was not able to reproduce the error again. Thanks for your help and sorry for any inconvenience. This issue can be closed. Regards, Andreas