naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

SIGSEGV #31

Closed ofalk closed 9 years ago

ofalk commented 10 years ago

Hi!

Today I tried the latest (testing) version of omd-nc (new cores), which includes naemon. Unfortunately, after a minute or so, naemon terminates with a SIGSEGV:

[1390996174] Caught SIGSEGV, shutting down...

I guess this will be quite hard to reproduce for you, but I'm totally willing to support you in any way that doesn't (really) compromise my security!

Best, Oliver

sni commented 10 years ago

What kind of NEB modules do you have loaded?

ofalk commented 10 years ago

Hi! There is nothing except gearman. It's an OMD installation.

catharsis commented 10 years ago

@ofalk From your description, it sounds like it's reproducible enough. Think you could run it through gdb and get us a backtrace?

A coredump would, of course, be even better - but I guess that'd be hard to justify if you've got sensitive data in your system.

sni commented 10 years ago

I assume you have livestatus loaded too, since you are on OMD? Then it's probably a known issue with external commands when livestatus is combined with other NEB modules.

sni commented 10 years ago

This should be fixed now. Livestatus now uses the Queryhandler to submit commands. Please verify whether it still fails.

ofalk commented 10 years ago

It still dies with SIGSEGV. Just tried with omd-1.01-nc.20140129-rh61-32.i386.rpm

ofalk commented 10 years ago

The last few lines (out of strace):

[pid 26402] time(NULL) = 1391673218
[pid 26402] gettimeofday({1391673218, 730286}, NULL) = 0
[pid 26402] gettimeofday({1391673218, 730507}, NULL) = 0
[pid 26402] gettimeofday({1391673218, 730702}, NULL) = 0
[pid 26402] gettimeofday({1391673218, 730878}, NULL) = 0
[pid 26402] time(NULL) = 1391673218
[pid 26402] time(NULL) = 1391673218
[pid 26402] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 26402] time(NULL) = 1391673218
[pid 26402] write(5, "[1391673218] Caught SIGSEGV, shutting down...\n", 46) = 46
[pid 26402] gettimeofday({1391673218, 732218}, NULL) = 0
[pid 26402] sigreturn() = ? (mask now [])
[pid 26402] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 26402] write(13, "=0.729ms;;;; \n\tHOSTCHECKCOMMAND::check-host-alive!(null)\tHOSTSTATE::0\tHOSTSTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::XXX\tSERVICEDESC::Updates\tSERVICEPERFDATA::total_updates=0;0;0 security_updates=0;0;0\n\tSERVICECHECKCOMMAND::check_yumupdates\tSERVICESTATE::0\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::YYY\tSERVICEDESC::VMFS\tSERVICEPERFDATA::DS_VMFS1=1007552.00MB;; DS_VMFS2=1051093.00MB;; DS_VMFS3=114190.00MB;; DS_VMFS4=1052382.00MB;; DS_VMFS_XXX=413753.00MB;; DS_VMFS_XXX2=189374.00MB;; DS_VMFS5=95775.00MB;; VMFS_backup1=148530.00MB;; DS_VMFS6=103920.00MB;; VMFS_local_esx06=831280.00MB;; DS_VMFS7=602473.00MB;;\n\tSERVICECHECKCOMMAND::check_esx!-D $HOSTADDRESS$ -l vmfs\tSERVICESTATE::0\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::XXX\tSERVICEDESC::DskUsg/boot\tSERVICEPERFDATA::usg=79.69;90;95;0; usgABS=118534;133867.8;141304.9;0;\n\tSERVICECHECKCOMMAND::check_snmp_dskusg!/boot!90!95\tSERVICESTATE::0\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::ZZZ\tSERVICEDESC::MySQL-tmp-disk-tables\tSERVICEPERFDATA::pct_tmp_table_on_disk=99.82%;25;50 pct_tmp_table_on_disk_now=100.00%\n\tSERVICECHECKCOMMAND::check_mysql_health!--mode tmp-disk-tables\tSERVICESTATE::2\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::AAA\tSERVICEDESC::ISL-0\tSERVICEPERFDATA::stat_wtx=4195436;0;0;0;0 stat_wrx=7482972;0;0;0;0 stat_ftx=246781;0;0;0;0 stat_frx=493438;0;0;0;0 er_enc_in=0;0;0;0;0 er_crc=0;0;0;0;0 er_trunc=0;0;0;0;0 er_toolong=0;0;0;0;0 er_bad_eof=0;0;0;0;0 er_enc_out=0;0;0;0;0 er_c3_timeout=0;0;0;0;0\tSERVICECHECKCOMMAND::check_snmp_brocade_fcport!3\tSERVICESTATE::0\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::YYY\tSERVICEDESC::Runtime Listhost\tSERVICEPERFDATA::hostcount=3units;;\n\tSERVICECHECKCOMMAND::check_esx!-D $HOSTADDRESS$ -l runtime -s listhost\tSERVICESTATE::0\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::ZZZ\tSERVICEDESC::traps\tSERVICEPERFDATA::rta=6.633ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=19.808ms;;;; rtmin=1.501ms;;;; \n\tSERVICECHECKCOMMAND::check-host-alive\tSERVICESTATE::0\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::XXX\tSERVICEDESC::IOS Config\tSERVICEPERFDATA::size=1078B\n\tSERVICECHECKCOMMAND::check_cisco_config\tSERVICESTATE::0\tSERVICESTATETYPE::1\nDATATYPE::SERVICEPERFDATA\tTIMET::1391673218\tHOSTNAME::BBB\tSERVICEDESC::TCP stats\tSERVICEPERFDATA::'TCP stats'=61637c TCP-MIB::tcpPassiveOpens.0=693716c TCP-MIB::tcpInSegs.0=18214605c TCP-MIB::tcpOutSegs.0=20288240c TCP-MIB::tcpRetransSegs.0=138869c \n\tSERVICECHECKCOMMAND::tcp_stats\tSERVICESTATE::0\tSERVICESTATETYPE::1\n", 2737) = 2737

(partially obfuscated).

ofalk commented 10 years ago

backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x0091b503 in strchr () from /lib/libc.so.6
(gdb) bt
#0  0x0091b503 in strchr () from /lib/libc.so.6
#1  0x080594c9 in parse_output (buf=0xb4e707e8 "", check_output=0x9e9fa90) at checks.c:2926
#2  0x0805971b in parse_check_output (buf=0xb4e707e8 "", short_output=0x9eda274, long_output=0x9eda278, perf_data=0x9eda27c, escape_newlines_please=1, newlines_are_escaped=0) at checks.c:3000
#3  0x0805e1a1 in handle_async_service_check_result (temp_service=0x9eda190, queued_check_result=0xb4e73908) at checks.c:427
#4  0x0809293d in process_check_result (cr=0xb4e73908) at utils.c:1895
#5  0x005322d2 in handle_timed_events () from /omd/sites/prod/lib/mod_gearman/mod_gearman2.o
#6  0x080805a7 in neb_make_callbacks (callback_type=1, data=0xbffacc18) at nebmods.c:518
#7  0x08058b29 in broker_timed_event (type=202, flags=0, attr=0, event=0x9e539a8, timestamp=0x0) at broker.c:65
#8  0x0807198d in handle_timed_event (event=0x9e539a8) at events.c:1127
#9  0x080753f3 in event_execution_loop () at events.c:1088
#10 0x0807f923 in main (argc=3, argv=0xbffad014) at naemon.c:768
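
The backtrace places the fault inside strchr(), called from parse_output() while a check result handed over by mod_gearman was being processed. It does not show which pointer strchr() received, and the buf argument in frame 1 is an empty string rather than NULL, so the exact cause is not visible here. Purely as an illustration of this class of crash, and not as naemon's actual parse_output(), here is a minimal sketch that splits a line of plugin output at the first '|' and rejects a NULL pointer before searching. The names split_plugin_output and struct split_output are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type, not taken from naemon: both fields point
 * into the caller's buffer after a successful split. */
struct split_output {
    char *short_output; /* text before '|', or NULL if buf was NULL */
    char *perf_data;    /* text after '|', or NULL if there is none */
};

/* Hypothetical helper: split buf in place at the first '|'.
 * Returns 0 on success and -1 if buf or out is NULL. */
int split_plugin_output(char *buf, struct split_output *out)
{
    char *sep;

    if (out == NULL)
        return -1;
    out->short_output = NULL;
    out->perf_data = NULL;

    /* Passing NULL to strchr() is undefined behaviour and usually
     * ends in a segmentation fault, so check before searching. */
    if (buf == NULL)
        return -1;

    out->short_output = buf;
    sep = strchr(buf, '|');
    if (sep != NULL) {
        *sep = '\0';              /* terminate the short output */
        out->perf_data = sep + 1; /* remainder is the perfdata */
    }
    return 0;
}
```

With a writable buffer holding "OK - load fine|load1=0.5", the call returns 0 and leaves short_output pointing at "OK - load fine" and perf_data at "load1=0.5". With a NULL buffer it returns -1 instead of crashing.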

topinet commented 10 years ago

Same issue, easily reproducible on Debian Wheezy using labs.consol.de repository.

Versions are:

ii gearman-job-server 0.33-2 amd64 Job server for the Gearman distributed job queue
ii libgearman7 0.33-1 amd64 Library providing Gearman client and worker functions
ii mod-gearman-module 1.4.14 amd64 Event broker module to distribute service checks.
ii mod-gearman-tools 1.4.14 amd64 Tools for mod-gearman
ii naemon 0.8.1-20140425 amd64 A host/service/network monitoring and management system
ii naemon-core 0.8.1-20140425 amd64 contains the Naemon core
ii naemon-livestatus 0.8.1-20140425 amd64 contains the Naemon livestatus eventbroker module
ii naemon-thruk 0.8.1-20140425 amd64 This package contains the thruk gui for Naemon
ii naemon-thruk-libs 0.8.1-20140425 amd64 This package contains the thruk gui for Naemon
ii naemon-thruk-reporting 0.8.1-20140425 amd64 This package contains the reporting addon for naemons thruk gui useful for
ii naemon-tools 0.8.1-20140425 amd64 contains tools for the Naemon core

Is it supposed to be fixed in this version?

pengi commented 10 years ago

I've always interpreted this as a mod_gearman issue, and thus ignored it. But shouldn't we close it, and point to sni/mod_gearman instead?