sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Mod-Gearman NEB is not reconnecting after gearmand restart #73

Closed (c-kr closed 9 years ago)

c-kr commented 9 years ago

Mod-Gearman is working fine if we start the daemons in the right order:

2015-02-23 13:44:29  -  127.0.0.1:4730  -  v0.33
 Queue Name     | Worker Available | Jobs Waiting | Jobs Running
-----------------------------------------------------------------
 check_results  |               1  |           0  |           0
 eventhandler   |              21  |           0  |           0
 host           |              21  |           0  |           0
 service        |              21  |           0  |           6
 worker_xxxxx   |               1  |           0  |           0
-----------------------------------------------------------------

But if we restart gearmand after Icinga has been started, the Mod-Gearman NEB module does not reconnect to gearmand:

2015-02-23 13:45:30  -  127.0.0.1:4730  -  v0.33
 Queue Name     | Worker Available | Jobs Waiting | Jobs Running
-----------------------------------------------------------------
 check_results  |               0  |           2  |           0
 eventhandler   |              20  |           0  |           0
 host           |              20  |           0  |           0
 service        |              20  |           0  |           0
 worker_xxxxx   |               1  |           0  |           0
-----------------------------------------------------------------

And no checks are executed. After an Icinga restart everything is fine again.
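For anyone who wants to detect this state automatically, here is a minimal sketch. It assumes the text comes from gearmand's `status` admin command, which replies with one tab-separated line per queue (name, total jobs, running jobs, available workers) and ends with a single dot. The names `QueueStatus`, `parse_status` and `stalled` are made up for this example and are not part of Mod-Gearman.

```python
from typing import List, NamedTuple


class QueueStatus(NamedTuple):
    name: str
    waiting: int  # jobs queued but not yet running (total - running)
    running: int
    workers: int  # workers currently registered for this queue


def parse_status(text: str) -> List[QueueStatus]:
    """Parse the reply to gearmand's `status` admin command."""
    queues = []
    for line in text.splitlines():
        line = line.strip()
        if line == ".":  # a lone dot terminates the reply
            break
        if not line:
            continue
        name, total, running, workers = line.split("\t")
        queues.append(
            QueueStatus(name, int(total) - int(running), int(running), int(workers))
        )
    return queues


def stalled(queues: List[QueueStatus]) -> List[str]:
    """Queues that have jobs waiting but nobody left to pick them up."""
    return [q.name for q in queues if q.workers == 0 and q.waiting > 0]


# Hand-written sample mirroring the second table above (not a captured reply).
sample = "check_results\t2\t0\t0\nhost\t0\t0\t20\nservice\t0\t0\t20\n.\n"
print(stalled(parse_status(sample)))
```

With the numbers from the second table, only `check_results` is reported: results are queued, but the NEB module is no longer registered as a worker for that queue.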

We use Icinga 1.12.0 / SLES 11 SP 3 with:

I am a long-time Nagios <-> Mod-Gearman user and have never had an issue like this. So I think it is either a problem with Icinga <-> Mod-Gearman or with a new version of Mod-Gearman / gearmand.

c-kr commented 9 years ago

The bug seems related to #74. I did some research before I found the other ticket. Maybe it helps to find the bug:

Small Host / Service CFG (3 Hosts / 15 Services):

Everything seems ok...

Large Host / Service CFG (+250 Hosts / +2000 Services):

/etc/init.d/gearman.d/icingat3 stop
*** glibc detected *** /usr/bin/mod_gearman_worker: double free or corruption (!prev): 0x000000000065ba40 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76628)[0x7fdc4db67628]
/lib64/libc.so.6(cfree+0x6c)[0x7fdc4db6c5cc]
/usr/bin/mod_gearman_worker(add_job_to_queue+0x47d)[0x4090e7]
/usr/bin/mod_gearman_worker(add_job_to_queue+0x43b)[0x4090a5]
/usr/bin/mod_gearman_worker(send_result_back+0x3c1)[0x40b3b3]
/usr/bin/mod_gearman_worker(do_exec_job+0x3d4)[0x41140b]
/usr/bin/mod_gearman_worker(get_job+0x5c2)[0x4119d5]
/usr/lib64/libgearman.so.6(_ZN10FunctionV18callbackEP14gearman_job_stPv+0x53)[0x7fdc4eaf37b3]
/usr/lib64/libgearman.so.6(gearman_worker_work+0x165)[0x7fdc4eaf9f95]
/usr/bin/mod_gearman_worker(worker_loop+0x7e)[0x411ac5]
/usr/bin/mod_gearman_worker(worker_client+0x19a)[0x411d86]
/usr/bin/mod_gearman_worker(make_new_child+0x12b)[0x4132c5]
/usr/bin/mod_gearman_worker(check_worker_population+0x215)[0x413540]
/usr/bin/mod_gearman_worker(monitor_loop+0x13)[0x4135e9]
/usr/bin/mod_gearman_worker(main+0x31d)[0x413908]
/lib64/libc.so.6(__libc_start_main+0xe6)[0x7fdc4db0fc36]
/usr/bin/mod_gearman_worker[0x405b49]
======= Memory map: ========
00400000-00420000 r-xp 00000000 08:01 439539                             /usr/bin/mod_gearman_worker
00620000-00621000 r--p 00020000 08:01 439539                             /usr/bin/mod_gearman_worker
00621000-00622000 rw-p 00021000 08:01 439539                             /usr/bin/mod_gearman_worker
00622000-006bf000 rw-p 00000000 00:00 0                                  [heap]
006bf000-00852000 rw-p 00000000 00:00 0                                  [heap]
7fdc48000000-7fdc48021000 rw-p 00000000 00:00 0
7fdc48021000-7fdc4c000000 ---p 00000000 00:00 0
7fdc4cfba000-7fdc4cfc6000 r-xp 00000000 08:01 273807                     /lib64/libnss_files-2.11.3.so
7fdc4cfc6000-7fdc4d1c5000 ---p 0000c000 08:01 273807                     /lib64/libnss_files-2.11.3.so
7fdc4d1c5000-7fdc4d1c6000 r--p 0000b000 08:01 273807                     /lib64/libnss_files-2.11.3.so
7fdc4d1c6000-7fdc4d1c7000 rw-p 0000c000 08:01 273807                     /lib64/libnss_files-2.11.3.so
7fdc4d1c7000-7fdc4d1dc000 r-xp 00000000 08:01 277382                     /lib64/libgcc_s.so.1
7fdc4d1dc000-7fdc4d3db000 ---p 00015000 08:01 277382                     /lib64/libgcc_s.so.1
7fdc4d3db000-7fdc4d3dc000 r--p 00014000 08:01 277382                     /lib64/libgcc_s.so.1
7fdc4d3dc000-7fdc4d3dd000 rw-p 00015000 08:01 277382                     /lib64/libgcc_s.so.1
7fdc4d3dd000-7fdc4d4c5000 r-xp 00000000 08:01 434117                     /usr/lib64/libstdc++.so.6.0.17
7fdc4d4c5000-7fdc4d6c4000 ---p 000e8000 08:01 434117                     /usr/lib64/libstdc++.so.6.0.17
7fdc4d6c4000-7fdc4d6cc000 r--p 000e7000 08:01 434117                     /usr/lib64/libstdc++.so.6.0.17
7fdc4d6cc000-7fdc4d6ce000 rw-p 000ef000 08:01 434117                     /usr/lib64/libstdc++.so.6.0.17
7fdc4d6ce000-7fdc4d6e3000 rw-p 00000000 00:00 0
7fdc4d6e3000-7fdc4d6eb000 r-xp 00000000 08:01 273811                     /lib64/librt-2.11.3.so
7fdc4d6eb000-7fdc4d8ea000 ---p 00008000 08:01 273811                     /lib64/librt-2.11.3.so
7fdc4d8ea000-7fdc4d8eb000 r--p 00007000 08:01 273811                     /lib64/librt-2.11.3.so
7fdc4d8eb000-7fdc4d8ec000 rw-p 00008000 08:01 273811                     /lib64/librt-2.11.3.so
7fdc4d8ec000-7fdc4d8f0000 r-xp 00000000 08:01 273816                     /lib64/libuuid.so.1.3.0
7fdc4d8f0000-7fdc4daef000 ---p 00004000 08:01 273816                     /lib64/libuuid.so.1.3.0
7fdc4daef000-7fdc4daf0000 r--p 00003000 08:01 273816                     /lib64/libuuid.so.1.3.0
7fdc4daf0000-7fdc4daf1000 rw-p 00004000 08:01 273816                     /lib64/libuuid.so.1.3.0
7fdc4daf1000-7fdc4dc5f000 r-xp 00000000 08:01 269815                     /lib64/libc-2.11.3.so
7fdc4dc5f000-7fdc4de5e000 ---p 0016e000 08:01 269815                     /lib64/libc-2.11.3.so
7fdc4de5e000-7fdc4de62000 r--p 0016d000 08:01 269815                     /lib64/libc-2.11.3.so
7fdc4de62000-7fdc4de63000 rw-p 00171000 08:01 269815                     /lib64/libc-2.11.3.so
7fdc4de63000-7fdc4de68000 rw-p 00000000 00:00 0
7fdc4de68000-7fdc4de74000 r-xp 00000000 08:01 269819                     /lib64/libcrypt-2.11.3.so
7fdc4de74000-7fdc4e073000 ---p 0000c000 08:01 269819                     /lib64/libcrypt-2.11.3.so
7fdc4e073000-7fdc4e074000 r--p 0000b000 08:01 269819                     /lib64/libcrypt-2.11.3.so
7fdc4e074000-7fdc4e075000 rw-p 0000c000 08:01 269819                     /lib64/libcrypt-2.11.3.so
7fdc4e075000-7fdc4e0a3000 rw-p 00000000 00:00 0
7fdc4e0a3000-7fdc4e0a5000 r-xp 00000000 08:01 273802                     /lib64/libdl-2.11.3.so
7fdc4e0a5000-7fdc4e2a5000 ---p 00002000 08:01 273802                     /lib64/libdl-2.11.3.so
7fdc4e2a5000-7fdc4e2a6000 r--p 00002000 08:01 273802                     /lib64/libdl-2.11.3.so
7fdc4e2a6000-7fdc4e2a7000 rw-p 00003000 08:01 273802                     /lib64/libdl-2.11.3.so
7fdc4e2a7000-7fdc4e302000 r-xp 00000000 08:01 273803                     /lib64/libm-2.11.3.so
7fdc4e302000-7fdc4e501000 ---p 0005b000 08:01 273803                     /lib64/libm-2.11.3.so
7fdc4e501000-7fdc4e502000 r--p 0005a000 08:01 273803                     /lib64/libm-2.11.3.so
7fdc4e502000-7fdc4e520000 rw-p 0005b000 08:01 273803                     /lib64/libm-2.11.3.so
7fdc4e520000-7fdc4e6bf000 r-xp 00000000 08:01 286216                     /usr/lib/perl5/5.10.0/x86_64-linux-thread-multi/CORE/libperl.so
7fdc4e6bf000-7fdc4e8bf000 ---p 0019f000 08:01 286216                     /usr/lib/perl5/5.10.0/x86_64-linux-thread-multi/CORE/libperl.so
7fdc4e8bf000-7fdc4e8c4000 r--p 0019f000 08:01 286216                     /usr/lib/perl5/5.10.0/x86_64-linux-thread-multi/CORE/libperl.so
7fdc4e8c4000-7fdc4e8c9000 rw-p 001a4000 08:01 286216                     /usr/lib/perl5/5.10.0/x86_64-linux-thread-multi/CORE/libperl.so
7fdc4e8c9000-7fdc4e8e0000 r-xp 00000000 08:01 269841                     /lib64/libpthread-2.11.3.so
7fdc4e8e0000-7fdc4eae0000 ---p 00017000 08:01 269841                     /lib64/libpthread-2.11.3.so
7fdc4eae0000-7fdc4eae1000 r--p 00017000 08:01 269841                     /lib64/libpthread-2.11.3.so
7fdc4eae1000-7fdc4eae2000 rw-p 00018000 08:01 269841                     /lib64/libpthread-2.11.3.so
7fdc4eae2000-7fdc4eae6000 rw-p 00000000 00:00 0
7fdc4eae6000-7fdc4eb02000 r-xp 00000000 08:01 439535                     /usr/lib64/libgearman.so.6.0.0
7fdc4eb02000-7fdc4ed01000 ---p 0001c000 08:01 439535                     /usr/lib64/libgearman.so.6.0.0
7fdc4ed01000-7fdc4ed02000 r--p 0001b000 08:01 439535                     /usr/lib64/libgearman.so.6.0.0
7fdc4ed02000-7fdc4ed03000 rw-p 0001c000 08:01 439535                     /usr/lib64/libgearman.so.6.0.0
7fdc4ed03000-7fdc4ed22000 r-xp 00000000 08:01 270275                     /lib64/ld-2.11.3.so
7fdc4eee5000-7fdc4eeed000 rw-p 00000000 00:00 0
7fdc4ef1e000-7fdc4ef1f000 rw-s 00000000 00:04 7831555                    /SYSV00007bb5 (deleted)
7fdc4ef1f000-7fdc4ef21000 rw-p 00000000 00:00 0
7fdc4ef21000-7fdc4ef22000 r--p 0001e000 08:01 270275                     /lib64/ld-2.11.3.so
7fdc4ef22000-7fdc4ef23000 rw-p 0001f000 08:01 270275                     /lib64/ld-2.11.3.so
7fdc4ef23000-7fdc4ef24000 rw-p 00000000 00:00 0
7ffffd243000-7ffffd28f000 rw-p 00000000 00:00 0                          [stack]
7ffffd2fd000-7ffffd2fe000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
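To make the backtrace above easier to read, the following sketch pulls the module, symbol and address out of each glibc frame line and keeps only the frames from `mod_gearman_worker`. The regular expression is written against the line shapes in this dump only, and the helper names `parse_frames` and `worker_symbols` are made up for this example.

```python
import re

# Matches lines such as:
#   /usr/bin/mod_gearman_worker(add_job_to_queue+0x47d)[0x4090e7]
#   /lib64/libc.so.6(+0x76628)[0x7fdc4db67628]
#   /usr/bin/mod_gearman_worker[0x405b49]
FRAME = re.compile(
    r"^(?P<module>[^(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)\+0x(?P<offset>[0-9a-f]+)\))?"
    r"\[0x(?P<addr>[0-9a-f]+)\]$"
)


def parse_frames(trace: str):
    """Return (module, symbol, address) for every frame line in the trace."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.match(line.strip())
        if m:
            # Frames without a resolved symbol are reported as "?".
            frames.append((m["module"], m["symbol"] or "?", m["addr"]))
    return frames


def worker_symbols(trace: str):
    """Symbols of the frames that belong to mod_gearman_worker, innermost first."""
    return [
        symbol
        for module, symbol, _ in parse_frames(trace)
        if module.endswith("mod_gearman_worker")
    ]


# A few frames copied from the backtrace above.
excerpt = """\
/lib64/libc.so.6(cfree+0x6c)[0x7fdc4db6c5cc]
/usr/bin/mod_gearman_worker(add_job_to_queue+0x47d)[0x4090e7]
/usr/bin/mod_gearman_worker(send_result_back+0x3c1)[0x40b3b3]
/usr/bin/mod_gearman_worker[0x405b49]
"""
print(worker_symbols(excerpt))
```

The innermost worker frame is `add_job_to_queue`, called from `send_result_back`, so the double free is reported while the worker is sending a check result back.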

/etc/init.d/icinga.d/icingat3 status
Icinga is not running but subsystem locked

Icinga Start Process with debug enabled:

Running configuration check...OK
Stopping Icinga: Waiting for icinga to exit .Stopping icinga done.
Starting icinga: [2015-03-03 11:02:36][31706][TRACE] parse_args_line(logfile=/usr/local/icinga/instances/icingat3/var/log/mod_gearman/mod_gearman_neb.log, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(server=localhost:4733, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(eventhandler=yes, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(services=yes, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(hosts=yes, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(do_hostchecks=yes, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(route_eventhandler_like_checks=no, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(encryption=yes, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(key=XXXXXXXX, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(use_uniq_jobs=on, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(localhostgroups=, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(localservicegroups=, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(result_workers=1, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(perfdata=no, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(perfdata_mode=1, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(orphan_host_checks=yes, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(orphan_service_checks=yes, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(accept_clear_results=no, 1)
Starting icinga done.
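The TRACE lines above show every option the NEB module parsed at startup. As a rough sketch, they can be turned back into a key/value map for comparing instances. This assumes the `parse_args_line(key=value, N)` shape seen in this log; the function name `options_from_trace` is made up for this example.

```python
import re

# Matches the "parse_args_line(key=value, N)" part at the end of a TRACE line.
OPTION = re.compile(r"parse_args_line\((?P<key>[^=,]+)=(?P<value>.*), \d+\)\s*$")


def options_from_trace(log: str) -> dict:
    """Collect the options reported by parse_args_line TRACE lines."""
    options = {}
    for line in log.splitlines():
        m = OPTION.search(line)
        if m:
            options[m["key"]] = m["value"]  # empty values are kept as ""
    return options


# Three lines copied from the startup log above.
excerpt = """\
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(server=localhost:4733, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(localhostgroups=, 1)
[2015-03-03 11:02:36][31706][TRACE] parse_args_line(result_workers=1, 1)
"""
for key, value in options_from_trace(excerpt).items():
    print(f"{key} = {value!r}")
```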

sni commented 9 years ago

This should be fixed with the latest release.