sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Crash Icinga #86

Closed lvasiliev closed 8 years ago

lvasiliev commented 8 years ago

Hi!

Icinga dumps core when the gearman job server is not responding.

icinga-1.13.3_1 mod_gearman: initialized version 1.5.5 (libgearman 1.0.6)

Loaded symbols for /libexec/ld-elf.so.1
#0  0x0000000801878f5c in sbrk () from /lib/libc.so.7
[New Thread 80200c000 (LWP 101234/<unknown>)]
[New Thread 80200bc00 (LWP 101222/<unknown>)]
[New Thread 80200b800 (LWP 101079/<unknown>)]
[New Thread 80200b400 (LWP 101038/<unknown>)]
[New Thread 80200b000 (LWP 101033/<unknown>)]
[New Thread 80200ac00 (LWP 101024/<unknown>)]
[New Thread 80200a800 (LWP 101019/<unknown>)]
[New Thread 80200a400 (LWP 100947/<unknown>)]
[New Thread 80200a000 (LWP 100883/<unknown>)]
[New Thread 802009c00 (LWP 100841/<unknown>)]
[New Thread 802009800 (LWP 100813/<unknown>)]
[New Thread 802007800 (LWP 100273/<unknown>)]
[New Thread 802007400 (LWP 100224/<unknown>)]
[New Thread 802006400 (LWP 105941/<unknown>)]
(gdb) bt
#0  0x0000000801878f5c in sbrk () from /lib/libc.so.7
#1  0x0000000801862c93 in syscall () from /lib/libc.so.7
#2  0x0000000801862ac6 in syscall () from /lib/libc.so.7
#3  0x000000080187f7e4 in malloc () from /lib/libc.so.7
#4  0x0000000801917001 in reallocf () from /lib/libc.so.7
#5  0x00000008018fefb7 in __srget () from /lib/libc.so.7
#6  0x00000008018f2f48 in fgets () from /lib/libc.so.7
#7  0x00000008018c27c0 in getaddrinfo () from /lib/libc.so.7
#8  0x00000008018e8d9f in nsdispatch () from /lib/libc.so.7
#9  0x00000008018c18fc in getaddrinfo () from /lib/libc.so.7
#10 0x0000000802667132 in gearman_connection_st::lookup (this=0x80640e000) at libgearman/connection.cc:646
#11 0x0000000802667d08 in gearman_connection_st::flush (this=0x80640e000) at libgearman/connection.cc:667
#12 0x0000000802667998 in gearman_connection_st::_send_packet (this=0x80640e000, packet_arg=<value optimized out>, flush_buffer=<value optimized out>) at libgearman/connection.cc:592
#13 0x000000080266767a in gearman_connection_st::send_packet (this=0x80640e000, packet_arg=@0x806400000, flush_buffer=<value optimized out>) at libgearman/connection.cc:465
#14 0x000000080266ffe6 in gearman_worker_grab_job (worker=0x7fffdfdfc418, job=0x0) at libgearman/worker.cc:695
#15 0x000000080267045f in gearman_worker_work (worker=0x7fffdfdfc418) at libgearman/worker.cc:976
#16 0x000000080240ee00 in result_worker (data=<value optimized out>) at neb_module/result_thread.c:61
#17 0x0000000800b367c5 in pthread_create () from /lib/libthr.so.3
#18 0x0000000000000000 in ?? ()
(gdb) up 10
#10 0x0000000802667132 in gearman_connection_st::lookup (this=0x80640e000) at libgearman/connection.cc:646
646     libgearman/connection.cc: No such file or directory.
        in libgearman/connection.cc
Current language:  auto; currently c++
(gdb) info locals
port_str = "4830", '\0' <repeats 27 times>
ai = {ai_flags = 0, ai_family = 0, ai_socktype = 1, ai_protocol = 6, ai_addrlen = 0, ai_canonname = 0x0, ai_addr = 0x0, ai_next = 0x0}
port_str_length = <value optimized out>
ret = <value optimized out>
(gdb)
lvasiliev commented 8 years ago
(gdb) up 16
#16 0x000000080240ee00 in result_worker (data=<value optimized out>) at neb_module/result_thread.c:61
61              ret = gearman_worker_work( &worker );
Current language:  auto; currently minimal
(gdb) info locals
__cleanup_info__ = {pthread_cleanup_pad = {0, 34397548320, 140736949371928, 0, 0, 0, 0, 0}}
worker = {options = {allocated = false, non_blocking = false, packet_init = true, change = false, grab_uniq = true, grab_all = true, timeout_return = false}, 
  state = GEARMAN_WORKER_STATE_START, work_state = GEARMAN_WORKER_WORK_UNIVERSAL_GRAB_JOB, function_count = 2, job_count = 0, work_result_size = 0, context = 0x0, con = 0x80640e000, 
  job = 0x0, job_list = 0x0, function = 0x80641d1c0, function_list = 0x80641d1c0, work_function = 0x0, work_result = 0x0, universal = {options = {dont_track_packets = false, 
      non_blocking = false}, verbose = GEARMAN_VERBOSE_NEVER, con_count = 1, packet_count = 4, pfds_size = 0, sending = 0, timeout = -1, con_list = 0x80640e000, 
    server_options_list = 0x0, packet_list = 0x80641d1f8, pfds = 0x0, log_fn = 0, log_context = 0x0, allocator = {calloc = 0, free = 0, malloc = 0, realloc = 0, context = 0x0}, 
    _namespace = 0x0, error = {rc = GEARMAN_SUCCESS, last_errno = 0, last_error = "\000lush(Permission denied) connect -> libgearman/connection.cc:747", '\0' <repeats 1983 times>}, 
    wakeup_fd = {18, 19}}, grab_job = {options = {allocated = false, complete = true, free_data = false}, magic = GEARMAN_MAGIC_REQUEST, command = GEARMAN_COMMAND_GRAB_JOB_ALL, 
    argc = 0 '\0', args_size = 12, data_size = 0, universal = 0x7fffdfdfc478, next = 0x0, prev = 0x7fffdfdfce50, args = 0x7fffdfdfcdd0 "", data = 0x0, arg = {0x0, 0x0, 0x0, 0x0, 0x0, 
      0x0, 0x0, 0x0}, arg_size = {0, 0, 0, 0, 0, 0, 0, 0}, args_buffer = "\000REQ\000\000\000'", '\0' <repeats 119 times>}, pre_sleep = {options = {allocated = false, 
      complete = true, free_data = false}, magic = GEARMAN_MAGIC_REQUEST, command = GEARMAN_COMMAND_PRE_SLEEP, argc = 0 '\0', args_size = 12, data_size = 0, 
    universal = 0x7fffdfdfc478, next = 0x7fffdfdfcd08, prev = 0x80641d3b8, args = 0x7fffdfdfcf18 "", data = 0x0, arg = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, arg_size = {0, 0, 0, 
      0, 0, 0, 0, 0}, args_buffer = "\000REQ\000\000\000\004", '\0' <repeats 119 times>}, work_job = 0x0}
worker_num = <value optimized out>
ret = GEARMAN_TOO_MANY_ARGS
(gdb)
sni commented 8 years ago

There was a similar issue fixed in 1.5.3. However, this seems to fail deep inside libgearman. Can you update libgearman, or downgrade it? The preferred libgearman version is still 0.35, which has proven stable for some years now.

lvasiliev commented 8 years ago

In this case, the gearman job server had been added to the ipfw firewall, so it was unreachable.

lvv@icinga:~ % sockstat | grep ':4830' | wc -l
   36875
[Thu Mar 17 15:30:06 2016] Error: Unable to open temp file '/var/spool/icinga/ramdisk/icinga.tmpLEOJcZ' for writing status data: Too many open files
[Thu Mar 17 15:30:16 2016] Error: Unable to open temp file '/var/spool/icinga/ramdisk/icinga.tmpK1vONu' for writing status data: Too many open files
[Thu Mar 17 15:30:26 2016] Error: Unable to open temp file '/var/spool/icinga/ramdisk/icinga.tmpjPEK5C' for writing status data: Too many open files
[Thu Mar 17 15:30:36 2016] Error: Unable to open temp file '/var/spool/icinga/ramdisk/icinga.tmpgFPjjm' for writing status data: Too many open files
lvasiliev commented 8 years ago

OK, I'll try gearmand-devel-1.1.8, but I'm not sure it will help.

lvasiliev commented 8 years ago

Same bug... mod_gearman: initialized version 1.5.5 (libgearman 1.1.8)

(gdb) bt
#0  0x0000000801878f5c in sbrk () from /lib/libc.so.7
#1  0x0000000801862c93 in syscall () from /lib/libc.so.7
#2  0x0000000801862ac6 in syscall () from /lib/libc.so.7
#3  0x000000080187f5be in malloc () from /lib/libc.so.7
#4  0x00000008018c4052 in getaddrinfo () from /lib/libc.so.7
#5  0x00000008018c3c4a in getaddrinfo () from /lib/libc.so.7
#6  0x00000008018c2ff5 in getaddrinfo () from /lib/libc.so.7
#7  0x00000008018e8d9f in nsdispatch () from /lib/libc.so.7
#8  0x00000008018c18fc in getaddrinfo () from /lib/libc.so.7
#9  0x0000000802667911 in gearman_connection_st::lookup (this=0x80642f000) at libgearman/connection.cc:683
#10 0x00000008026687c8 in gearman_connection_st::flush (this=0x80642f000) at libgearman/connection.cc:728
#11 0x000000080266845d in gearman_connection_st::_send_packet (this=0x80642f000, packet_arg=<value optimized out>, flush_buffer=<value optimized out>) at libgearman/connection.cc:638
#12 0x0000000802668128 in gearman_connection_st::send_packet (this=<value optimized out>, packet_arg=<value optimized out>, flush_buffer=<value optimized out>)
    at libgearman/connection.cc:515
#13 0x0000000802671836 in gearman_worker_grab_job (worker=0x7fffdfdfcf90, job=0x0) at libgearman/worker.cc:711
#14 0x0000000802671d52 in gearman_worker_work (worker=0x7fffdfdfcf90) at libgearman/worker.cc:993
#15 0x000000080240edf0 in result_worker (data=<value optimized out>) at neb_module/result_thread.c:61
#16 0x0000000800b367c5 in pthread_create () from /lib/libthr.so.3
#17 0x0000000000000000 in ?? ()
(gdb)
lvasiliev commented 8 years ago

I'm not sure, but maybe a condition for ret == GEARMAN_TOO_MANY_ARGS needs to be added in neb_module/result_thread.c?

    while ( 1 ) {
        ret = gearman_worker_work( &worker );
        if ( ret != GEARMAN_SUCCESS && ret != GEARMAN_WORK_FAIL ) {
            if ( ret != GEARMAN_TIMEOUT)
                gm_log( GM_LOG_ERROR, "worker error: %s\n", gearman_worker_error( &worker ) );
            gearman_job_free_all( &worker );
            if ( ret == GEARMAN_TIMEOUT || ret == GEARMAN_TOO_MANY_ARGS ) {
                gearman_worker_unregister_all(&worker);
                gearman_worker_remove_servers(&worker);
lvasiliev commented 8 years ago

When the job server is not available, the number of sockets in SYN_SENT state keeps growing and Icinga stops working.

icinga# netstat -an | grep '.4830' | grep SYN_SENT | wc -l
   35434

[2016-03-17 18:48:26][43226][ERROR] sending job to gearmand failed: flush(GEARMAN_COULD_NOT_CONNECT) gworker:4830 -> libgearman/connection.cc:811 (346 lost jobs so far)
[2016-03-17 18:49:26][43226][ERROR] sending job to gearmand failed: flush(GEARMAN_COULD_NOT_CONNECT) gworker:4830 -> libgearman/connection.cc:811 (413 lost jobs so far)

[Thu Mar 17 18:50:05 2016] Error: Unable to open temp file '/var/spool/icinga/ramdisk/icinga.tmpwH3825' for writing status data: Too many open files
[Thu Mar 17 18:50:15 2016] Error: Unable to open temp file '/var/spool/icinga/ramdisk/icinga.tmpK1DPi9' for writing status data: Too many open files
[Thu Mar 17 18:50:25 2016] Error: Unable to open temp file '/var/spool/icinga/ramdisk/icinga.tmpBgTeHz' for writing status data: Too many open files
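
The "Too many open files" errors suggest the process is running into its file descriptor limit as the half-open sockets pile up. A minimal sketch for checking this from inside a process, assuming a POSIX system; `fd_soft_limit` and `count_open_fds` are hypothetical helper names and not part of Icinga or mod_gearman:

```c
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <sys/resource.h>

/* Hypothetical helper: soft limit on open file descriptors for this
 * process, clamped to LONG_MAX, or -1 if getrlimit() fails. */
long fd_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur > (rlim_t)LONG_MAX)   /* also covers RLIM_INFINITY */
        return LONG_MAX;
    return (long)rl.rlim_cur;
}

/* Hypothetical helper: count descriptors in [0, max_fd) that are
 * currently open, by probing each one with fcntl(F_GETFD). */
int count_open_fds(int max_fd)
{
    int fd, open_count = 0;

    assert(max_fd >= 0);
    for (fd = 0; fd < max_fd; fd++) {
        if (fcntl(fd, F_GETFD) != -1)
            open_count++;
    }
    return open_count;
}
```

If `count_open_fds()` approaches `fd_soft_limit()`, any further `open()` or `socket()` call in that process can be expected to fail with EMFILE, which is consistent with the status file errors above.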
sni commented 8 years ago

That's why I really recommend running gearmand on the same server, next to Icinga.

lvasiliev commented 8 years ago

That's very bad. I think a slow-down mechanism needs to be added for this.
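
One way such a slow-down could look is a capped exponential back-off between reconnect attempts. This is only a sketch under assumptions: `backoff_delay` is a hypothetical helper, not an existing mod_gearman or libgearman function, and the base and cap values are illustrative.

```c
#include <assert.h>

/* Hypothetical helper, not part of mod_gearman: delay in seconds to
 * wait before reconnect attempt number `attempt` (0-based). The delay
 * starts at base_s, doubles on every failed attempt, and is capped at
 * max_s so that retries never stop entirely. */
unsigned int backoff_delay(unsigned int attempt,
                           unsigned int base_s,
                           unsigned int max_s)
{
    unsigned int i;
    unsigned int delay = base_s;

    assert(base_s > 0 && max_s >= base_s);
    for (i = 0; i < attempt && delay < max_s; i++) {
        /* Compare against max_s / 2 first so delay * 2 cannot overflow. */
        delay = (delay > max_s / 2) ? max_s : delay * 2;
    }
    return delay;
}
```

A loop that reconnects after a failure could sleep for `backoff_delay(failures, 1, 60)` seconds before the next attempt and reset `failures` to 0 once a connection succeeds, so an unreachable job server would produce at most one new connection attempt per minute per thread instead of a continuous stream of SYN_SENT sockets.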