grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0
380 stars 107 forks source link

relay stalls on high concurrent tcp connections #318

Closed StephenPCG closed 7 years ago

StephenPCG commented 7 years ago

Code Version: master HEAD as of now Configuration:

cluster default
    forward
        127.0.0.1:3005
    ;

match *
    send to default
    stop
    ;

There are around 2000 collectd sending metrics to carbon-c-relay, and carbon-c-relay seg faulted soon after started. The following is gdb output:

gdb --args ~/relay.master -f ./relay-1.conf -c -_:#@ -w 32
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/relay.master...done.
(gdb) r
Starting program: /root/relay.master -f ./relay-1.conf -c -_:\#@ -w 32
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[2017-10-14 13:48:43] starting carbon-c-relay v3.1 (038004-dirty), pid=699
configuration:
    relay hostname = HOSTNAME
    workers = 32
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    server connection IO timeout = 600ms
    extra allowed characters = -_:#@
    configuration = ./relay-1.conf

parsed configuration follows:
listen
    type linemode
        2003 proto tcp
        2003 proto udp
        /tmp/.s.carbon-c-relay.2003 proto unix
    ;

statistics
    submit every 60 seconds
    prefix with carbon.relays.HOSTNAME
    ;

cluster default
    forward
        127.0.0.1:3005
    ;

match *
    send to default
    stop
    ;

[2017-10-14 13:48:43] listening on tcp4 0.0.0.0 port 2003
[2017-10-14 13:48:43] listening on tcp6 :: port 2003
[2017-10-14 13:48:43] listening on udp4 0.0.0.0 port 2003
[2017-10-14 13:48:43] listening on udp6 :: port 2003
[2017-10-14 13:48:43] listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[New Thread 0x7ffff63e0700 (LWP 705)]
[2017-10-14 13:48:43] starting 32 workers
[New Thread 0x7ffff5bdf700 (LWP 706)]
(snip similar contents ...)
[New Thread 0x7fffcffff700 (LWP 745)]
[2017-10-14 13:48:43] starting statistics collector
[New Thread 0x7fffcf7fe700 (LWP 746)]
[2017-10-14 13:48:43] starting servers
[New Thread 0x7fffceffd700 (LWP 747)]
[2017-10-14 13:48:43] startup sequence complete

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdbfff700 (LWP 730)]
sockread (strm=0x7ffff63e1010, buf=0x7fffcc5c8049, sze=8191) at dispatcher.c:125
125 dispatcher.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64
StephenPCG commented 7 years ago

I was using tag v3.1 in my production. It was fine until hours ago, we provisioned more servers, so more collectd agents were sending metrics. Soon after that carbon-c-relay stopped working. The carbon-c-relay process didn't quit, didn't print out any abnormal log, from top, the process's CPU was 0%, mem RSS was about 50MB which I think is normal. go-carbon which lives in the next level of the stack, does not receive any data after relay stalled.

So I tried the master HEAD code, the master code dies with seg fault, instead of silent stall. I'm not sure if they are of the same problem.

bingyulei007 commented 7 years ago

mark

StephenPCG commented 7 years ago

I think they were two different problems.

I tried adding the listen section in conf to let relay just listen on tcp port:

listen
    type linemode
        2003 proto tcp
    ;

Then the master code behaved just like v3.1, stalled soon after startup. And ctrl-c would't stop the process, must kill with kill -9:

[2017-10-14 14:50:22] listening on tcp4 0.0.0.0 port 2003
[2017-10-14 14:50:22] listening on tcp6 :: port 2003
[2017-10-14 14:50:22] starting 32 workers
[2017-10-14 14:50:22] starting statistics collector
[2017-10-14 14:50:22] starting servers
[2017-10-14 14:50:22] startup sequence complete
^C[2017-10-14 14:51:15] caught SIGINT
[2017-10-14 14:51:15] shutting down...
[2017-10-14 14:51:15] dispatch: could not find listener for socket, rejecting connection
^C^C^C^C

I then tried to let relay listen on udp port only,

listen
    type linemode
    2003 proto udp
    ;

And the master code seg faulted soon:

[2017-10-14 14:53:42] listening on udp4 0.0.0.0 port 2003
[2017-10-14 14:53:42] listening on udp6 :: port 2003
[New Thread 0x7ffff63e0700 (LWP 4908)]
[2017-10-14 14:53:42] starting 32 workers

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff63e0700 (LWP 4908)]
dispatch_runner (arg=0x55555577c160) at dispatcher.c:867
867 dispatcher.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64
(gdb)
grobian commented 7 years ago

OK, thanks, I'll have a look at this. The stall is yet a bit unclear to me, but the crash I know where to look.

StephenPCG commented 7 years ago

@grobian Do you have any suggestions for me on how to find the cause of the stall?

As my previous test, after relay stalled, ctrl-c wouldn't quit the relay process, seems like there were dead locks, maybe you can start from this point, check what relay is waiting for while quitting which may possibly in a dead lock. I am not familiar with C, so I'm not able to trace the code myself, but I can help if you give me detailed commands to run.

grobian commented 7 years ago

what would help me, is a c-relay built with debugging symbols (and preferably also the code still available) and run via gdb. Then when you experience a stall, ctrl-c it, and produce a full backtrace for all threads (thread apply all bt)

StephenPCG commented 7 years ago

@grobian How to build c-relay with debugging symbols? I was building the binary with simply make, what options should pass to make?

StephenPCG commented 7 years ago

I compiled relay with the following command:

CFLAGS='-ggdb' ./configure
make

And following is gdb output of thread apply all bt after stall:

[2017-10-14 21:43:57] startup sequence complete
^C
Program received signal SIGINT, Interrupt.
0x00007ffff75b942d in nanosleep () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64
(gdb)
(gdb)
(gdb)
(gdb) thread apply all bt

Thread 37 (Thread 0x7fffccff9700 (LWP 1805)):
#0  0x00007ffff75b942d in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75ea064 in usleep () from /lib64/libc.so.6
#2  0x0000555555573569 in server_queuereader (d=0x5555557857a0) at server.c:282
#3  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 36 (Thread 0x7fffcd7fa700 (LWP 1804)):
#0  0x00007ffff75b942d in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75b92c4 in sleep () from /lib64/libc.so.6
#2  0x0000555555575e99 in collector_runner (s=0x555555790340) at collector.c:144
#3  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 35 (Thread 0x7fffcdffb700 (LWP 1803)):
#0  0x00007ffff75b942d in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75ea064 in usleep () from /lib64/libc.so.6
#2  0x0000555555573569 in server_queuereader (d=0x555555790340) at server.c:282
#3  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 34 (Thread 0x7fffce7fc700 (LWP 1802)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec06b6e8, buf=0x7fff59e35e71, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e35e38, self=0x555555790080, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555790080) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 33 (Thread 0x7fffceffd700 (LWP 1801)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec044f40, buf=0x7fff59e0f6c9, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e0f690, self=0x55555578fdc0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578fdc0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 32 (Thread 0x7fffcf7fe700 (LWP 1800)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec073fb8, buf=0x7fff59e3e741, sze=8191) at dispatcher.c:125
---Type <return> to continue, or q <return> to quit---
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e3e708, self=0x55555578fb00, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578fb00) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 31 (Thread 0x7fffcffff700 (LWP 1799)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec078420, buf=0x7fff59e42ba9, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e42b70, self=0x55555578f840, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578f840) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 30 (Thread 0x7fffe4ff1700 (LWP 1798)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec09ebc8, buf=0x7fff59e69351, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e69318, self=0x55555578f580, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578f580) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 29 (Thread 0x7fffe57f2700 (LWP 1797)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec07c888, buf=0x7fff59e47011, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e46fd8, self=0x55555578f2c0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578f2c0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 28 (Thread 0x7fffe5ff3700 (LWP 1796)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec08da28, buf=0x7fff59e581b1, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e58178, self=0x55555578f000, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578f000) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 27 (Thread 0x7fffe67f4700 (LWP 1795)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec0a3030, buf=0x7fff59e6d7b9, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e6d780, self=0x55555578ed40, start=...) at dispatcher.c:677
---Type <return> to continue, or q <return> to quit---
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578ed40) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x7fffe6ff5700 (LWP 1794)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec05e9b0, buf=0x7fff59e29139, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e29100, self=0x55555578ea80, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578ea80) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7fffe77f6700 (LWP 1793)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec03c670, buf=0x7fff59e06df9, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e06dc0, self=0x55555578e7c0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578e7c0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7fffe7ff7700 (LWP 1792)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec09a760, buf=0x7fff59e64ee9, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e64eb0, self=0x55555578e500, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578e500) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x7fffe87f8700 (LWP 1791)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec033da0, buf=0x7fff59dfe529, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59dfe4f0, self=0x55555578e240, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x55555578e240) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x7fffe8ff9700 (LWP 1790)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec02b4d0, buf=0x7fff59df5c59, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59df5c20, self=0x5555557887f0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x5555557887f0) at dispatcher.c:947
---Type <return> to continue, or q <return> to quit---
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7fffe97fa700 (LWP 1789)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec0493a8, buf=0x7fff59e13b31, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e13af8, self=0x555555788770, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555788770) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7fffe9ffb700 (LWP 1788)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec091e90, buf=0x7fff59e5c619, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e5c5e0, self=0x555555785160, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555785160) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7fffea7fc700 (LWP 1787)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec06fb50, buf=0x7fff59e3a2d9, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e3a2a0, self=0x555555785360, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555785360) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7fffeaffd700 (LWP 1786)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec022c00, buf=0x7fff59ded389, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59ded350, self=0x5555557886f0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x5555557886f0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fffeb7fe700 (LWP 1785)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec0962f8, buf=0x7fff59e60a81, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e60a48, self=0x555555788430, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555788430) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fffebfff700 (LWP 1784)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec0a7498, buf=0x7fff59e71c21, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e71be8, self=0x555555788170, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555788170) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7ffff0cee700 (LWP 1783)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec051c78, buf=0x7fff59e1c401, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e1c3c8, self=0x555555787eb0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555787eb0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7ffff14ef700 (LWP 1782)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec04d810, buf=0x7fff59e17f99, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e17f60, self=0x555555787bf0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555787bf0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7ffff1cf0700 (LWP 1781)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec02f938, buf=0x7fff59dfa0c1, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59dfa088, self=0x555555787930, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555787930) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7ffff24f1700 (LWP 1780)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec062e18, buf=0x7fff59e2d5a1, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e2d568, self=0x555555787670, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555787670) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6
---Type <return> to continue, or q <return> to quit---

Thread 11 (Thread 0x7ffff2cf2700 (LWP 1779)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec0895c0, buf=0x7fff59e53d49, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e53d10, self=0x5555557873b0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x5555557873b0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7ffff34f3700 (LWP 1778)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec085158, buf=0x7fff59e4f8e1, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e4f8a8, self=0x5555557870f0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x5555557870f0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7ffff3cf4700 (LWP 1777)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec027068, buf=0x7fff59df17f1, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59df17b8, self=0x555555786e30, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555786e30) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7ffff44f5700 (LWP 1776)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec067280, buf=0x7fff59e31a09, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e319d0, self=0x555555786b70, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555786b70) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7ffff4cf6700 (LWP 1775)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec0560e0, buf=0x7fff59e20869, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e20830, self=0x5555557868b0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x5555557868b0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

---Type <return> to continue, or q <return> to quit---
Thread 6 (Thread 0x7ffff54f7700 (LWP 1774)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec080cf0, buf=0x7fff59e4b479, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e4b440, self=0x5555557865f0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x5555557865f0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7ffff5cf8700 (LWP 1773)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec040ad8, buf=0x7fff59e0b261, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e0b228, self=0x555555786330, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555786330) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7ffff64f9700 (LWP 1772)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec05a548, buf=0x7fff59e24cd1, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e24c98, self=0x555555786070, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555786070) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7ffff6cfa700 (LWP 1771)):
#0  0x00007ffff78cb22d in read () from /lib64/libpthread.so.0
#1  0x000055555555c781 in sockread (strm=0x7fffec038208, buf=0x7fff59e02991, sze=8191) at dispatcher.c:125
#2  0x000055555555d546 in dispatch_connection (conn=0x7fff59e02958, self=0x555555785db0, start=...) at dispatcher.c:677
#3  0x000055555555df32 in dispatch_runner (arg=0x555555785db0) at dispatcher.c:947
#4  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7ffff74fb700 (LWP 1770)):
#0  0x00007ffff75e7bcd in poll () from /lib64/libc.so.6
#1  0x000055555555dc27 in dispatch_runner (arg=0x555555785af0) at dispatcher.c:876
#2  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff7fef740 (LWP 1766)):
#0  0x00007ffff75b942d in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75b92c4 in sleep () from /lib64/libc.so.6
---Type <return> to continue, or q <return> to quit---
#2  0x0000555555559aca in main (argc=3, argv=0x7fffffffe4b8) at relay.c:892
StephenPCG commented 7 years ago

Not sure if this information would help, I used nc to listen to a port to receive data from carbon-c-relay, nc exited after some time, so I assume it's relay that closed the connection.

$ nc -l 3005 | dd of=/dev/null
102+294699 records in
88845+1 records out
45489078 bytes (45 MB) copied, 14.1395 s, 3.2 MB/s
grobian commented 7 years ago

This info will help, thanks! The relay closes "idle" connections after 10sec or so. It only does this when it is still "working" on a server connection. So likely it stalls somewhere in the dispatching.

grobian commented 7 years ago

In your scenario there is just a single server, which is sleeping because nothing comes in. All the dispatchers (a lot actually) are reading, which is highly suspicious (none are actually working).

grobian commented 7 years ago

If possible, I'd like to have a gdb backtrace for the udp crash.

StephenPCG commented 7 years ago

@grobian gdb backtrace for the udp crash:

$ gdb --args ./relay -f relay-1.conf -w 1
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /tmp/carbon-c-relay/relay...done.
(gdb) r
Starting program: /tmp/carbon-c-relay/./relay -f relay-1.conf -w 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[2017-10-16 09:35:41] starting carbon-c-relay v3.1 (038004-dirty), pid=25236
configuration:
    relay hostname = HOSTNAME
    workers = 1
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    server connection IO timeout = 600ms
    configuration = relay-1.conf

parsed configuration follows:
listen
    type linemode
        2003 proto udp
    ;

statistics
    submit every 60 seconds
    prefix with carbon.relays.HOSTNAME
    ;

cluster default
    forward
        127.0.0.1:3005
    ;

match *
    send to default
    stop
    ;

[2017-10-16 09:35:41] listening on udp4 0.0.0.0 port 2003
[2017-10-16 09:35:41] listening on udp6 :: port 2003
[New Thread 0x7ffff63e0700 (LWP 25241)]
[2017-10-16 09:35:41] starting 1 workers

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff63e0700 (LWP 25241)]
0x000055555555db93 in dispatch_runner (arg=0x555555785160) at dispatcher.c:867
867                 if (listeners[c] == NULL)
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64
(gdb) thread apply all bt

Thread 2 (Thread 0x7ffff63e0700 (LWP 25241)):
#0  0x000055555555db93 in dispatch_runner (arg=0x555555785160) at dispatcher.c:867
#1  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#2  0x00007ffff75f221d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff7fef740 (LWP 25236)):
#0  0x00007ffff75f21e1 in clone () from /lib64/libc.so.6
#1  0x00007ffff78c3f7a in do_clone.constprop.4 () from /lib64/libpthread.so.0
#2  0x00007ffff78c5768 in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000055555555e1ff in dispatch_new (id=0 '\000', type=CONNECTION, r=0x555555785090, allowed_chars=0x55555557abd3 "-_:#") at dispatcher.c:1008
#4  0x000055555555e285 in dispatch_new_connection (id=0 '\000', r=0x555555785090, allowed_chars=0x55555557abd3 "-_:#") at dispatcher.c:1039
#5  0x00005555555597dd in main (argc=5, argv=0x7fffffffe478) at relay.c:837
StephenPCG commented 7 years ago

@grobian I wrote a small go program to mimic concurrent tcp connections to send metrics to relay, and find an interesting number. If number of connections is less equal to 1024, relay works fine, but if the number is greater than 1024, relay stalls immediately.

Here is the program:

package main

import (
        "flag"
        "fmt"
        "net"
        "sync"
        "time"
)
func worker(id int, wg *sync.WaitGroup, addr string, interval time.Duration, numMetrics int) {
        defer wg.Done()
        conn, err := net.Dial("tcp", addr)
        if err != nil {
                fmt.Printf("worker %d: failed to connect to %s, err: %s\n", id, addr, err)
                return
        }

        for now := range time.Tick(interval) {
                nowUnix := now.Unix()
                var msg string
                for i := 0; i < numMetrics; i++ {
                        msg = msg + fmt.Sprintf("test.worker-%d.metric-%d 1 %d\n", id, i, nowUnix)
                }
                if n, err := conn.Write([]byte(msg)); err != nil {
                        fmt.Printf("worker %d: failed to write tcp\n", id)
                        return
                } else {
                        fmt.Printf("worker %d: sent %d bytes\n", id, n)
                }
        }
}

func main() {
        numConnections := flag.Int("n", 1, "number of connections to send metrics")
        interval := flag.Duration("i", 1*time.Second, "interval of each connection to send metrics")
        numMetrics := flag.Int("m", 1, "number of metrics to send of each connection at each interval")
        remoteAddr := flag.String("a", "127.0.0.1:2003", "remote address, $IP:$PORT")
        flag.Parse()

        fmt.Printf("Num Connections: %d, Interval: %s, Num Metrics per Send: %d\n", *numConnections, *interval, *numMetrics)
        wg := &sync.WaitGroup{}
        for i := 0; i < *numConnections; i++ {
                wg.Add(1)
                go worker(i, wg, *remoteAddr, *interval, *numMetrics)
        }
        wg.Wait()
}

You can save it to stress.go and build it with: go build stress.go, then run:

./stress -n 1025 # use 1025 connections to send metrics
grobian commented 7 years ago

Thanks! This seems related to increasing the connections array.

StephenPCG commented 7 years ago

@grobian I think I may have found the bug. It in dispatch_addconnection().

I added the following code before and after extending the connection list:

                for (i = 0; i < connectionslen; i++) {
                        if (connections[i].sock != 0) {
                                logout("connections[%d].sock = %d, (connections[%d].strm) = %p\n", i, connections[i].sock, i, connections[c].strm);
                        }
                }

And here is the output:

[2017-10-16 15:07:34] connections[0].sock = 5, (connections[0].strm) = 0x20341
[2017-10-16 15:07:34] connections[1].sock = 6, (connections[1].strm) = 0x20341
[2017-10-16 15:07:34] connections[2].sock = 7, (connections[2].strm) = 0x20341
[2017-10-16 15:07:34] connections[3].sock = 8, (connections[3].strm) = 0x20341
[2017-10-16 15:07:34] connections[4].sock = 9, (connections[4].strm) = 0x20341
[2017-10-16 15:07:34] connections[5].sock = 10, (connections[5].strm) = 0x20341
...
[2017-10-16 15:07:34] connection list extended, current len: 256, current taken: 128
[2017-10-16 15:07:34] connections[0].sock = 5, (connections[0].strm) = (nil)
[2017-10-16 15:07:34] connections[1].sock = 6, (connections[1].strm) = (nil)
[2017-10-16 15:07:34] connections[2].sock = 7, (connections[2].strm) = (nil)
[2017-10-16 15:07:34] connections[3].sock = 8, (connections[3].strm) = (nil)
[2017-10-16 15:07:34] connections[4].sock = 9, (connections[4].strm) = (nil)

As you can see, after connections extended, all strm becomes nil. And then, in sockread(), it's effectively calling read(0, buf, sze), which is reading from stdin.

I added the following line in sockread():

logout("read sock: %d\n", *(int *)strm);

And when relay stalls, the last line displayed is read sock: 0, if I press enter, relay prints out another read sock: 0. So, when relay stalls, all of it's dispatch runners are reading from stdin.

grobian commented 7 years ago

Right, thanks for looking into it! I don't understand why it makes the ptrs nil, it should be garbage, or better, stale pointers. But I do see the problem, sockread readers use a pointer to the sock, which possibly changes after a realloc. I need to fix that.

I also need to look into why it is allowed to read from stdin, I though I closed fd 0.

StephenPCG commented 7 years ago

I'm sorry, I had a typo in the debug log, connections[c].strm should be connections[i].strm in the loop.

I think the real problem is here: https://github.com/grobian/carbon-c-relay/blob/master/dispatcher.c#L499

connections[c].strm = &connections[c].sock;

When connections is extended, all connections[*].sock addresses have changed, so all strm pointers are wrong. They may not all be nil.

StephenPCG commented 7 years ago

Maybe this patch should fix?

diff --git a/dispatcher.c b/dispatcher.c
index b4303ba..58ae853 100644
--- a/dispatcher.c
+++ b/dispatcher.c
@@ -506,7 +506,9 @@ dispatch_addconnection(int sock, listener *lsnr)
        }
        connections[c].sock = sock;
        if (lsnr == NULL || lsnr->transport == W_PLAIN) {
-               connections[c].strm = &connections[c].sock;
+               int *strm = malloc(sizeof(sock));
+               *strm = sock;
+               connections[c].strm = strm;
                connections[c].strmread = &sockread;
                connections[c].strmclose = &sockclose;
        }
StephenPCG commented 7 years ago

Plus free():

diff --git a/dispatcher.c b/dispatcher.c
index b4303ba..0a4cf7b 100644
--- a/dispatcher.c
+++ b/dispatcher.c
@@ -128,7 +128,9 @@ sockread(void *strm, void *buf, size_t sze)
 static inline int
 sockclose(void *strm)
 {
-       return close(*((int *)strm));
+       int sock = *((int *)strm);
+       free(strm);
+       return close(sock);
 }

 #ifdef HAVE_GZIP
@@ -506,7 +508,9 @@ dispatch_addconnection(int sock, listener *lsnr)
        }
        connections[c].sock = sock;
        if (lsnr == NULL || lsnr->transport == W_PLAIN) {
-               connections[c].strm = &connections[c].sock;
+               int *strm = malloc(sizeof(sock));
+               *strm = sock;
+               connections[c].strm = strm;
                connections[c].strmread = &sockread;
                connections[c].strmclose = &sockclose;
        }
StephenPCG commented 7 years ago

After scanning the code, I think relay will only close stdin if running in daemon mode (and I think this is reasonable). I have tried running relay in daemon mode, and confirmed that stdin was closed by examing /proc/$(pidof relay)/fd/0.

https://github.com/grobian/carbon-c-relay/blob/master/relay.c#L646

grobian commented 7 years ago

A malloc would indeed solve the problem, and brings it inline with the other impls.

grobian commented 7 years ago

Please let me know if this fixes the problem(s) for you.

StephenPCG commented 7 years ago

relay won't start now :(

Seg fault immediately after start.

Program received signal SIGSEGV, Segmentation fault.
0x000055555555c918 in dispatch_addlistener (lsnr=0x7ffff7fab058) at dispatcher.c:333
333                     if (listeners[c] == NULL) {

I think it's caused by 6e5df81

grobian commented 7 years ago

sweet, this only crashes on Linux, not on Darwin.

grobian commented 7 years ago

thinko, dispatch_new_listener is executed after dispatch_addlistener

grobian commented 7 years ago

fixed, please try again, sorry.

StephenPCG commented 7 years ago

It seems to be fine now, thanks a lot!