grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Segmentation fault if any cluster host can not be resolved #455

Closed: harloprillar closed this issue 4 months ago

harloprillar commented 1 year ago

The server crashes with a segmentation fault if any cluster host cannot be resolved: either immediately at startup if the useall cluster configuration parameter is defined, or after the first attempt to relay a received metric if it is not. Assume a simple config with one cluster host:

cluster default
    any_of useall
      none.example.com:80
    proto tcp transport plain;

match *
    send to
      default
    stop;

statistics
    submit every 60 seconds
    reset counters after interval
    send to default
    stop
    ;

listen
      type linemode transport plain
      2003 proto tcp
    ;

Here is some additional Valgrind output for debugging:

/ # valgrind -q /usr/bin/carbon-c-relay -f /etc/carbon-c-relay/carbon-c-relay.conf
[2023-01-18 14:03:03] starting carbon-c-relay v3.7.4 (855740), pid=92
configuration:
    relay hostname = 270652b01ac5
    workers = 8
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    server connection IO timeout = 600ms
    idle connections disconnect timeout = 10m
    configuration = /etc/carbon-c-relay/carbon-c-relay.conf

==92== Invalid read of size 4
==92==    at 0x119AD7: router_add_server (router.c:621)
==92==    by 0x113A3A: router_yyparse (conffile.y:244)
==92==    by 0x11DF94: router_readconfig (router.c:1345)
==92==    by 0x10D018: main (relay.c:888)
==92==  Address 0x4 is not stack'd, malloc'd or (recently) free'd
==92==
==92==
==92== Process terminating with default action of signal 11 (SIGSEGV)
==92==  Access not within mapped region at address 0x4
==92==    at 0x119AD7: router_add_server (router.c:621)
==92==    by 0x113A3A: router_yyparse (conffile.y:244)
==92==    by 0x11DF94: router_readconfig (router.c:1345)
==92==    by 0x10D018: main (relay.c:888)
==92==  If you believe this happened as a result of a stack
==92==  overflow in your program's main thread (unlikely but
==92==  possible), you can try to increase the size of the
==92==  main thread stack using the --main-stacksize= flag.
==92==  The main thread stack size used in this run was 8388608.
Segmentation fault

grobian commented 1 year ago

hmmm, that's a stupid logic error :(

grobian commented 1 year ago

we must ignore unresolvable addresses for https://github.com/grobian/carbon-c-relay/issues/293

grobian commented 1 year ago

Any chance you could try 1e1032b to see if that solves your problem?

harloprillar commented 1 year ago

It seems the changes resolved the issue only partially: the server no longer crashes at startup, but it still crashes after the first attempt to relay a metric to a host with an unresolvable address.

Server started:

/ # valgrind -q /usr/bin/carbon-c-relay -f /etc/carbon-c-relay/carbon-c-relay.conf
[2023-01-20 09:19:57] starting carbon-c-relay v3.7.4 (1e1032), pid=96
configuration:
    relay hostname = 2afbcff32796
    workers = 8
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    server connection IO timeout = 600ms
    idle connections disconnect timeout = 10m
    configuration = /etc/carbon-c-relay/carbon-c-relay.conf

==96== Conditional jump or move depends on uninitialised value(s)
==96==    at 0x114DB2: router_yyparse (conffile.y:964)
==96==    by 0x11E8CE: router_readconfig (router.c:1344)
==96==    by 0x10D091: main (relay.c:888)
==96==
parsed configuration follows:
listen
    type linemode
        2003 proto tcp
    ;

statistics
    submit every 60 seconds
    reset counters after interval
    prefix with carbon.relays.2afbcff32796
    send to default
    stop
    ;

cluster default
    forward
        none.example.com:80
    ;

match *
    send to default
    stop
    ;

[2023-01-20 09:19:57] listening on tcp4 0.0.0.0 port 2003
[2023-01-20 09:19:57] listening on tcp6 :: port 2003
[2023-01-20 09:19:57] starting 8 workers
[2023-01-20 09:19:57] starting statistics collector
[2023-01-20 09:19:57] starting servers
[2023-01-20 09:19:57] startup sequence complete

Then I try to send a metric to carbon-c-relay:

echo "foo.bar.baz 1 `date +%s`" | nc localhost 2003

And the server now crashes:


==109== Thread 13:
==109== Invalid read of size 8
==109==    at 0x403AA11: freeaddrinfo (in /lib/ld-musl-x86_64.so.1)
==109==    by 0x4E57A8F: ???
==109==    by 0x493874F: ???
==109==    by 0x49388CF: ???
==109==    by 0x1218CE: server_queuereader (server.c:654)
==109==    by 0x405608A: ??? (in /lib/ld-musl-x86_64.so.1)
==109==  Address 0x28 is not stack'd, malloc'd or (recently) free'd
==109==
==109==
==109== Process terminating with default action of signal 11 (SIGSEGV)
==109==  Access not within mapped region at address 0x28
==109==    at 0x403AA11: freeaddrinfo (in /lib/ld-musl-x86_64.so.1)
==109==    by 0x4E57A8F: ???
==109==    by 0x493874F: ???
==109==    by 0x49388CF: ???
==109==    by 0x1218CE: server_queuereader (server.c:654)
==109==    by 0x405608A: ??? (in /lib/ld-musl-x86_64.so.1)
==109==  If you believe this happened as a result of a stack
==109==  overflow in your program's main thread (unlikely but
==109==  possible), you can try to increase the size of the
==109==  main thread stack using the --main-stacksize= flag.
==109==  The main thread stack size used in this run was 8388608.
Segmentation fault

(By the way, the process would crash anyway with the same error after a while, even without an attempt to relay a received metric.)

grobian commented 1 year ago

I tried resolving the first complaint from valgrind, as well as the second. Key here is that I previously always tested on glibc hosts, where freeaddrinfo(NULL) just returns, while musl (rightly so) assumes the argument is set.
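
To illustrate the glibc/musl difference, here is a minimal sketch of the defensive pattern, not the actual change made in carbon-c-relay. The helper names resolve_numeric and free_addrinfo_checked are hypothetical. The sketch assumes the pointer is reset to NULL whenever resolution fails, and it uses AI_NUMERICHOST only so that the example does not depend on DNS:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper: resolve host:port without DNS (AI_NUMERICHOST).
 * On failure *out is guaranteed to be NULL, so callers can test it. */
int resolve_numeric(const char *host, const char *port, struct addrinfo **out)
{
    struct addrinfo hints;
    int rc;

    assert(out != NULL);
    *out = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;  /* a hostname such as none.example.com fails here */

    rc = getaddrinfo(host, port, &hints, out);
    if (rc != 0)
        *out = NULL;  /* never leave a stale or indeterminate pointer behind */
    return rc;
}

/* Hypothetical helper: only call freeaddrinfo() on a non-NULL list.
 * glibc tolerates freeaddrinfo(NULL); musl dereferences the argument. */
void free_addrinfo_checked(struct addrinfo **ai)
{
    if (ai != NULL && *ai != NULL) {
        freeaddrinfo(*ai);
        *ai = NULL;  /* guards against a double free on a later call */
    }
}
```

A caller would check the return value of resolve_numeric, treat the server as unavailable on failure, and always release the result through free_addrinfo_checked rather than calling freeaddrinfo directly.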

harloprillar commented 1 year ago

I see, so the reason this error appeared in my case was that I used Alpine as the build container, which is musl-based. I tried again with the latest changes. It looks like the issue is now completely resolved: the server keeps running and just logs the unresolvable address: [2023-01-20 10:52:07] failed to resolve none.example.com:80, server unavailable

Thanks!