grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0
380 stars 107 forks source link

3.6 tests fail on Fedora Rawhide #403

Closed ignatenkobrain closed 1 year ago

ignatenkobrain commented 4 years ago

Hello,

I am building latest version of carbon-c-relay in Fedora, but some tests are failing.

basic: relay 1 FAIL
--- basic.payloadout
+++ basic.payloadout
@@ -7,5 +7,4 @@
 bar 1 3
 bar.foo 1 4
 rewrite.foo 1 4
-aggregate.foo.bar 1.000000 349830000
 relay 1: aggregator: dropping incorrect metric: foo.bar 1
metriclimits: relay 1 PASS
buftest.gz: gzip relay 1 PASS
buftest.lz4: lz4 relay 1 PASS
large: relay 1 PASS
dual-gzip: relay 1 relay 2 PASS
large-gzip: failed to setup listener
failed to start relay 1 in /builddir/build/BUILD/carbon-c-relay-3.6/test:
../relay -d -f /tmp/tmp.RR8IvGMoUU/conf -Htest.hostname -s -D -l /tmp/tmp.RR8IvGMoUU/relay-1.out -P /tmp/tmp.RR8IvGMoUU/pidfile-1
=== /tmp/tmp.RR8IvGMoUU/conf ===
# relay 1, mode DUAL
listen type linemode 
    /tmp/tmp.RR8IvGMoUU/sock.3020 proto unix
    ;
cluster default
    file /tmp/tmp.RR8IvGMoUU/data.out
    ;
# contents from large-gzip.stst below this line
listen type linemode transport gzip 127.0.0.1:3020 proto tcp;
match ^through-gzip\. send to default;
=== /tmp/tmp.RR8IvGMoUU/relay-1.out ===
[1981-01-31 23:00:00] (MSG) starting carbon-c-relay v3.6 (2019-10-26), pid=2042166
configuration:
    relay hostname = test.hostname
    workers = 1
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    server connection IO timeout = 600ms
    idle connections disconnect timeout = 10m
    debug = true
    configuration = /tmp/tmp.RR8IvGMoUU/conf
parsed configuration follows:
listen
    type linemode
        /tmp/tmp.RR8IvGMoUU/sock.3020 proto unix
    type linemode transport gzip
        127.0.0.1:3020 proto tcp
    ;
statistics
    submit every 60 seconds
    prefix with carbon.relays.test_hostname
    ;
cluster default
    file
        /tmp/tmp.RR8IvGMoUU/data.out
    ;
match ^through-gzip\.
    send to default
    ;
[1981-01-31 23:00:00] (MSG) listening on UNIX socket /tmp/tmp.RR8IvGMoUU/sock.3020
[1981-01-31 23:00:00] (ERR) failed to bind on tcp4 127.0.0.1 port 3020: Address already in use
dual-ssl: failed to setup listener
failed to start relay 1 in /builddir/build/BUILD/carbon-c-relay-3.6/test:
../relay -d -f /tmp/tmp.FXYGnVnm6u/conf -Htest.hostname -s -D -l /tmp/tmp.FXYGnVnm6u/relay-1.out -P /tmp/tmp.FXYGnVnm6u/pidfile-1 -C dual-ssl.cert
=== /tmp/tmp.FXYGnVnm6u/conf ===
# relay 1, mode DUAL
listen type linemode 
    /tmp/tmp.FXYGnVnm6u/sock.3020 proto unix
    ;
cluster default
    file /tmp/tmp.FXYGnVnm6u/data.out
    ;
# extra arguments given to ../relay:
#    -C dual-ssl.cert
# contents from dual-ssl.stst below this line
# ssl-based test
# to make a cert, here's some very brief instructions:
#  openssl req -new > cert.csr
#  openssl rsa -in privkey.pem -out key.pem
#  openssl x509 -in cert.csr -out cert.pem -req -signkey key.pem -days 1001
#  cat key.pem >> cert.pem
# use the resulting cert.pem as dual.cert
listen type linemode transport plain ssl dual-ssl.cert 127.0.0.1:3020 proto tcp;
match ^through-ssl\. send to default;
=== /tmp/tmp.FXYGnVnm6u/relay-1.out ===
[1981-01-31 23:00:00] (MSG) starting carbon-c-relay v3.6 (2019-10-26), pid=2042173
configuration:
    relay hostname = test.hostname
    workers = 1
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    tls/ssl CA = dual-ssl.cert
    server connection IO timeout = 600ms
    idle connections disconnect timeout = 10m
    debug = true
    configuration = /tmp/tmp.FXYGnVnm6u/conf
parsed configuration follows:
listen
    type linemode
        /tmp/tmp.FXYGnVnm6u/sock.3020 proto unix
    type linemode transport  ssl dual-ssl.cert  # mtime 1572085065.0
        127.0.0.1:3020 proto tcp
    ;
statistics
    submit every 60 seconds
    prefix with carbon.relays.test_hostname
    ;
cluster default
    file
        /tmp/tmp.FXYGnVnm6u/data.out
    ;
match ^through-ssl\.
    send to default
    ;
[1981-01-31 23:00:00] (MSG) listening on UNIX socket /tmp/tmp.FXYGnVnm6u/sock.3020
[1981-01-31 23:00:00] (ERR) failed to bind on tcp4 127.0.0.1 port 3020: Address already in use
large-ssl: failed to setup listener
failed to start relay 1 in /builddir/build/BUILD/carbon-c-relay-3.6/test:
../relay -d -f /tmp/tmp.ckAtjyYsqs/conf -Htest.hostname -s -D -l /tmp/tmp.ckAtjyYsqs/relay-1.out -P /tmp/tmp.ckAtjyYsqs/pidfile-1 -C large-ssl.cert
=== /tmp/tmp.ckAtjyYsqs/conf ===
# relay 1, mode DUAL
listen type linemode 
    /tmp/tmp.ckAtjyYsqs/sock.3020 proto unix
    ;
cluster default
    file /tmp/tmp.ckAtjyYsqs/data.out
    ;
# extra arguments given to ../relay:
#    -C large-ssl.cert
# contents from large-ssl.stst below this line
# ssl-based test
# to make a cert, here's some very brief instructions:
#  openssl req -new > cert.csr
#  openssl rsa -in privkey.pem -out key.pem
#  openssl x509 -in cert.csr -out cert.pem -req -signkey key.pem -days 1001
#  cat key.pem >> cert.pem
# use the resulting cert.pem as dual.cert
listen type linemode transport plain ssl large-ssl.cert 127.0.0.1:3020 proto tcp;
match ^through-ssl\. send to default;
=== /tmp/tmp.ckAtjyYsqs/relay-1.out ===
[1981-01-31 23:00:00] (MSG) starting carbon-c-relay v3.6 (2019-10-26), pid=2042180
configuration:
    relay hostname = test.hostname
    workers = 1
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    tls/ssl CA = large-ssl.cert
    server connection IO timeout = 600ms
    idle connections disconnect timeout = 10m
    debug = true
    configuration = /tmp/tmp.ckAtjyYsqs/conf
parsed configuration follows:
listen
    type linemode
        /tmp/tmp.ckAtjyYsqs/sock.3020 proto unix
    type linemode transport  ssl large-ssl.cert  # mtime 1572085065.0
        127.0.0.1:3020 proto tcp
    ;
statistics
    submit every 60 seconds
    prefix with carbon.relays.test_hostname
    ;
cluster default
    file
        /tmp/tmp.ckAtjyYsqs/data.out
    ;
match ^through-ssl\.
    send to default
    ;
[1981-01-31 23:00:00] (MSG) listening on UNIX socket /tmp/tmp.ckAtjyYsqs/sock.3020
[1981-01-31 23:00:00] (ERR) failed to bind on tcp4 127.0.0.1 port 3020: Address already in use
Ran 35 tests with 4 failing

I have added some netstat + strace, but could not find anything suspicious: https://kojipkgs.fedoraproject.org//work/tasks/309/40920309/build.log

ignatenkobrain commented 4 years ago

I see that it broke first time (on 3.4 version) after these dependencies update: https://koschei.fedoraproject.org/build/6588175

ignatenkobrain commented 4 years ago

@grobian any idea?

grobian commented 4 years ago

Yeah, I think I once meant to probe for a free port to use, but this is likely a TODO. So it uses a port that's probably in use by something else on your env.

You could try simply replacing 3020 with something different in

test/run-test.sh:       local start_server_lastport=3020  # TODO
ignatenkobrain commented 4 years ago

As you can see in the log, I explicitly run ss before to check which ports are in use, and there is no 3020 one.

grobian commented 4 years ago

working theory at the moment is that for some reason the port increment isn't working, so the second server start uses the same port and thus fails

update: scratch that, it reports relay 1 to fail.

ignatenkobrain commented 4 years ago

Interistingly, it works fine if I run rpmbuild on my laptop, but not in mock (something which is setting up chroot and executes rpmbuild inside).

grobian commented 4 years ago

I'm wondering, is there anything specific towards the network stack in this env? e.g. disabled ipv6 or something?

ignatenkobrain commented 4 years ago

There is no networking set up. Though, I did not see before that you should not be able to bind to a port.

grobian commented 4 years ago

Does that mean that any bind() is supposed to fail in the buildenv? The tests try to ensure data is being transferred across unix and TCP sockets.

ignatenkobrain commented 4 years ago

No, you should be able to bind() there without any problems. For example, running python3 -m http.server works just fine in that environment. Also what is interesting, in very same environment tests pass just fine on CentOS7. And if you see link above, it also worked in Fedora until some dependencies update.

https://koschei.fedoraproject.org/build/6588175

grobian commented 4 years ago

Hmmmm, I'm a bit at loss here. Since I just use bind(), it seems to suggest this is glibc related (as only package that I can imagine influencing this, except for the kernel itself). I'm already using SO_REUSEADDR, fwiw.

ignatenkobrain commented 4 years ago

I see that glibc was updated... 2.29.9000-27.fc31 → 2.29.9000-29.fc31

Curious if that could be related.

ignatenkobrain commented 4 years ago

@fweimer any ideas here?

ignatenkobrain commented 4 years ago

If I read it correctly, all list of changes are:

❯ git log --oneline 51ea67d54882318c4fa5394c386f4816ddc22408..21cc130b78a4db9113fb6695e2b951e697662440
21cc130b78 libio: do not attempt to free wide buffers of legacy streams [BZ #24228]
49bc41b642 [powerpc] add 'volatile' to asm
335c1007bf powerpc: Fix static-linked version of __ppc_get_timebase_freq [BZ #24640]
f59a54ab0c nl_AW locale: Correct the negative monetary format (bug 24614).
f0c5a803bd Fix gcc 9 build errors for make xcheck. [BZ #24556]
fabf5e49dd dlfcn: Avoid one-element flexible array in Dl_serinfo [BZ #24166]
2c75b545de elf: Refuse to dlopen PIE objects [BZ #24323]
02d8b5ab1c nl_NL locale: Correct the negative monetary format (bug 24614).
112a0ae18b m68k: Remove vDSO support
dee07df1a4 powerpc: Refactor powerpc64 lround/lroundf/llround/llroundf
2166283fcc powerpc: Refactor powerpc32 lrint/lrintf/llrint/llrintf
78049de0a9 powerpc: refactor powerpc64 lrint/lrintf/llrint/llrintf
48c3c12389 Linux: Fix __glibc_has_include use for <sys/stat.h> and statx
8d141877e0 <sys/cdefs.h>: Inhibit macro expansion for __glibc_has_include
cf27468602 Add IPV6_ROUTER_ALERT_ISOLATE from Linux 5.1 to bits/in.h.
a26e2e9fea Allow memset local PLT reference for powerpc soft-float.
82bc69c012 aarch64: handle STO_AARCH64_VARIANT_PCS
55f82d328d aarch64: add STO_AARCH64_VARIANT_PCS and DT_AARCH64_VARIANT_PCS
1192696069 powerpc: Remove optimized finite
a72186761b math: Use wordsize-64 version for finite
6427a6ac8c powerpc: Remove optimized isinf
a8c590f789 math: Use wordsize-64 version for isinf
2666f96390 powerpc: Remove optimized isnan
197dbda1a1 math: Use wordsize-64 version for isnan
2731a326b1 benchtests: Add isnan/isinf/isfinite benchmark
e41d66e41a powerpc: copysign cleanup
21bd039bb4 powerpc: consolidate rint
cfa611447b libio: freopen of default streams crashes in old programs [BZ #24632]
744e829637 Linux: Deprecate <sys/sysctl.h> and sysctl
5dad6ffbb2 <sys/stat.h>: Use Linux UAPI header for statx if available and useful
4e75c2a43b <sys/cdefs.h>: Add __glibc_has_include macro
680942b016 Improve performance of memmem
5e0a7ecb66 Improve performance of strstr
80b2bfb535 Benchmark strstr hard needles
e6e2424390 Fix malloc tests build with GCC 10.
ignatenkobrain commented 4 years ago

If I read it correctly, all list of changes are:

❯ git log --oneline 51ea67d54882318c4fa5394c386f4816ddc22408..21cc130b78a4db9113fb6695e2b951e697662440
21cc130b78 libio: do not attempt to free wide buffers of legacy streams [BZ #24228]
49bc41b642 [powerpc] add 'volatile' to asm
335c1007bf powerpc: Fix static-linked version of __ppc_get_timebase_freq [BZ #24640]
f59a54ab0c nl_AW locale: Correct the negative monetary format (bug 24614).
f0c5a803bd Fix gcc 9 build errors for make xcheck. [BZ #24556]
fabf5e49dd dlfcn: Avoid one-element flexible array in Dl_serinfo [BZ #24166]
2c75b545de elf: Refuse to dlopen PIE objects [BZ #24323]
02d8b5ab1c nl_NL locale: Correct the negative monetary format (bug 24614).
112a0ae18b m68k: Remove vDSO support
dee07df1a4 powerpc: Refactor powerpc64 lround/lroundf/llround/llroundf
2166283fcc powerpc: Refactor powerpc32 lrint/lrintf/llrint/llrintf
78049de0a9 powerpc: refactor powerpc64 lrint/lrintf/llrint/llrintf
48c3c12389 Linux: Fix __glibc_has_include use for <sys/stat.h> and statx
8d141877e0 <sys/cdefs.h>: Inhibit macro expansion for __glibc_has_include
cf27468602 Add IPV6_ROUTER_ALERT_ISOLATE from Linux 5.1 to bits/in.h.
a26e2e9fea Allow memset local PLT reference for powerpc soft-float.
82bc69c012 aarch64: handle STO_AARCH64_VARIANT_PCS
55f82d328d aarch64: add STO_AARCH64_VARIANT_PCS and DT_AARCH64_VARIANT_PCS
1192696069 powerpc: Remove optimized finite
a72186761b math: Use wordsize-64 version for finite
6427a6ac8c powerpc: Remove optimized isinf
a8c590f789 math: Use wordsize-64 version for isinf
2666f96390 powerpc: Remove optimized isnan
197dbda1a1 math: Use wordsize-64 version for isnan
2731a326b1 benchtests: Add isnan/isinf/isfinite benchmark
e41d66e41a powerpc: copysign cleanup
21bd039bb4 powerpc: consolidate rint
cfa611447b libio: freopen of default streams crashes in old programs [BZ #24632]
744e829637 Linux: Deprecate <sys/sysctl.h> and sysctl
5dad6ffbb2 <sys/stat.h>: Use Linux UAPI header for statx if available and useful
4e75c2a43b <sys/cdefs.h>: Add __glibc_has_include macro
680942b016 Improve performance of memmem
5e0a7ecb66 Improve performance of strstr
80b2bfb535 Benchmark strstr hard needles
e6e2424390 Fix malloc tests build with GCC 10.
ignatenkobrain commented 4 years ago

@grobian while looking into a strace output, it definitely was able to bind to 3020 port.

[pid 1559508] bind(1, {sa_family=AF_INET, sin_port=htons(3020), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

But I don't see where it binds to it again....

fweimer commented 4 years ago

Sorry, I don't know what is going on.

grobian commented 4 years ago
[pid 1559539] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 1
[pid 1559539] setsockopt(1, SOL_SOCKET, SO_RCVTIMEO_OLD, "\0\0\0\0\0\0\0\0 \241\7\0\0\0\0\0", 16) = 0
[pid 1559539] setsockopt(1, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 1559539] bind(1, {sa_family=AF_INET, sin_port=htons(3020), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
grobian commented 4 years ago

I think what could happen is that tests execute too fast after each other or something and for some reason the connection is still lingering around. Oddly enough this shouldn't be an issue, when the connection is closed properly. It could be some tuning setting that explains why on some envs this works, and others it doesn't. Purely hypothetical. I'm thinking the best attack may be to randomise the starting port (in absence of a method to test a port being free), such that this timing issue (theory above) doesn't happen.

bzed commented 4 years ago

I'm seeing the same issue in Debian, on our buildds and also in the CI. So I've added some code to use a random port, but this didn't help. It doesn't fail in docker...

bzed commented 4 years ago
--- basic.payloadout
+++ basic.payloadout
@@ -7,5 +7,4 @@
 bar 1 3
 bar.foo 1 4
 rewrite.foo 1 4
-aggregate.foo.bar 1.000000 349830000
 relay 1: aggregator: dropping incorrect metric: foo.bar 1

That starts to look like a timing issue to me. Basically the aggregate output is missing.

grobian commented 4 years ago

yes, but the amount of time it sleeps ... it should produce the aggregate on a window end

You could try and increase the sleep, I wonder if that helps

AaronBandt commented 1 year ago

I know this is old but I ran into the same issue trying to build 3.7.4 on Rocky 8.6 and I can confirm that increasing the sleep in the test script from 2 to 4 seconds does indeed allow the aggregation test to pass.

grobian commented 1 year ago

Ok, let's just up that sleep then.