Also, no new rules were added; the only change to the config was extra servers in an upstream section. SIGHUP caused carbon-c-relay (v2.1 plus the commits that fix SIGHUP) to exit, and after that it started consuming twice the amount of CPU and crashing constantly with SIGSEGV.
I'll try to fix the TSAN errors and will open an extra PR for them. The current one should only fix the crash itself.
And there is also a read-after-free during shutdown:
[2016-07-01 12:00:57] shutting down...
[2016-07-01 12:00:57] closed listeners for port 2003
[2016-07-01 12:00:57] stopped collector
[2016-07-01 12:00:58] stopped aggregator
=================================================================
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==ERROR: AddressSanitizer: heap-use-after-free on address 0x60600000ebb8 at pc 0x00000053dba3 bp 0x7fb5dbae8f70 sp 0x7fb5dbae8f68
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
==20613==AddressSanitizer: while reporting a bug found another one. Ignoring.
READ of size 8 at 0x60600000ebb8 thread T8
#0 0x53dba2 in aggregator_putmetric /usr/src/carbon-c-relay/aggregator.c:205:13
#1 0x52bb9e in router_route_intern /usr/src/carbon-c-relay/router.c:2899:7
#2 0x529b5b in router_route /usr/src/carbon-c-relay/router.c:2980:8
#3 0x503ed9 in dispatch_connection /usr/src/carbon-c-relay/dispatcher.c:404:25
#4 0x502b97 in dispatch_runner /usr/src/carbon-c-relay/dispatcher.c:585:13
#5 0x7fb5e413e493 in start_thread (/lib64/libpthread.so.0+0x7493)
#6 0x7fb5e3a645dc in clone (/lib64/libc.so.6+0xe95dc)
0x60600000ebb8 is located 24 bytes inside of 64-byte region [0x60600000eba0,0x60600000ebe0)
freed by thread T23 here:
#0 0x4c1210 in free (/usr/src/carbon-c-relay/relay+0x4c1210)
#1 0x5422a9 in aggregator_expire /usr/src/carbon-c-relay/aggregator.c:609:3
#2 0x7fb5e413e493 in start_thread (/lib64/libpthread.so.0+0x7493)
previously allocated by thread T0 here:
#0 0x4c1518 in __interceptor_malloc (/usr/src/carbon-c-relay/relay+0x4c1518)
#1 0x53c9f3 in aggregator_new /usr/src/carbon-c-relay/aggregator.c:53:20
#2 0x5147f8 in router_readconfig /usr/src/carbon-c-relay/router.c:1312:29
#3 0x4f4525 in main /usr/src/carbon-c-relay/relay.c:630:13
#4 0x7fb5e399b61f in __libc_start_main (/lib64/libc.so.6+0x2061f)
Thread T8 created by T0 here:
#0 0x42c129 in __interceptor_pthread_create (/usr/src/carbon-c-relay/relay+0x42c129)
#1 0x5014b9 in dispatch_new /usr/src/carbon-c-relay/dispatcher.c:625:6
#2 0x501547 in dispatch_new_connection /usr/src/carbon-c-relay/dispatcher.c:654:9
#3 0x4f5032 in main /usr/src/carbon-c-relay/relay.c:722:21
#4 0x7fb5e399b61f in __libc_start_main (/lib64/libc.so.6+0x2061f)
Thread T23 created by T0 here:
#0 0x42c129 in __interceptor_pthread_create (/usr/src/carbon-c-relay/relay+0x42c129)
#1 0x53facf in aggregator_start /usr/src/carbon-c-relay/aggregator.c:655:6
#2 0x4f52c5 in main /usr/src/carbon-c-relay/relay.c:751:8
#3 0x7fb5e399b61f in __libc_start_main (/lib64/libc.so.6+0x2061f)
SUMMARY: AddressSanitizer: heap-use-after-free /usr/src/carbon-c-relay/aggregator.c:205:13 in aggregator_putmetric
Shadow bytes around the buggy address:
0x0c0c7fff9d20: fa fa fa fa fa fa fa fa fa fa fa fa fd fd fd fd
0x0c0c7fff9d30: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
0x0c0c7fff9d40: fa fa fa fa fd fd fd fd fd fd fd fa fa fa fa fa
0x0c0c7fff9d50: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd
0x0c0c7fff9d60: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
=>0x0c0c7fff9d70: fa fa fa fa fd fd fd[fd]fd fd fd fd fa fa fa fa
0x0c0c7fff9d80: 00 00 00 00 00 00 00 00 fa fa fa fa fd fd fd fd
0x0c0c7fff9d90: fd fd fd fd fa fa fa fa 00 00 00 00 00 00 00 fa
0x0c0c7fff9da0: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
0x0c0c7fff9db0: 00 00 00 00 00 00 02 fa fa fa fa fa 00 00 00 00
0x0c0c7fff9dc0: 00 00 00 01 fa fa fa fa 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Heap right redzone: fb
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack partial redzone: f4
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==20613==ABORTING
It seems that this is related to the number of incoming connections: the more we have, the less stable everything is.
So it might be related to some of the races in the dispatcher.
Which race condition is that? Isn't it fixed with your patch now?
The crash is fixed, but the aggregator is not sending any data out except for its own metrics.
It seems that there are two separate problems: the crash itself and the aggregator not sending any data.
The second problem is still there. On one host I've worked around it by placing yet another local carbon-c-relay that just forwards all the data to the aggregator (and moving the aggregator to another port).
Well, is the thing under load? The problem is that there is a sweet spot where the load of metrics causes all dispatchers to wait for the lock (which you just made bigger) and basically not do anything any more, because the whole thing gets horribly behind with everything it does.
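In a minimal sketch (hypothetical names, not the relay's actual dispatcher code), the shape of that failure is a single lock that every dispatcher has to take per metric; past some arrival rate the queue on it only grows:

```c
/* Illustrative sketch of the contention "sweet spot"; handle_metric and
 * agg_lock are hypothetical names, not the relay's actual code. */
#include <pthread.h>

static pthread_mutex_t agg_lock = PTHREAD_MUTEX_INITIALIZER;

static void handle_metric(const char *metric)
{
    pthread_mutex_lock(&agg_lock);   /* every dispatcher serializes here */
    /* ... match rules, update aggregation buckets for `metric` ... */
    pthread_mutex_unlock(&agg_lock); /* once metrics arrive faster than the
                                      * critical section drains them, all
                                      * threads just wait and the relay
                                      * falls further and further behind */
}
```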
That happens under load, right. Though the load is very different, and when I grab a backtrace I don't see anything waiting on locks; top 10 according to pprof:
16461 31.7% 31.7% 18869 36.3% fnmatch ??:0
6770 13.0% 44.7% 6770 13.0% aggregator_expire /usr/src/carbon-c-relay/aggregator.c:404
5213 10.0% 54.7% 5213 10.0% pthread_setcanceltype ??:0
2763 5.3% 60.1% 2763 5.3% aggregator_expire /usr/src/carbon-c-relay/aggregator.c:403
1577 3.0% 63.1% 1577 3.0% aggregator_expire /usr/src/carbon-c-relay/aggregator.c:402
1553 3.0% 66.1% 1573 3.0% __h_errno ??:0
1073 2.1% 68.1% 1073 2.1% aggregator_expire /usr/src/carbon-c-relay/aggregator.c:399
970 1.9% 70.0% 970 1.9% dispatch_connection /usr/src/carbon-c-relay/dispatcher.c:387
790 1.5% 71.5% 790 1.5% dispatch_connection /usr/src/carbon-c-relay/dispatcher.c:438
671 1.3% 72.8% 671 1.3% regexec ??:0
I can try to get a gdb backtrace in that case, but it also seems legit to me.
Also, the number of received metrics hasn't changed, but the number of destinations and sources has.
More destinations mean that more work is being done by the dispatchers (sending to more targets). In aggregator.c it's the expiry thread doing expiration on the collected buckets; what's the CPU usage of the relay when it does this?
around 150% when it's not stuck, and 400% when it's stuck.
Also, sometimes a restart can fix the problem.
And anyway, it should send the data, at least some of it. If it's not sending because of load, then there is a deadlock somewhere, and that's a bug.
so the expiry thread could possibly be cpu bound
In that case I should actually see some stats from the aggregator showing that it has expired everything, for example. And also, why does a restart help?
it will actually pile things up, because it can't expire in time, and once that happens metrics may be dropped at the front door because there are not enough buckets to hold the future
a restart helps in such a scenario because it clears all aggregations
I've seen this happen with dynamic aggregates where the invocations slowly increase; at the point where expiry no longer makes it in time to release buckets, things go haywire.
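A hypothetical sketch of that exhaustion (the struct and function names are illustrative, not the relay's actual data structures): each aggregate covers "the future" with a fixed set of buckets, and once expiry lags, finished buckets are never released and re-armed, so newer timestamps find nowhere to land:

```c
/* Illustrative sketch of bucket exhaustion; not the relay's actual code. */
#include <stddef.h>
#include <time.h>

struct bucket {
    time_t start;   /* start of the interval this bucket covers */
    double sum;     /* running aggregate for that interval */
    size_t cnt;     /* number of samples collected so far */
};

/* The aggregate owns a fixed ring of buckets. When the expiry thread
 * lags, old buckets are never released, so a metric whose timestamp
 * falls past the last covered interval has no bucket to land in and is
 * dropped at the front door. */
static struct bucket *find_bucket(struct bucket *ring, size_t nbuckets,
                                  time_t ts, time_t interval)
{
    size_t i;
    for (i = 0; i < nbuckets; i++)
        if (ts >= ring[i].start && ts < ring[i].start + interval)
            return &ring[i];
    return NULL;  /* no bucket covers ts: the metric is dropped */
}
```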
Yeah, that makes sense. I'll try to shift some load from it then.
that's what I ended up doing yeah
Well, it seems to be fixed. Except that the TSAN errors should still be addressed, because all of them are errors that can potentially produce corrupted data in the aggregator.
If you want, I can open a separate bug about the TSAN errors and the ASAN overflow during exit.
please do
In some rare cases (I don't have a 100% reliable way to reproduce this) there are 2 bugs that happen simultaneously:
Here's the report from ASAN:
That's because of a race condition that happens here: https://github.com/grobian/carbon-c-relay/blob/master/aggregator.c#L240-L245
Thread 1 unlocks the mutex on line 244; thread 2 (expire) acquires it and frees the invocation variable that thread 1 is currently using, and only after that does thread 1 get the lock back.
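A self-contained sketch of that window (all names here are made up for illustration; only the unlock-then-relock pattern mirrors the linked lines, and I'm assuming the rwlock implied by the write-lock remark below):

```c
/* Illustrative reconstruction of the race around aggregator.c#L240-L245;
 * the names are hypothetical, the unlock/relock window is the point. */
#include <pthread.h>
#include <stdlib.h>

struct invocation { double sum; };

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static struct invocation *inv;

/* thread 1: an aggregator_putmetric-like path */
static void putmetric(double val)
{
    pthread_rwlock_rdlock(&lock);
    struct invocation *i = inv;    /* pointer obtained under the lock */
    pthread_rwlock_unlock(&lock);  /* ~line 244: the window opens here */
    /* if the expire thread runs now, it can free what `i` points to */
    pthread_rwlock_wrlock(&lock);  /* ~line 245: re-locked too late */
    i->sum += val;                 /* heap-use-after-free, as in the ASAN report */
    pthread_rwlock_unlock(&lock);
}

/* thread 2: an aggregator_expire-like path */
static void expire(void)
{
    pthread_rwlock_wrlock(&lock);
    free(inv);                     /* frees the invocation thread 1 still uses */
    inv = NULL;
    pthread_rwlock_unlock(&lock);
}
```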
Also, a report from TSAN (it might also be useful to get an insight into what's going on): https://gist.github.com/Civil/e4aaffb76be7e216cc4f259e162b70b9
But even if I change the lock here https://github.com/grobian/carbon-c-relay/blob/master/aggregator.c#L240 to a write-lock (and remove the unlock-lock from lines 244-245), it still fails to send any data out (but at least it doesn't segfault).
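For reference, that change looks roughly like this in terms of the sketch above (not the actual patch): hold the write lock across the whole update instead of the unlock/relock dance, which closes the free window at the cost of serializing every writer:

```c
/* Sketch of the attempted workaround, reusing `lock` and `inv` from the
 * sketch above; not the actual patch. Holding the write lock across the
 * whole update means expire() can no longer free the invocation
 * mid-operation, so the SIGSEGV disappears, but all putmetric callers
 * now serialize behind one lock (and data still isn't sent out). */
static void putmetric_locked(double val)
{
    pthread_rwlock_wrlock(&lock);  /* was: rdlock + unlock/wrlock window */
    if (inv != NULL)
        inv->sum += val;           /* safe: expire() can't free inv here */
    pthread_rwlock_unlock(&lock);
}
```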