grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

High cpu_system with many open connections #216

Open chbuescher opened 8 years ago

chbuescher commented 8 years ago

Hi again, in the meantime - see issue #126 - we have about 6,500 open connections and 2.5 million metrics every 15 seconds. The CPU usage increased to 240% with 8 workers (with 16 workers it went over 300%). Since we are planning another rollout, I looked for a way to handle even more connections and found libevent, a library for handling a large number of open connections. After promising tests with a simple libevent-based TCP server, I modified carbon-c-relay to use libevent.

I now have a working version based on version 2.2 that, after running flawlessly in our test environment for one week, we now run in production. The CPU usage in production went down to 60%; in the test environment (2,900 connections) it went from 50% to 20%. In my tests the libevent-based relay scales linearly up to 15,000 connections when testing with 100 metrics per connection and an interval of 15 seconds.

What you lose is a bit of control: there is no throttling of fast event sources. And we don't use aggregation, so we haven't tested that. Up to roughly 1,000 connections your solution seems to be a bit more efficient. Are you interested in my solution? Diff, python stress script ...

grobian commented 8 years ago

It sounds interesting to say the least. I'd like to see what you did, it sounds like it is an improvement nonetheless!

chbuescher commented 8 years ago

relay-libevent.zip

In the zip file you can find the diff against version 2.2 / commit 2e76c18c3d7387ee689ee233ec4c65fa68eee303 and the files to stress test the relay: a dummy TCP socket server listening on port 5555, the stresser script, which opens many sockets to the graphite relay, and the test configuration for the relay.

Most changes are in dispatcher.c; the remaining changes follow from the interface changes in the dispatcher functions. We still use RHEL6 on our servers, and the libevent 1.4.x library isn't thread-safe, so all libevent work must be done in one thread, worker 0: the callbacks dispatch_accept_cb_xxx for new connections, the callback dispatch_connread_cb for state changes on a client connection, and the timer callback dispatch_timer_cb for registering/unregistering the listener sockets.

I have moved the reading of socket data and the pre-parsing of the read data - searching for the last newline character - into dispatch_connread_cb, to ensure new data is read quickly and most of the buffer is freed as soon as possible. The new data is then copied into a new struct called metric_block and inserted into the todo queue of one of the workers. All connections are handled inside libevent, so I removed the connections list completely, and the UDP connections are now handled by the listeners list. The CPU time spent in the worker 0 libevent callbacks is measured by the new counter listener_wallTime_us, but I can't measure libevent's internal loop overhead.
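
A minimal, compilable sketch of that single-thread design, assuming a libevent 1.4-style API. This is not the actual patch: the metric_block layout, the hand_off() step and the hard-coded port are placeholders for the fork's real per-worker todo queue and listener setup.

#define _GNU_SOURCE                      /* for memrchr on glibc */
#include <event.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct metric_block {                    /* complete lines handed to a worker */
    size_t len;
    char data[];
};

static void hand_off(struct metric_block *mb) {
    /* placeholder: the real code pushes mb onto a worker's todo queue */
    fprintf(stderr, "queued %zu bytes\n", mb->len);
    free(mb);
}

static void dispatch_connread_cb(int fd, short what, void *arg) {
    char buf[8192];
    ssize_t n = read(fd, buf, sizeof(buf));
    (void)what;
    if (n <= 0) {                        /* EOF or error: drop the connection */
        event_del((struct event *)arg);
        free(arg);
        close(fd);
        return;
    }
    /* split at the LAST newline: the complete part goes to a worker,
     * the tail would stay in a per-connection buffer (omitted here) */
    char *last = memrchr(buf, '\n', (size_t)n);
    if (last != NULL) {
        size_t len = (size_t)(last - buf) + 1;
        struct metric_block *mb = malloc(sizeof(*mb) + len);
        mb->len = len;
        memcpy(mb->data, buf, len);
        hand_off(mb);
    }
}

static void dispatch_accept_cb(int fd, short what, void *arg) {
    int client = accept(fd, NULL, NULL);
    (void)what; (void)arg;
    if (client < 0)
        return;
    struct event *ev = malloc(sizeof(*ev));
    event_set(ev, client, EV_READ | EV_PERSIST, dispatch_connread_cb, ev);
    event_add(ev, NULL);                 /* all events live in worker 0's loop */
}

int main(void) {
    struct sockaddr_in sin;
    int lsnr = socket(AF_INET, SOCK_STREAM, 0);

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(2003);          /* placeholder listener port */
    bind(lsnr, (struct sockaddr *)&sin, sizeof(sin));
    listen(lsnr, 128);

    event_init();                        /* one event base, owned by one thread */
    struct event lev;
    event_set(&lev, lsnr, EV_READ | EV_PERSIST, dispatch_accept_cb, NULL);
    event_add(&lev, NULL);
    event_dispatch();                    /* worker 0's event loop */
    return 0;
}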

Civil commented 8 years ago

To overcome the issues with older libevent (thread-safety) you might want to add a build-time configuration check and, in case libevent < 2.0 is found, statically link carbon-c-relay against a bundled version.

chbuescher commented 8 years ago

I handle all the libevent work in only one thread, and I have validated it with Helgrind and DRD, so I don't expect any thread-safety issues.

Civil commented 8 years ago

> I handle all the libevent work in only one thread, and I have validated it with Helgrind and DRD, so I don't expect any thread-safety issues.

Yeah, I only meant that some distros (Ubuntu 14.04+, Debian stable) ship libevent 2.0, and newer versions of libevent are thread-safe, so you could keep the code multi-threaded.

chbuescher commented 7 years ago

Found a segmentation fault when using aggregations: dispatch_initlisteners() must be called before the invocation of router_readconfig() in relay.c

Civil commented 6 years ago

I just wanted to say that we ended up rebasing @chbuescher's patchset on top of v2.6 (the last one that was stable for us) and it helped solve one of our performance problems with one of the clusters.

Briefly about the problem: one of the metric producers started sending twice as much data over a single connection (we are talking about 30 connections per relay, and bulk sends of 2-3M points every 10 seconds from each box). They then realized that instead of 2 seconds it now took 6 minutes to send the data out.

With this patch (well, to be honest, I've just cherry-picked all the commits from the fork, so it also includes the GNU Regexp -> PCRE change, but according to perf top that's not related) it now takes 4 seconds to accept and relay the data.

chbuescher commented 6 years ago

PCRE only helps if you use matches (blackhole, aggregate, ...). Recently we enabled the PCRE JIT compiler, simply by passing the PCRE_STUDY_JIT_COMPILE option to pcre_study (in router.c). The worker threads now use 50% less CPU time with our configuration.
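
For reference, a minimal sketch of what enabling the JIT looks like with the classic PCRE (PCRE1) API; this is not the exact router.c change, just the one-flag pcre_study call mentioned above, with a placeholder pattern and subject.

#include <pcre.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *err;
    int erroff;
    int ov[30];
    const char *subject = "carbon.relays.host.metricsReceived";

    pcre *re = pcre_compile("^carbon\\.relays\\.", 0, &err, &erroff, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile failed at %d: %s\n", erroff, err);
        return 1;
    }
    /* the one-flag change discussed above: ask pcre_study for a
     * JIT-compiled version of the pattern */
    pcre_extra *extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &err);

    int rc = pcre_exec(re, extra, subject, (int)strlen(subject),
            0, 0, ov, 30);
    printf("match: %s\n", rc >= 0 ? "yes" : "no");

    if (extra != NULL)
        pcre_free_study(extra);
    pcre_free(re);
    return 0;
}

(Build with -lpcre; the JIT only actually kicks in if the PCRE library itself was built with JIT support.)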

Civil commented 6 years ago

PCRE was just nice to have, that's it. It hasn't helped our case much.

Though I think that while porting the patchset to 2.6 I accidentally reverted some fixes for 'send statistics', because it triggered one of the bugs we had while using the 2.3-2.5 releases. But I'll investigate that later.

grobian commented 6 years ago

Send statistics has a bug?

Civil commented 6 years ago

In the version that @chbuescher patched. For some reason it resurrects one of the old bugs where the statistics output is malformed.

chbuescher commented 6 years ago

I have updated my fork of your project to version 2.6 and enabled the PCRE JIT compiler. It works fine in our preproduction environment. But version 2.6 had a bug with send statistics and blanks ...

Civil commented 6 years ago

@chbuescher well, we haven't seen this bug in our production environment.

I've tried to run your fork; it segfaults on shutdown with the following backtrace:

=================================================================
==89509==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x00010bc97923 bp 0x70000791b6d0 sp 0x70000791b6c0 T1)
==89509==The signal is caused by a READ memory access.
==89509==Hint: address points to the zero page.
    #0 0x10bc97922 in event_del (libevent-2.1.6.dylib:x86_64+0xf922)
    #1 0x10bbfbefd in dispatch_timer_cb dispatcher.c:624
    #2 0x10bc99fc9 in event_process_active_single_queue (libevent-2.1.6.dylib:x86_64+0x11fc9)
    #3 0x10bc96cce in event_base_loop (libevent-2.1.6.dylib:x86_64+0xecce)
    #4 0x10bbfabbe in dispatch_runner dispatcher.c:694
    #5 0x7fffe50de93a in _pthread_body (libsystem_pthread.dylib:x86_64+0x393a)
    #6 0x7fffe50de886 in _pthread_start (libsystem_pthread.dylib:x86_64+0x3886)
    #7 0x7fffe50de08c in thread_start (libsystem_pthread.dylib:x86_64+0x308c)

==89509==Register values:
rax = 0xbebebebebebebebe  rbx = 0x000060f0000016d0  rcx = 0x000060f0000016c8  rdx = 0x00001c1e000002d9
rdi = 0x000060f0000016d0  rsi = 0x00000b0000000c00  rbp = 0x000070000791b6d0  rsp = 0x000070000791b6c0
 r8 = 0x000000010bc597b8   r9 = 0x0000000000dcaae3  r10 = 0x000000011b3d7110  r11 = 0x000000010bc97910
r12 = 0x0000000000000000  r13 = 0x0000613000000668  r14 = 0x00000000ffffffff  r15 = 0x0000617000000001
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (libevent-2.1.6.dylib:x86_64+0xf922) in event_del
Thread T1 created by T0 here:
    #0 0x10bd97716 in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x50716)
    #1 0x10bbf9e65 in dispatch_new dispatcher.c:768
    #2 0x10bbf994c in dispatch_new_listener dispatcher.c:787
    #3 0x10bbe7a6d in main relay.c:757
    #4 0x7fffe4ec5234 in start (libdyld.dylib:x86_64+0x5234)

==89509==ABORTING

(As I'm running tests on my MacBook, I've patched it a little bit to compile on OS X: OS X doesn't have memrchr, so I've copied the memrchr implementation from FreeBSD and included it when APPLE is defined.)
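
A hypothetical fallback along those lines, guarded here with __APPLE__ as an assumption about the macro used; this is not the actual code copied from FreeBSD, just a sketch of the same compat shim.

#include <stddef.h>

#ifdef __APPLE__
/* last occurrence of c in the first n bytes of s, NULL if absent */
static void *compat_memrchr(const void *s, int c, size_t n) {
    const unsigned char *p = (const unsigned char *)s + n;

    while (n-- > 0) {
        if (*--p == (unsigned char)c)
            return (void *)p;
    }
    return NULL;
}
#define memrchr compat_memrchr
#endif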

And I still see malformed stats:

carbon.relays.C02PK283FVH6.destinations.127_11_111_11:2003.dropped 0 15130956carbon.relays.C02PK283FVH6.destinations.127_11_111_12:2003.dropped 0 1513095634
<...>
carbon.relays.carbon.relays.C02PK283FVH6.destinations.127_11_11_13:2003.dropped 0 1513095634

It's been there since the first commit, and I think there is something wrong with the dispatcher logic for large batches. I haven't found the problem yet, and it's reproducible in our pre-prod environment on normal Linux.

Civil commented 6 years ago

That bug with malformed sends doesn't happen if I use memchr instead of memrchr.

EDIT: no it hasn't; it just delayed the time before the problem appeared again.

Civil commented 6 years ago

The way conn->buff is filled allows it to contain incomplete metrics at this point, but only when the batch is large.

I'm able to reproduce that by specifying 300 servers (split into 10 clusters) in the config; then carbon-c-relay's own stats are larger than buff and metrics get appended to each other.

chbuescher commented 6 years ago

Using memchr would have a performance impact. The idea is that worker 0, the thread that does all the libevent/socket work, does only minimal parsing of the read data (splitting it at the last separator character) and puts the data on a queue for consumption by the other worker threads. Using memchr would split the data at every separator and put each line on the queue individually.
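
A tiny illustration of that trade-off, with a hypothetical enqueue() standing in for the worker todo queue: memrchr hands everything up to the last newline to a worker as one block, while a memchr loop would enqueue every line separately.

#define _GNU_SOURCE                       /* memrchr is a GNU extension */
#include <stdio.h>
#include <string.h>

/* stand-in for pushing a block onto a worker's todo queue */
static void enqueue(const char *data, size_t len) {
    printf("enqueued %zu bytes\n", len);
    (void)data;
}

/* worker 0's cheap path: one hand-off per read */
static void split_once(const char *buf, size_t n) {
    const char *last = memrchr(buf, '\n', n);

    if (last != NULL)
        enqueue(buf, (size_t)(last - buf) + 1);
}

/* the expensive variant: one hand-off per line */
static void split_per_line(const char *buf, size_t n) {
    const char *nl;

    while (n > 0 && (nl = memchr(buf, '\n', n)) != NULL) {
        enqueue(buf, (size_t)(nl - buf) + 1);
        n -= (size_t)(nl - buf) + 1;
        buf = nl + 1;
    }
}

int main(void) {
    const char chunk[] = "a.b 1 1513095634\nc.d 2 1513095634\npartial";

    split_once(chunk, sizeof(chunk) - 1);      /* one block, two lines */
    split_per_line(chunk, sizeof(chunk) - 1);  /* two separate items */
    return 0;
}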

chbuescher commented 6 years ago

The statistics collector sends its metrics through a pipe to the dispatcher. Unfortunately there seems to be more than one open connection for this pipe, and therefore several buffers, so dispatch_connread_cb does not append the new metrics to the right buffer.

Civil commented 6 years ago

I think it's writing data to that pipe in one thread, but there are some problems with how you store partial metrics.

chbuescher commented 6 years ago

I'm pretty sure that partial-data handling works fine with TCP connections. I've committed some patches to my fork so that the collector and the aggregator send their metrics via a TCP connection over localhost. Please give it a try.
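
A rough sketch of that idea, with the port and the metric line as placeholders rather than the fork's actual code: the internal collector simply connects to the relay's own listener on localhost and writes its stats there, so they travel the same code path as any other client.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in sin;
    char line[256];
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(2003);                 /* the relay's own listener */
    inet_pton(AF_INET, "127.0.0.1", &sin.sin_addr);
    if (fd < 0 || connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("connect");
        return 1;
    }
    /* internal stats now arrive like metrics from any other client */
    snprintf(line, sizeof(line),
            "carbon.relays.example.metricsReceived 42 %ld\n",
            (long)time(NULL));
    write(fd, line, strlen(line));
    close(fd);
    return 0;
}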

Civil commented 6 years ago

Yeah, sending over TCP seems to fix the problem

RudW0lf commented 6 years ago

Hello. I have the same issue with the original carbon-c-relay, and the forked carbon-c-relay solved it. Our agent sends data in one large batch (2-3M metrics every 10-20 seconds). The non-patched version dropped a huge part of the metrics (see attached screenshot).

Will these patches be included in the main branch?

grobian commented 6 years ago

They have diverged quite a bit from the main branch, which makes merging them somewhat non-trivial. The intention, however, is to include them, or the idea behind them, in the future.

azhiltsov commented 6 years ago

We have faced an issue where, under high load / a high event rate, it corrupts metrics. Normally we have around 170K points per second per relay, but during the burst we had around 290K points per second per relay. We are using a set of 6 relays to handle this load. The configuration consists of 4 replicas with 24 servers in each (4x24 total). Corruption happens only on 3 of them; the first replica in the configuration file is not affected. Metrics get corrupted by garbage prepended to the metric name (see attached screenshot). The config looks like:

cluster a
 jump_fnv1a_ch replication 1
 24 nodes
cluster b
 jump_fnv1a_ch replication 1
 24 nodes
cluster c
 jump_fnv1a_ch replication 1
 24 nodes
cluster d
 jump_fnv1a_ch replication 1
 24 nodes

match general
 send to
  a
  b
  c
  d
stop;

chbuescher commented 6 years ago

We never used replication based on the carbon or FNV1a hashes, so we haven't seen that error yet. But I've made an optimization to the search for the correct position in the hash ring. Could you please test whether the error also shows up with the unmodified consistent-hash.c from version/tag v2.6?
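
For context, a generic sketch of the kind of lookup being optimised here, not the fork's consistent-hash.c: finding a metric's position on a sorted hash ring with a binary search for the first entry whose hash is greater than or equal to the metric's hash, wrapping to entry 0 past the end. The struct layout and values are illustrative only.

#include <stddef.h>
#include <stdio.h>

struct ring_entry {
    unsigned int hash;          /* position on the ring, sorted ascending */
    void *server;               /* destination that owns this position */
};

/* first entry whose hash >= metric_hash, wrapping to entry 0 past the end */
static size_t ring_lookup(const struct ring_entry *ring, size_t n,
        unsigned int metric_hash) {
    size_t lo = 0, hi = n;      /* search window [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ring[mid].hash < metric_hash)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == n ? 0 : lo;
}

int main(void) {
    struct ring_entry ring[] = {
        { 100, NULL }, { 4000, NULL }, { 90000, NULL }
    };

    printf("entry %zu\n", ring_lookup(ring, 3, 5000));  /* prints 2 */
    return 0;
}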

chbuescher commented 6 years ago

Sorry, didn't want to close this...