grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

3.1 produces lots of close_wait sockets with unread bytes in their receive queue #281

Closed mwtzzz-zz closed 7 years ago

mwtzzz-zz commented 7 years ago

I'm testing 3.1 on one of my relay hosts. The first thing I noticed is that the number of CLOSE_WAIT sockets with unread bytes in Recv-Q climbs steadily until about 3,000, where it plateaus; I assume it has hit some system-imposed limit at that point:

tcp      116      0 172.17.25.160:2001     172.17.29.171:26172     CLOSE_WAIT  13084/relay
tcp        1      0 172.17.25.160:2001     172.17.29.171:25234     CLOSE_WAIT  13084/relay

You'll see in the above example that there are 116 unread bytes in one such socket, and 1 unread byte in another. As a result, it is impossible to gracefully terminate the relay process; the only way to stop it is with kill -9.

I do not see this behavior on my hosts running carbon-c-relay-1.11. On those hosts, there are no lingering CLOSE_WAIT connections, everything in the receive queue gets processed.
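
For reference, I'm watching these with something like the following (an equivalent ss invocation such as "ss -tnp state close-wait" shows the same thing):

netstat -tnp | grep CLOSE_WAIT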

grobian commented 7 years ago

I've never seen this behaviour before. These are incoming connections? (i.e. you're running the relay on port 2001?) Nothing off the top of my head would explain why the relay wouldn't read data; it closes when it finds EOF, unless it times out while reading, in which case it disconnects. What kind of clients are these?

mwtzzz-zz commented 7 years ago

These are incoming connections (the relay listens on port 2001). The clients are various Linux hosts in EC2 running our applications. The clients connect to the relay via an ELB. They run a mix of collectd and a custom graphite client that basically netcats metrics to the ELB.

I downloaded the source and compiled it with make; I didn't give it any special options. For the first 30 minutes or so, the throughput is about 1/8 that of the relays running 1.11, then it drops to zero.

I might try playing around with different 2.x versions and see if they show the same behavior. Other than that, I'm not sure what could be going on.

mwtzzz-zz commented 7 years ago

Update: I compiled v2.2 and ran it without the -U option. Still seeing the close_wait problem.

mwtzzz-zz commented 7 years ago

Update #2: I compiled v2.1 and it runs fine, no problems.

So something changed between 2.1 and 2.2 that causes this issue on our systems.

deniszh commented 7 years ago

@mwtzzz: sorry for the intrusion, but you can use git bisect to easily find the exact commit that causes the issue.
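
Roughly, assuming v2.1 is the newest version known to be good and your current build is bad:

git bisect start
git bisect bad             # the version you are on shows the problem
git bisect good v2.1       # last version known to be good
# build and test the revision git checks out, mark it with
# "git bisect good" or "git bisect bad", and repeat until git
# reports the first bad commit
git bisect reset           # when done, return to your original checkout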

mwtzzz-zz commented 7 years ago

@deniszh I'm using git bisect, but I'm immediately running into a bison error when running make:

[mmartinez@ec2- radar112 ~]$ git clone https://github.com/grobian/carbon-c-relay.git carbon-c-relay
Cloning into 'carbon-c-relay'...
remote: Counting objects: 4257, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4257 (delta 1), reused 1 (delta 0), pack-reused 4253
Receiving objects: 100% (4257/4257), 1.87 MiB | 0 bytes/s, done.
Resolving deltas: 100% (2913/2913), done.
Checking connectivity... done.
[mmartinez@ec2- radar112 ~]$ cd carbon-c-relay
[mmartinez@ec2- radar112 carbon-c-relay]$ git bisect start
[mmartinez@ec2- radar112 carbon-c-relay]$ git bisect bad
[mmartinez@ec2- radar112 carbon-c-relay]$ git bisect good v2.1
Bisecting: 160 revisions left to test after this (roughly 7 steps)
[bbbd6ed920f2b435fafa48e6595f5939e60dddc8] conffile: implemented include

[mmartinez@ec2- radar112 carbon-c-relay]$ make
cc -O2 -Wall -Wshadow -DGIT_VERSION=\"bbbd6e\" -pthread -c -o relay.o relay.c
cc -O2 -Wall -Wshadow -DGIT_VERSION=\"bbbd6e\" -pthread -c -o md5.o md5.c
cc -O2 -Wall -Wshadow -DGIT_VERSION=\"bbbd6e\" -pthread -c -o consistent-hash.o consistent-hash.c
cc -O2 -Wall -Wshadow -DGIT_VERSION=\"bbbd6e\" -pthread -c -o receptor.o receptor.c
cc -O2 -Wall -Wshadow -DGIT_VERSION=\"bbbd6e\" -pthread -c -o dispatcher.o dispatcher.c
bison -d conffile.y
conffile.y:35.20-30: error: syntax error, unexpected {...}
make: *** [conffile.tab.c] Error 1
[mmartinez@ec2- radar112 carbon-c-relay]$

cbowman0 commented 7 years ago

I believe you need bison version 3.

grobian commented 7 years ago

You can touch the produced files; I checked them into the repo for exactly this reason. Touching them makes make consider them up to date, so it won't try to regenerate them with your local bison/autotools.

touch conffile.yy.c conffile.tab.c conffile.tab.h
touch configure.ac Makefile.am aclocal.m4 configure Makefile.in config.h.in

This should work (I do the same for the Travis runs).
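
In other words, at each bisect step something like this (how you run and test the relay is up to you):

touch conffile.yy.c conffile.tab.c conffile.tab.h
touch configure.ac Makefile.am aclocal.m4 configure Makefile.in config.h.in
make
# run ./relay against your config, watch for CLOSE_WAIT build-up, then mark it:
git bisect good    # or: git bisect bad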

Thanks for trying to find the culprit!

mwtzzz-zz commented 7 years ago

@grobian Thanks, the suggestion worked. Currently running through git bisect...

mwtzzz-zz commented 7 years ago

Ok, here's what I've narrowed it down to:

[mmartinez@ec2 radar112 carbon-c-relay]$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[e12b412e263905c552826c4aa3855c92be7a6be7] aggregator_expire: run entire invocation loop under lock
[mmartinez@ec2 radar112 carbon-c-relay]$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[c3a4341837a90d774147115ff0116a13d614bdfb] dispatcher: move struct init before thread forking
[mmartinez@ec2 radar112 carbon-c-relay]$ git bisect good
e12b412e263905c552826c4aa3855c92be7a6be7 is the first bad commit
commit e12b412e263905c552826c4aa3855c92be7a6be7
Author: Fabian Groffen <grobian@gentoo.org>
Date:   Sun Sep 11 10:29:22 2016 +0200

    aggregator_expire: run entire invocation loop under lock

    Access to the invocation buckets happens concurrently, so we need to
    lock down the entire loop to make it safe.  A better strategy for
    aggregations is necessary.

:100644 100644 ede3a59c0f35769546f65c632272e1531133987f 60b890d6516591d4a28b7a4cea175a702e9c317a M      aggregator.c
grobian commented 7 years ago

So that aggregator_expire commit is the first one to produce the unread sockets? Are you using aggregations in your configs?

mwtzzz-zz commented 7 years ago

Yes to both your questions. The aggregator_expire commit is the first one to produce unread sockets, and yep we are using aggregations.

grobian commented 7 years ago

Ok, I think I might know what direction to search for. It may be solved by PR #274.

mwtzzz-zz commented 7 years ago

@grobian That's good to hear. Let me know when you're ready for me to test.

grobian commented 7 years ago

I'd be interested to know if applying the patch from PR #274 solves the close_wait sockets problem.

mwtzzz-zz commented 7 years ago

Testing now ...

mwtzzz-zz commented 7 years ago

I put the new aggregator.c into v3.1, compiled it, and ran it. It looks much better. I still see a single lingering CLOSE_WAIT with a single unread byte:

tcp CLOSE-WAIT 1 0 172.17.25.160:48207 172.17.24.203:2001

But it's not interfering with anything and I can still gracefully stop and start the service. So far, it looks like your changes have fixed the issue. I'll let it run today and keep an eye on it.

grobian commented 7 years ago

ok, that's good to hear

mwtzzz-zz commented 7 years ago

Ran it all night, it's working fine. I'm going to roll it out to production. Thanks for working on this issue!

mwtzzz-zz commented 7 years ago

I rolled it out to all our production clusters yesterday and it's working great. To give you an idea of our throughput, we're writing about 15 million metrics/minute. We've got a cluster of 10 i3.xlarge instances running only the relay; each of these hosts is pushing just over 1 GB of network throughput. We've got a backend cluster of 12 i3.2xlarge instances running the relay + carbon-cache. I've got various network stack settings tuned on the relay layer and run the relay with -B 4096 -U 16777216.

mwtzzz-zz commented 7 years ago

@grobian Out of curiosity, do you know of other companies that are writing a volume of metrics similar to (or greater than) ours?

grobian commented 7 years ago

http://events.linuxfoundation.org/sites/events/files/slides/booking-graphite-atscale-linuxconeur2k16.pdf

At Booking.com they say they push 1 million metrics/second (thus 60 million/minute). In a later slide they even mention 2 million/second; the 8 million figure comes from having DR (x2) and replication=2 (another x2), so x4 in total.

cldellow commented 7 years ago

Another datapoint: we (sortable.com; one of my coworkers is the person who put together #274) are doing 2-3 million metrics/minute to 1 carbon-c-relay on a c4.2xlarge, which then forwards to 2 i3.xlarges running go-carbon. We aggregate heavily, though, so only ~400k metrics/minute leave the carbon-c-relay instance.

mwtzzz-zz commented 7 years ago

@cldellow Our 15 million/minute is spread across 10 carbon-c-relays and 12 backends, so per relay your number looks to be about twice ours. What kind of tuning, if any, have you done on your c4.2xlarge? My bottleneck right now is not the relay layer but the backend.

cldellow commented 7 years ago

No tuning that I can recall. It sounds like your backends are much busier than ours, so I don't think we'd have anything useful to say there unfortunately :(

szibis commented 7 years ago

We do more than 30 million/minute from our top-level carbon-c-relays (6 c4.xlarge instances), which send hashed and replicated (factor 2) traffic to 5 i2.4xlarge instances running go-carbon.

The top relays thus produce more than 60 million metrics/minute towards the carbon backends.

Inside the go-carbon instances:

We do only some medium aggregation; that data is matched and sent from the top carbon-c-relays to one c4.xlarge aggregator, with a second one as failover. From these aggregators the data goes to the same 5-node go-carbon cluster.

mwtzzz-zz commented 7 years ago

@szibis What kind of tuning have you done on the 6 carbon-c-relay instances? I was running into a network bottleneck (dropped metrics, dropped packets) with 8 i3.xlarge; I had to add two instances, bringing it to a total of ten, to alleviate the bottleneck.

What's the tenancy attribute of your 6 carbon-c-relay instances?

szibis commented 7 years ago

@mwtzzz Mostly high batch sizes. Each go-carbon instance takes about 20 MB/s of traffic, which is below the AWS instance limits for any instance type I use.

mwtzzz-zz commented 7 years ago

@szibis What batch sizes are you using? Are you specifying it with -B? I'm currently running:

relay -q 400000 -B 4096 -U 16777216

szibis commented 7 years ago

/usr/bin/relay -p 2013 -w 32 -b 40000 -q 30000000 -B 32 -T 1000 -f /etc/carbon-c-relay/relay.conf
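
Roughly what those flags do, as I read them from relay -h (so the batch size here is the -b value, not -B; please double-check against your relay version):

-p 2013                              port to listen on
-w 32                                number of worker (dispatcher) threads
-b 40000                             server send batch size
-q 30000000                          per-server queue size
-B 32                                connection listen backlog
-T 1000                              IO timeout in milliseconds towards the servers
-f /etc/carbon-c-relay/relay.conf    router configuration file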

And the go-carbon instances serving as data stores are heavily tuned to be able to take all that traffic smoothly.

mwtzzz-zz commented 7 years ago

Thanks, I'm going to try out those settings and see if they make a difference.

mwtzzz-zz commented 7 years ago

@szibis by the way, how are you getting 32 cores on a c4.xlarge? This instance type only has four cores:

[salt-master2 ~]$ salt-call grains.get instance_type; nproc --all
local:
    c4.xlarge
4