grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

2.6+ Round-Robin DNS Entry forward any_of causes hung connections #262

Closed g2bg closed 7 years ago

g2bg commented 7 years ago

On 2.5, the relay would pick one IP from the DNS entry to forward to and stick with it. In 2.6 and 3.0, it connects to one of the IPs from the DNS entry, sends metrics, and then leaves the connection open. This leads to an ever-increasing connection count until it runs out of file descriptors.

So the two-pronged request would be:

  1. Fix the connection close issue
  2. In any_of, load all IPs from a multi-IP DNS resolution
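
To make request 2 concrete, here is a minimal sketch of listing every address behind a round-robin name with the standard getaddrinfo() call. It is not carbon-c-relay's code: the helper name resolve_all and the 64-byte address buffer are assumptions made for this example only.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define ADDR_STRLEN 64 /* assumed size, large enough for a numeric IPv6 address */

/* Hypothetical helper (not part of carbon-c-relay): resolve host:port and
 * store up to max numeric addresses in out. Returns the number of
 * addresses stored, or -1 if resolution fails. */
int resolve_all(const char *host, const char *port,
                char out[][ADDR_STRLEN], int max)
{
    struct addrinfo hints, *res, *ai;
    int n = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* accept both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* one result per address, TCP only */

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* Walk the whole result list instead of stopping at the first entry. */
    for (ai = res; ai != NULL && n < max; ai = ai->ai_next) {
        if (getnameinfo(ai->ai_addr, ai->ai_addrlen, out[n], ADDR_STRLEN,
                        NULL, 0, NI_NUMERICHOST) == 0)
            n++;
    }
    freeaddrinfo(res);
    return n;
}
```

Calling resolve_all("multiipdnsentry", "2003", addrs, 16) would, under these assumptions, yield one string per A/AAAA record, and each could then be treated as a separate any_of member.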

[2017-04-21 17:26:56] (MSG) starting carbon-c-relay v3.0 (98424a-dirty), pid=39656
configuration:
    relay hostname = server_name
    listen port = 2003
    listen interface = 127.0.0.1
    workers = 4
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    server connection IO timeout = 600ms
    debug = true
    configuration = /etc/carbon-c-relay.conf

parsed configuration follows:
statistics
    submit every 60 seconds
    prefix with carbon.relays.server_name
    ;

cluster local_carbon
    any_of
        multiipdnsentry:2003
    ;

match *
    send to local_carbon
    stop
    ;

[2017-04-21 17:26:56] (MSG) listening on tcp4 127.0.0.1 port 2003
[2017-04-21 17:26:56] (MSG) listening on udp4 127.0.0.1 port 2003
[2017-04-21 17:26:56] (MSG) listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[2017-04-21 17:26:56] (MSG) starting 4 workers
[2017-04-21 17:26:56] (MSG) starting statistics collector
[2017-04-21 17:26:56] (MSG) starting servers

/usr/bin/carbon-c-relay -P /var/run/carbon-c-relay/carbon-c-relay.pid -D -p 2003 -i 127.0.0.1 -w 4 -b 2500 -q 25000 -l /var/log/carbon-c-relay/carbon-c-relay.log -s -f /etc/carbon-c-relay.conf

grobian commented 7 years ago

For 1, I'm still searching; so far I have no clue why you see that behaviour. For 2, I'm not sure I understand your request. Could it be that you want useall behaviour? It expands all IP addresses instead of re-resolving on each connection attempt.

grobian commented 7 years ago

As for 1. I think it's an inverted logic bug.

grobian commented 7 years ago

Due to a bug, connections would never switch to another IP, though memory would be leaked.

g2bg commented 7 years ago

Validated that the inverted logic fix now rotates IPs. Unfortunately, I'm still seeing the connection count increase. I can easily replicate this for debugging: within 5 min it has built up 100 or so established connections to the other relays. This happens on both CentOS 6 and 7.

As for 2, my thought was that if a round-robin DNS name is used in an any_of entry, we could pull all the IPs from the DNS entry and add them as individual members. That way it would split the traffic between them but would tolerate a node being down. This would allow centralized management of the endpoints instead of listing all the IPs in the remote relays.

grobian commented 7 years ago

Ok, so there's something going on there.

What you describe in 2 seems exactly like the useall feature to me. Try this:

cluster local_carbon
    any_of useall
        multiipdnsentry:2003
    ;

From what you tell me, in this mode the relay should not build up a large pile of connections. If this is indeed the case, it should narrow the search somewhat, but I'm curious...

g2bg commented 7 years ago

Agreed it looks like useall is what I was looking for there. Thank you so much.

As for the other issue, that only made it considerably worse: after 1 min I had 240 established connections. I have a build set up that I can add debugging code to in order to track this down; I'm just not quite sure where to put it at the moment.

grobian commented 7 years ago

The idea is that a connection is made, and reused when there are metrics to write within a certain timeout (something like 10s off the top of my head). It should absolutely NOT open a new connection for each time it tries to write.
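
A minimal sketch of that reuse-until-idle idea follows. It is an illustration only, not the relay's actual code: the struct conn layout, the function names, and passing the timeout as a parameter are all assumptions.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical connection state; not carbon-c-relay's actual struct. */
struct conn {
    int fd;            /* -1 when no connection is open */
    time_t last_write; /* time of the most recent successful write */
};

/* A new connect() is only needed when nothing is open; an open
 * connection is reused for every subsequent write. */
bool conn_needs_connect(const struct conn *c)
{
    return c->fd < 0;
}

/* An open connection should be closed once no write has happened for
 * at least idle_timeout seconds. */
bool conn_is_idle(const struct conn *c, time_t now, int idle_timeout)
{
    return c->fd >= 0 && (now - c->last_write) >= idle_timeout;
}
```

In this sketch the sender checks conn_needs_connect() before each write, and a periodic check closes the socket and sets fd back to -1 when conn_is_idle() returns true. The growing connection count described above is what one would expect if that close step never ran.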

grobian commented 7 years ago

I just did a simple test to verify the disconnect behaviour, and it seems to trigger (the timeout is 3 seconds). Can you tell me a bit about how many addresses your multiipdnsentry resolves to, and how much data is flowing towards the relay? If you use the stats, how many connections are made to the relay (nonNegativeDerivative(carbon.relays.host.connections)), and what are the other relays? Are they also c-relays, or different software?

g2bg commented 7 years ago

Sure, this is a per-host relay-to-relay setup. The client host is set up as above and has collectd and other application metrics pumping in on a 10s interval. I believe the last count was around 2k metrics or so. The destinations, 4 of them, run carbon-c-relay (I've tried this with 2.5, 2.6 and 3.0, no difference) and talk to the multitude of carbon-cache instances. They are currently running around 1m/s metrics on 3.0. Watching the behavior of 2.5, it opens and closes connections on an interval. In 2.6/3.0 it opens the connections on an interval as well, but never closes the old ones. This only happens with the round-robin DNS entry, though; if I put all 4 destination IPs in, there's no issue.

Interesting side note: if I set the DNS entry with useall, it expands to the IPs as members in the log when it outputs the config. This setup has the connection issue, though. If I copy and paste that expanded config and use it directly, there are no problems.

Is there a way to adjust the disconnect timeout?

grobian commented 7 years ago

think I found the problem

grobian commented 7 years ago

If you could try latest master, that would be awesome. If it solves the problem for you, I'll release v3.1 shortly to fix this screwup.

g2bg commented 7 years ago

CentOS 6 fails to make from master:

bison -d conffile.y
conffile.y:36.20-30: syntax error, unexpected {...}
make: *** [conffile.tab.c] Error 1

CentOS 7 completes. With an any_of useall and round-robin DNS, it still expands the members in the config printed to the log. It goes in order, connecting to only 1 IP. If the first fails, it goes to the second. Connections do not grow with this config.

Example: with 10.1.1.3:2003 10.1.1.2:2003 10.1.1.1:2003 10.1.1.4:2003, it will always choose 10.1.1.3 unless it's unavailable. Is this expected behavior for an any_of useall? If I specify all 4 IPs in the config, it connects to all 4.

grobian commented 7 years ago

You can touch conffile.tab.* and conffile.yy.* to avoid running bison, because git doesn't store mtimes :(

I haven't found a way to work around this yet.

I'll look into why useall doesn't connect to others.

grobian commented 7 years ago

hah, use_all never updates the configuration, so the router thinks there's only one entry.

grobian commented 7 years ago

hmmm, test mode shows all entries would get used ...

grobian commented 7 years ago

I've not been able to reproduce the behaviour where it will pick the first node. That actually is the behaviour of a failover cluster. Not that I don't trust your observations, but are you sure you use any_of useall in this case, and you see no distribution of metrics over all of the expanded hosts?

grobian commented 7 years ago

I think I found a reason/cause for the behaviour you see.

grobian commented 7 years ago

I think I've fixed this, if not, please reopen.