Carbon Relay - long lived connection to target

genericgithubuser commented 8 years ago

We have a pool of graphite relays running in our datacenter. They receive client metrics and forward them to a relay pool we have running in AWS (behind an ELB). The internal relays have the DNS name of the ELB in their carbon.conf. The AWS relays in turn forward the metrics to our AWS-based cache servers.

The problem I'm seeing is this. At times the IP addresses of the ELB will change. This is of course reflected in the DNS return values. But it seems that after the carbon relay does the initial DNS lookup, it continues to forward to the IP it initially resolved. Even when that IP is no longer responding properly, the relay does not seem to do a new lookup.
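For illustration, the behavior being asked for (re-resolve the name periodically and notice when the answer changes) could look roughly like the sketch below. This is not carbon's code; `ResolvedTarget` and `refresh` are hypothetical names, and the resolver is passed in as a parameter only so the logic can be exercised without real DNS.

```python
import socket


class ResolvedTarget:
    """Hypothetical helper: tracks the IPs a hostname currently resolves to."""

    def __init__(self, host, port, resolver=socket.getaddrinfo):
        self.host = host
        self.port = port
        self.resolver = resolver  # injectable so tests can fake DNS answers
        self.ips = set()          # last known set of resolved addresses

    def refresh(self):
        """Re-resolve the hostname; return True if the address set changed."""
        try:
            infos = self.resolver(self.host, self.port, type=socket.SOCK_STREAM)
        except socket.gaierror:
            # Lookup failed: keep the last known addresses rather than drop them.
            return False
        # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address for both IPv4 and IPv6.
        new_ips = {info[4][0] for info in infos}
        changed = new_ips != self.ips
        self.ips = new_ips
        return changed
```

A caller would invoke `refresh()` on a timer, or after a connection failure, and reconnect to one of `ips` whenever it returns `True`.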

Have others run into this, or is there a workaround other than regularly restarting the relay? I had scheduled restarts of the relay service, but it seems it was not fully flushing its queued metrics on restart, so there would be a chunk of lost datapoints every X time when the restart ran. I'd like to avoid dropping metrics if at all possible.

Running 0.9.15 across all components.

obfuscurity commented 8 years ago

I would suggest a proxy that can detect these changes and update its internal resolution table. I believe HAProxy can do this, but I haven't needed to use it for this before. I would love to know if that works for you.

genericgithubuser commented 8 years ago

Ah, of course. Yeah, haproxy 1.6 and up supports this. It covers the base requirement of seeing the IP fail and switching. I plan to test this out (will let you know how things go), but I'll close this ticket out since it might be a bit.

obfuscurity commented 8 years ago

Awesome, good luck! Looking forward to an exhaustive report. :wink:

graphite-project / carbon

Carbon Relay - long lived connection to target #608