uber-archive / statsrelay

A consistent-hashing relay for statsd and carbon metrics

DNS doesn't resolve after "Connection timed out" #66

Closed mschurenko closed 7 years ago

mschurenko commented 7 years ago

We had to replace an instance of statsite (we run in EC2), which resulted in it getting a new IP. It seems that statsrelay does not re-resolve DNS after it gets a "Connection timed out" error. Re-resolving would be useful, as we also had to bounce all of our statsrelays in order to replace the failed statsite instance.

Here is what we had in syslog after we replaced the statsite instance:

Dec  7 17:29:12 statsrelay1 docker/statsrelay1[7311]: ERROR: tcpclient[statsite5/9600/tcp]: Error from recv: Connection timed out

Thanks

JeremyGrosser commented 7 years ago

I need a bit more information to reproduce this problem...

mschurenko commented 7 years ago

Hi,

We are using version 1.6.8 and compiling it ourselves. None of those other messages appear in our logs.

Here is some more context about the "Connection timed out" error:

Dec  7 17:29:12 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: Error from recv: Connection timed out
Dec  7 17:29:12 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition CONNECTED -> BACKOFF
Dec  7 17:29:14 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition BACKOFF -> INIT
Dec  7 17:29:14 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition INIT -> CONNECTING
Dec  7 17:29:16 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: Connection timeout
Dec  7 17:29:16 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition CONNECTING -> BACKOFF
Dec  7 17:29:18 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition BACKOFF -> INIT
Dec  7 17:29:18 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition INIT -> CONNECTING
Dec  7 17:29:20 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: Connection timeout

JeremyGrosser commented 7 years ago

If you're not seeing "Error resolving backend address", then the call to getaddrinfo in libc is succeeding and returning the stale address. statsrelay's code does call getaddrinfo every time it enters the CONNECTING state.
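To check what libc hands back independently of statsrelay, a small resolver test can help. The following is a minimal sketch, not statsrelay's actual code: `resolve_backend` is a hypothetical helper name, and it only assumes the standard POSIX `getaddrinfo`/`getnameinfo` calls. It performs a fresh lookup on every call and returns the first address as text, which is roughly what one would expect a per-connect lookup to do.

```c
#define _POSIX_C_SOURCE 200112L

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not taken from statsrelay): resolve host/port with a
 * fresh getaddrinfo() call and write the first returned address into `out`
 * in numeric text form.
 * Returns 0 on success, or a non-zero getaddrinfo/getnameinfo error code. */
int resolve_backend(const char *host, const char *port,
                    char *out, size_t outlen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP, matching the tcpclient log lines */

    int rc = getaddrinfo(host, port, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "resolve %s: %s\n", host, gai_strerror(rc));
        return rc;
    }

    /* Convert the first result to a numeric string so it can be logged. */
    rc = getnameinfo(res->ai_addr, res->ai_addrlen,
                     out, (socklen_t)outlen, NULL, 0, NI_NUMERICHOST);
    freeaddrinfo(res);
    return rc;
}
```

Calling this in a loop from a long-running process (for example once per reconnect attempt, with the backend hostname and port 9600) and logging `out` each time would show whether the address changes after the DNS record is updated, or whether a stale answer is being served somewhere below the application.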

I'm guessing that you're running nscd on your systems, which adds caching to libc and is outside the control of statsrelay. You can disable it by stopping that daemon (which may have negative consequences for other things) or you can manually flush the cache as needed with the nscd --invalidate=hosts command.

mschurenko commented 7 years ago

We're not running nscd. We have other services using the same OS that do pick up DNS changes by calling getaddrinfo. To get statsrelay to pick up the new DNS change, all I did was restart the process. It is running in a docker container, so I ran:

docker restart statsrelay

JeremyGrosser commented 7 years ago

I think this patch should work... https://github.com/uber/statsrelay/pull/67

You'll need to add always_resolve_dns: true to your config.

mschurenko commented 7 years ago

Thanks!!!