Closed by mschurenko 7 years ago
I need a bit more information to reproduce this problem...
Hi,
We are using version 1.6.8 and compiling it ourselves. None of those other messages appear in our logs.
Here is some more context about the "Connection timed out" error:
Dec 7 17:29:12 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: Error from recv: Connection timed out
Dec 7 17:29:12 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition CONNECTED -> BACKOFF
Dec 7 17:29:14 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition BACKOFF -> INIT
Dec 7 17:29:14 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition INIT -> CONNECTING
Dec 7 17:29:16 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: Connection timeout
Dec 7 17:29:16 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition CONNECTING -> BACKOFF
Dec 7 17:29:18 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition BACKOFF -> INIT
Dec 7 17:29:18 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: State transition INIT -> CONNECTING
Dec 7 17:29:20 pipes-prod-statsdflinger-leo-c0949ad8 statsrelay: tcpclient[pipes-prod-statsite-leo-5.pipes.aws.company.net/9600/tcp]: Connection timeout
If you're not seeing "Error resolving backend address", then the call to getaddrinfo in libc is succeeding and returning the stale address. statsrelay's code does call getaddrinfo every time it enters the CONNECTING state.
I'm guessing that you're running nscd on your systems, which adds caching to libc and is outside the control of statsrelay. You can disable it by stopping that daemon (which may have negative consequences for other things), or you can manually flush the hosts cache as needed with the nscd --invalidate=hosts command.
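One way to check what the resolver is handing back on the affected host is to call getaddrinfo directly and compare the result with the old and new IPs. The sketch below is a minimal diagnostic under that assumption; it uses Python's socket.getaddrinfo, which wraps the platform's getaddrinfo, and resolve_backend is a hypothetical helper name, not part of statsrelay.

```python
import socket


def resolve_backend(host, port):
    """Return the sorted, de-duplicated addresses getaddrinfo reports for host:port over TCP."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Calling resolve_backend with the statsite hostname and port 9600 on the statsrelay host, before and after the instance is replaced, would show whether the resolver there still returns the stale address.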
We're not running nscd. We have other services using the same OS that do pick up DNS changes by calling getaddrinfo. To get statsrelay to pick up the new DNS change all I did was restart the process. It is running in a docker container so I did
docker restart statsrelay
I think this patch should work... https://github.com/uber/statsrelay/pull/67
You'll need to add always_resolve_dns: true to your config.
Thanks!!!
We had to replace an instance of statsite (we run in EC2), which resulted in it getting a new IP. It seems that statsrelay does not re-resolve DNS after it gets a "Connection timed out" error. Re-resolving would be useful, since we had to bounce all of our statsrelays in order to replace the failed statsite instance.
Here is what we had in syslog after replacing the statsite instance:
Thanks