hashicorp / go-metrics

A Golang library for exporting performance and runtime metrics to external metrics systems (e.g. statsite, statsd)
MIT License

Statsd telemetry doesn't recover from statsd outage #39

Open johnrengelman opened 8 years ago

johnrengelman commented 8 years ago

From: https://github.com/hashicorp/vault/issues/1932

It appears that go-metrics doesn't handle a disconnect of the statsd server, particularly if the address changes.

We are running a telegraf agent with a statsd listener and configuring vault to send data to a linked container with a hostname. When the linked container is restarted (generally getting a new IP address), we stop receiving statsd metrics from vault.

armon commented 8 years ago

The basic issue is that UDP provides no feedback. If statsd dies or its IP is relocated, we keep sending fire-and-forget packets without ever getting an error. Statsite, by contrast, runs over TCP, so we get an error and redial. The only potential workaround would be to periodically assume the connection is dead and redo the DNS lookup. I'm not sure there is any other robust mechanism given the lack of feedback.

johnrengelman commented 8 years ago

According to http://serverfault.com/a/416269, if the server side of a UDP socket goes away, a subsequent write on the client should fail with an error (Destination Unreachable) triggered by an ICMP packet. Testing locally with nc confirms this: establish a connection, terminate the server, and try writing on the client. Packet sniffing shows the ICMP packet, and the nc client exits.

I'll have to do some testing next week to see if I'm seeing the same behavior between Vault and statsd, or if I'm somehow dropping the ICMP packet on the network.

armon commented 8 years ago

@johnrengelman That's true! But ICMP is not necessarily reliable. It can be disabled or blocked by firewalls, and like UDP it is fire-and-forget, so it can simply be dropped. Delivery is best-effort, and the UDP protocol makes no guarantee!