python-diamond / Diamond

Diamond is a Python daemon that collects system metrics and publishes them to Graphite (and others). It can collect CPU, memory, network, I/O, load and disk metrics. It also provides an API for implementing custom collectors that gather metrics from almost any source.
http://diamond.readthedocs.org/
MIT License

GraphitePickleHandler drops metrics when carbon-relay is behind ELB #28

Closed by monkey1016 7 years ago

monkey1016 commented 9 years ago

This is similar to the closed issue BrightcoveOS/Diamond#458, but it looks like the fixes for that issue didn't work for the GraphitePickleHandler. I haven't had a chance to dive deep, but the gist is that when using the GraphitePickleHandler with a carbon-relay behind an ELB, we get the following errors every 3-5 minutes:

[2014-12-03 00:18:17,462] [Thread-1] GraphiteHandler: Socket error, trying reconnect.
[2014-12-03 00:18:17,462] [Thread-1] GraphiteHandler: Setting socket keepalives...

However, metrics seem to be dropped more often than those errors appear. We've set the idle connection timeout on the ELB to 5 minutes, and we send metrics every minute. With everything else unchanged, when we switch to the GraphiteHandler (the line handler), metrics are submitted just fine and we no longer see that error. These are the settings we used for both handlers:

# Socket timeout (seconds)
timeout = 15

# Batch size for pickled metrics
batch = 256

keepalive = 1

max_backlog_multiplier = 10

trim_backlog_multiplier = 2
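
For context on why the two handlers might behave differently, here is a minimal sketch of how a batch is framed for carbon's pickle receiver: a list of `(path, (timestamp, value))` tuples is pickled and prefixed with a 4-byte big-endian length header. The function name `frame_pickle_batch` and the metric paths are hypothetical, and this is not Diamond's actual code. It only illustrates that a pickled batch travels as a single framed message, so a connection that drops mid-send could plausibly cost the whole batch rather than one line.

```python
import pickle
import struct


def frame_pickle_batch(metrics):
    """Frame a list of (path, (timestamp, value)) tuples for a pickle receiver.

    Hypothetical helper for illustration; not taken from Diamond.
    """
    # Protocol 2 is the commonly used choice for this kind of payload.
    payload = pickle.dumps(metrics, protocol=2)
    # 4-byte unsigned big-endian length header, followed by the payload.
    header = struct.pack("!L", len(payload))
    return header + payload


# Hypothetical metric paths and values.
batch = [
    ("servers.web01.cpu.total.user", (1417565897, 12.5)),
    ("servers.web01.loadavg.01", (1417565897, 0.42)),
]
message = frame_pickle_batch(batch)
```

With `batch = 256`, up to 256 metrics would share one such message, whereas the line handler sends each metric as its own newline-terminated record.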

I'm still trying to get more information about this problem, so if there are any suggestions about what I should check, please let me know. Thanks.
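
Since the log mentions socket keepalives, one thing that could be checked is which keepalive timings are actually applied to the socket relative to the 5-minute ELB idle timeout. The sketch below shows one way to enable TCP keepalives in Python. The function name `enable_keepalive` and the timing values are assumptions for illustration, not Diamond's settings, and this thread does not establish whether the ELB treats keepalive probes as activity.

```python
import socket


def enable_keepalive(sock, idle_seconds=60, interval_seconds=10, probes=3):
    """Turn on TCP keepalives for a socket.

    Hypothetical helper; the default timings are illustrative only.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options below are platform-specific (available on
    # Linux), so each one is applied only if this platform exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# The options can be set before the socket is connected.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Reading the options back with `getsockopt` on the live connection would confirm whether the idle time before the first probe is shorter than the load balancer's timeout.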

shortdudey123 commented 7 years ago

@monkey1016 is this still occurring? If so, do you have the additional info that you mentioned you could get?

monkey1016 commented 7 years ago

To be honest @shortdudey123, I haven't looked at this issue in a long time. I think it's safe to close out.