grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0
380 stars, 107 forks

Corrupted (strange, out of band) values delivered to graphite #320

Closed: zdenekpizl closed 6 years ago

zdenekpizl commented 7 years ago

Hello,

It has happened multiple times during the last few weeks that metric data stored in Graphite's whisper data files was broken. For example, www requests are normally in the hundreds, but suddenly one or more datapoints are in the millions: ([{"target": "collectd.datacenter-01.www02.apache-apache.apache_requests", "datapoints": [ [122.282847, 1508372760], [64512609.0, 1508372820], [null, 1508372880], [122.549544, 1508372940], [135.06596500000001, 1508373000], ...]]}])
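A spike like the one above can be located mechanically in the render API output. The following is only a minimal sketch, not part of the original report: the function name `find_spikes` and the threshold `factor=100.0` are assumptions chosen for illustration, and the sample contains only the five datapoints quoted above.

```python
import json
from statistics import median


def find_spikes(series, factor=100.0):
    """Return (value, timestamp) pairs whose value exceeds factor * median.

    `series` is one element of a Graphite render JSON response.
    The median of the non-null values serves as a rough baseline;
    `factor` is an arbitrary threshold picked for this illustration.
    """
    values = [v for v, _ in series["datapoints"] if v is not None]
    if not values:
        return []
    baseline = median(values)
    if baseline <= 0:
        return []
    return [
        (v, ts)
        for v, ts in series["datapoints"]
        if v is not None and v > factor * baseline
    ]


# The datapoints quoted in the report (the elided tail is left out).
sample = json.loads("""
[{"target": "collectd.datacenter-01.www02.apache-apache.apache_requests",
  "datapoints": [[122.282847, 1508372760], [64512609.0, 1508372820],
                 [null, 1508372880], [122.549544, 1508372940],
                 [135.06596500000001, 1508373000]]}]
""")

for series in sample:
    for value, ts in find_spikes(series):
        print(series["target"], ts, value)
```

A median baseline is used rather than a mean because a single value in the millions would otherwise drag the baseline up with it.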

This behaviour is observed across multiple nodes, with different plugins (both collectd's own and third-party ones) and at different times. The only common factor is that we're using carbon-c-relay as the transport mechanism for the metrics collected locally on each instance.

Has anybody encountered a similar issue? Thank you, regards, Zdenek Pizl.

grobian commented 7 years ago

what version of carbon-c-relay do you have in use?

zdenekpizl commented 7 years ago

Packages installed:

grobian commented 7 years ago

I'm assuming you're referring to "64512609.0" in this case, right?

zdenekpizl commented 7 years ago

Exactly, it is something which cannot happen :) It would be nice to be able to handle 64 million requests per second on a single node, but as you can see, it is nonsense.

grobian commented 7 years ago

and there is no aggregation of some sort in place? (relay doesn't "interpret" data, unless it does aggregations)

zdenekpizl commented 7 years ago

I remember there was an issue where carbon-c-relay somehow mangled metric names. Could this have the same or a similar root cause?

zdenekpizl commented 7 years ago

There is definitely no aggregation in carbon-c-relay. For the metrics where I'm observing this issue there is also no aggregation at Graphite's carbon-aggregator level (in other words, no carbon-aggregator is used).

grobian commented 7 years ago

I can't rule anything like that out, but hard evidence that it's in the c-relay would be nice. How often does this happen?

zdenekpizl commented 7 years ago

It happens pretty often, but I cannot find any pattern or rule for when it happens.

Could you advise how to prove whether it is carbon-c-relay related?

grobian commented 7 years ago

a tcpdump showing a correct metric going in, and corrupted going out is pretty much rock-solid proof it's on c-relay's end

grobian commented 7 years ago

(that also helps me to see what kind of corruption happened, gives a lead on how to fix it)
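The comparison described above can be sketched in a few lines once the payloads have been extracted from the capture. This is only an illustration under stated assumptions: the traffic uses Graphite's plaintext protocol (`<path> <value> <timestamp>` per line), the inbound and outbound payloads are already available as lists of text lines, and the relay does not rewrite metric names. The helper names `parse_line` and `diff_streams` and the sample lines are hypothetical.

```python
def parse_line(line):
    """Parse one plaintext-protocol line: '<path> <value> <timestamp>'."""
    path, value, timestamp = line.split()
    return (path, int(timestamp)), float(value)


def diff_streams(incoming, outgoing):
    """Return (path, timestamp, value_in, value_out) for every datapoint
    whose value leaving the relay differs from the value that entered it.

    Datapoints are matched on (path, timestamp); outgoing datapoints with
    no matching incoming line are ignored.
    """
    seen = dict(parse_line(line) for line in incoming if line.strip())
    mismatches = []
    for line in outgoing:
        if not line.strip():
            continue
        key, value_out = parse_line(line)
        value_in = seen.get(key)
        if value_in is not None and value_in != value_out:
            mismatches.append((key[0], key[1], value_in, value_out))
    return mismatches


# Hypothetical payloads, made up for illustration only.
incoming = [
    "host.apache.requests 122.28 1508372760",
    "host.apache.requests 122.5 1508372820",
]
outgoing = [
    "host.apache.requests 122.28 1508372760",
    "host.apache.requests 64512609.0 1508372820",
]

for path, ts, v_in, v_out in diff_streams(incoming, outgoing):
    print(f"{path} @ {ts}: in={v_in} out={v_out}")
```

An empty result for a time window that contains a corrupted datapoint in Graphite would point away from the relay, since the bad value would then already be present on the inbound side.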

zdenekpizl commented 7 years ago

OK, I will try to dump traffic and go through it. Thanks in advance.

grobian commented 7 years ago

sorry, it's the only thing to start the search with :(

Farfaday commented 7 years ago

Could maybe be related to https://github.com/collectd/collectd/issues/2209 ?

I still had that issue until updating to collectd 5.7.2. Btw, collectd also has its own packages repo: https://collectd.org/download.shtml#debian

zdenekpizl commented 7 years ago

@Farfaday - it looks related, yes. Maybe the problem is directly in collectd's write plugin infrastructure. I will first try upgrading to collectd 5.7.2 and observe its behaviour. Thanks.

grobian commented 7 years ago

Cool, let me know, thanks for all the help!

zdenekpizl commented 6 years ago

I was able to prove that the issue is apparently NOT in carbon-c-relay. Debugging what arrives at carbon-c-relay has shown that the corrupted values are sent by collectd's write_graphite plugin.

As we're using collectd-5.7.1, we are going to upgrade to 5.7.2, which promises to have at least this bug fixed.

Thanks to all of you for helping me find the root cause.

grobian commented 6 years ago

awesome, thanks!