grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

periodically log about non-ideal conditions #176

Open Civil opened 8 years ago

Civil commented 8 years ago

Currently, not all cases where drops occur are logged; in fact, if the drop happened because of stalled metrics, it isn't logged at all, even in debug mode.

I think all abnormal behavior should be logged as an error, because that's the easiest way to find out what's going on and with which server. So I suggest that carbon-c-relay write a message with the number of metrics dropped (e.g.: "backend #### - ## stalled metrics dropped").

grobian commented 8 years ago

What do you mean by drops being invisible when stalls happen? Does the graphite metric itself not include the drops in the counter?

If the relay were to log about drops, when and how often should it do so?

Civil commented 8 years ago

The drop counter is increased, but even in debug mode there is nothing in the log about what happened.

E.g. if a backend is completely down and the relay gives up, it writes "backend blah 100500 metrics dropped". If it's dropping because of stalled metrics, it doesn't say anything.

That happened when I was trying to set up some test clusters with different backends. For example, when communicating with influxdb I see some drops, but it's not easy to figure out that they actually happen because of stalled metrics.

grobian commented 8 years ago

stalling by definition means it's not dropping (because it tries to avoid doing that)

Civil commented 8 years ago

https://github.com/grobian/carbon-c-relay/blob/master/server.c#L570 It tries to avoid it, but in some cases it drops them anyway.

grobian commented 8 years ago

yeah, but it returns, so it doesn't drop (for it doesn't enqueue)

Civil commented 8 years ago

It increases the dropped counter, so on graphs it's visible as drops. If it's not actually dropping them, then it shouldn't touch the counter.

grobian commented 8 years ago

If the number of stalls > MAX_STALLS, then it is a drop, and counted as such. Else, it counts as a stall and the code returns.

grobian commented 8 years ago

IOW:
server stalls once -> ok
server stalls twice -> ok
server stalls thrice -> drop! (unblock client)
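
A minimal sketch of that decision, assuming a per-server stall counter and the MAX_STALLS limit mentioned above; the names and layout are illustrative only, not the actual server.c code:

/* illustrative only, not relay code */
#define MAX_STALLS 3   /* assumed limit: three stalls in a row, then drop */

typedef struct {
    int stalls;             /* consecutive stalls for this server */
    unsigned long dropped;  /* counter exposed as a carbon metric */
} server_t;

typedef enum { ENQUEUED, STALLED, DROPPED } outcome_t;

outcome_t submit(server_t *s, int queue_full)
{
    if (!queue_full) {
        s->stalls = 0;      /* queue has room again: enqueue as usual */
        return ENQUEUED;
    }
    if (s->stalls < MAX_STALLS) {
        s->stalls++;        /* stall: return without enqueueing, the
                               client stays blocked and will retry */
        return STALLED;
    }
    s->dropped++;           /* stalled too often in a row: count a drop
                               and unblock the client */
    s->stalls = 0;
    return DROPPED;
}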

Civil commented 8 years ago

Yup. My point is that you should log that.

grobian commented 8 years ago

Log what exactly? If I added a logline in that very bit of code, you'd spam yourself into a DoS whenever you encounter a target that b0rks. That's why the counters are there. So what exactly would you like to be notified of, and when?

Civil commented 8 years ago

The best option would be one logline per backend that stalls metrics, e.g. "backend blah: 100500 stalled metrics dropped after 4 tries".

Otherwise identifying the issue can take quite some time.
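
One low-volume way to produce such a line is to report per backend at most once per interval, and only for what was dropped since the previous report. A sketch under assumed names (backend_t, LOG_INTERVAL, report_stall_drops); nothing here exists in the relay today:

#include <stdio.h>
#include <time.h>

#define LOG_INTERVAL 60   /* assumed: at most one line per backend per minute */

typedef struct {
    const char *name;
    unsigned long stall_drops;     /* drops caused by repeated stalls */
    unsigned long reported_drops;  /* value at the time of the last log line */
    time_t last_report;
} backend_t;

/* Emit "backend X: N stalled metrics dropped" at most once per
 * LOG_INTERVAL, covering everything dropped since the previous line. */
void report_stall_drops(backend_t *b, time_t now)
{
    unsigned long delta = b->stall_drops - b->reported_drops;

    if (delta == 0 || now - b->last_report < LOG_INTERVAL)
        return;

    fprintf(stderr, "backend %s: %lu stalled metrics dropped\n",
            b->name, delta);
    b->reported_drops = b->stall_drops;
    b->last_report = now;
}

That keeps the log quiet for healthy backends while a misbehaving one produces at most one line per interval.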

Civil commented 8 years ago

Another use case: some users send metrics that they really don't want to send (e.g. a metric named HASH0xblahblah). It would be really nice to blackhole them, but with an occasional log message saying that metrics matching the rule were blackholed, including some examples of such metrics.

grobian commented 8 years ago

cluster mylog
    file ip /var/log/crap.log;

match crap
    send to mylog
    stop;

The only question here is how this would respond to logrotate; my first suspicion is that a SIGHUP won't close/reopen the crap.log file.
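
For the logrotate concern, the usual pattern is to have the SIGHUP handler only set a flag and let the writer reopen the file outside the handler. A generic sketch of that pattern, not what the relay currently does:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t reopen_log = 0;

static void hup_handler(int sig)
{
    (void)sig;
    reopen_log = 1;   /* async-signal-safe: just set a flag */
}

/* Called from the writer loop between writes. */
FILE *maybe_reopen(FILE *f, const char *path)
{
    if (!reopen_log)
        return f;
    reopen_log = 0;
    fclose(f);                /* logrotate has already renamed the old file */
    return fopen(path, "a");  /* pick up the freshly created one */
}

int main(void)
{
    const char *path = "/var/log/crap.log";  /* path from the config above */
    FILE *f = fopen(path, "a");

    signal(SIGHUP, hup_handler);
    /* ... real writer loop would call: f = maybe_reopen(f, path); ... */
    if (f) fclose(f);
    return 0;
}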

Civil commented 8 years ago

Yeah, but as you know, people who send crap usually send enormous amounts of it, so simple logging won't help much (it will fill up whatever disk is available). That's more a case for sampled logging (log every 10000th metric, for example).

The SIGHUP thing is actually easy to fix, so it won't be a problem.
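
A counter-based take on that kind of sampling, i.e. log only every Nth metric that hits a given rule; again just a sketch with assumed names (rule_t, sample_log), not an existing relay feature:

#include <stdio.h>

#define SAMPLE_EVERY 10000   /* from the example above: every 10000th metric */

typedef struct {
    const char *rulename;
    unsigned long matched;   /* running count of metrics hitting this rule */
} rule_t;

/* Call for every metric that matches the rule; logs one example per
 * SAMPLE_EVERY matches instead of logging every single one. */
void sample_log(rule_t *r, const char *metric)
{
    if (r->matched++ % SAMPLE_EVERY == 0)
        fprintf(stderr, "rule %s: %lu metrics matched so far, e.g. %s\n",
                r->rulename, r->matched, metric);
}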

grobian commented 8 years ago

Perhaps I should introduce the "sample" construct so you can do this.

Civil commented 8 years ago

Yes, sampling should be sufficient.

grobian commented 6 years ago

Re-reading this issue, I think it would be nice, as suggested in the first post, if drops etc. were logged in a low-volume manner such that it's easier to spot problems from the logfiles.
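
One way to do that without new config keywords would be a periodic pass (e.g. piggy-backed on the existing statistics interval) that only logs servers whose drop/stall counters moved since the previous pass. A sketch with assumed names, not a patch:

#include <stdio.h>
#include <stddef.h>

/* hypothetical view of the per-server counters the relay already
 * exposes as carbon metrics */
typedef struct {
    const char *name;
    unsigned long dropped, stalled;
    unsigned long prev_dropped, prev_stalled;
} srv_stats_t;

/* Run once per interval: at most one line per server, and only when
 * something non-ideal actually happened since the last pass. */
void log_non_ideal(srv_stats_t *srv, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned long d = srv[i].dropped - srv[i].prev_dropped;
        unsigned long s = srv[i].stalled - srv[i].prev_stalled;

        if (d || s)
            fprintf(stderr, "server %s: %lu metrics dropped, "
                    "%lu stalls in the last interval\n",
                    srv[i].name, d, s);

        srv[i].prev_dropped = srv[i].dropped;
        srv[i].prev_stalled = srv[i].stalled;
    }
}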