grafana / carbon-relay-ng

Fast carbon relay+aggregator with admin interfaces for making changes online - production ready

Rate limit on spooled metrics ? #156

Open vidhu5269 opened 7 years ago

vidhu5269 commented 7 years ago

Currently, carbon-relay-ng spools metrics if the destination endpoint is down, but when it replays those metrics it doesn't limit the rate at which they are sent. In one of our staging setups, we receive metrics at a steady rate of 40K per minute. One night, due to a network issue in that data center, the relay couldn't talk to one of the carbon-cache machines for the whole night.

In the morning, when we got the issue resolved, the number of incoming metrics at the cache rose to 1.3M per minute, and after some time this brought the cache down because it couldn't sustain that much load. The carbon cache runs on a VM and shares its disk with other machines, so there is a limit to how much I/O we can expect.

I looked through the documentation but couldn't find a way to limit the rate at which spooled metrics are sent to the caches. Please point me to the doc if such a configuration exists. If not, does it make sense to add this functionality? It would help us control the transition back to steady state and avoid causing further failures just as the original issue is being resolved.

Dieterbe commented 7 years ago

I think the most elegant solution would be for carbon-cache to provide backpressure, i.e. read data from the connection only at the pace it can handle. That will slow down the relay's writes to the connection.
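To make the backpressure idea concrete, here is a minimal Go sketch. It is not carbon-cache or carbon-relay-ng code: a bounded channel stands in for the connection's buffer, and `offer` and the sample metric lines are hypothetical names chosen for illustration. Once the buffer is full, the writer cannot push anything more until the reader takes something out.

```go
package main

import "fmt"

// offer tries to hand each metric to the receiver without blocking and
// returns how many were accepted before the receiver's buffer filled up.
// (Hypothetical helper; a real TCP writer would block instead of returning.)
func offer(buf chan<- string, metrics []string) int {
	for i, m := range metrics {
		select {
		case buf <- m:
			// accepted: the receiver still has room
		default:
			// buffer full: the receiver is not keeping up, so the writer stalls
			return i
		}
	}
	return len(metrics)
}

func main() {
	buf := make(chan string, 3) // stands in for the receiver's socket buffer
	metrics := []string{"a 1 0", "b 2 0", "c 3 0", "d 4 0", "e 5 0"}

	sent := offer(buf, metrics)
	fmt.Println("accepted before stall:", sent)

	<-buf // the receiver reads one metric at its own pace
	sent += offer(buf, metrics[sent:])
	fmt.Println("accepted after one read:", sent)
}
```

The point of the sketch is only that the reader's pace bounds the writer's pace; whether carbon-cache can read slowly enough without other side effects is the open question in this thread.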

vidhu5269 commented 7 years ago

The rate limit has to apply only to the spooled metrics and not to the regular incoming data. The cache already maintains a queue of "processed metrics" to prevent loss due to I/O latency; if it has to handle backpressure as well, that will probably be too much for it.

Also, spooling is a pretty handy feature of carbon-relay-ng, but this "batch push" creates a burst of traffic that the system may not have been scaled for. We want to spool as much data as possible to cover prolonged failures, but absorbing the resulting burst would require non-linear scaling of the underlying carbon caches. In contrast, a limit at the relay would mean a long but smooth transition to steady state, without needing that non-linear scaling of the caches.
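The trade-off described above ("long but smooth") can be put into rough numbers. The Go sketch below is only arithmetic, not relay code: `drainMinutes` is a hypothetical helper, the 8-hour outage length and the 80K and 200K caps are made-up values for illustration, and only the 40K/min steady rate and the 1.3M/min peak come from the report.

```go
package main

import "fmt"

// drainMinutes estimates how long it takes to replay a spooled backlog when
// the destination is limited to capPerMin metrics per minute in total and
// live traffic keeps arriving at livePerMin. It returns -1 when the cap
// leaves no headroom for replay at all.
func drainMinutes(backlog, livePerMin, capPerMin float64) float64 {
	headroom := capPerMin - livePerMin
	if headroom <= 0 {
		return -1
	}
	return backlog / headroom
}

func main() {
	const live = 40_000.0      // steady rate from the report, metrics per minute
	const outageMin = 8 * 60.0 // assumption: an 8-hour overnight outage
	backlog := live * outageMin

	// 80K and 200K are hypothetical caps; 1.3M is the peak seen in the report.
	for _, c := range []float64{80_000, 200_000, 1_300_000} {
		fmt.Printf("cap %.0f/min: backlog drains in about %.0f min\n",
			c, drainMinutes(backlog, live, c))
	}
}
```

Under these assumptions, a lower cap stretches the recovery over hours but keeps the cache's input rate within a small multiple of its normal load.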

All my arguments are based on the current carbon cache implementation and not an alternate approach which may or should come in the future. Does it make sense?

Dieterbe commented 7 years ago

Recently (#210) the relay gained several more config options for tuning the sleep between reads of metrics off the spool. See https://github.com/graphite-ng/carbon-relay-ng/blob/master/docs/routes.md#carbon-destination for more information. Let me know how it goes.

hamelg commented 6 years ago

Great! Here, tuning unspoolsleep is very helpful. I had been setting MAX_CACHE_SIZE on my carbon-cache to limit the rate, but it was still being flooded and some values were dropped. With unspoolsleep = 50, no metrics are lost :) Thanks very much.