Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
There is a config setting, USE_FLOW_CONTROL, which defaults to enabled. The stated behaviour is that, when enabled, carbon will pause reading metrics when its sink is full. Indeed, carbon does pause receiving metrics, but it also drops any "in-flight" metrics.
Test setup
Set up Carbon to not relay aggregator misses, and to use line protocol (so it's easy to test from bash). Also set up a.* to aggregate every 3 seconds, and carbon to spit out its stats every 1 second, just to avoid waiting around. Finally, verify that USE_FLOW_CONTROL is True, so that data shouldn't be dropped when buffers fill.
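For reference, the configuration I have in mind looks roughly like this. This is a reconstruction rather than a copy of my config: the carbon.conf setting names are standard, but the exact values and the shape of the aggregation rule are assumptions.

```
# carbon.conf, [aggregator] section -- reconstruction, exact values assumed
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023        # line protocol, so nc from bash works
FORWARD_ALL = False              # don't relay aggregator misses
USE_FLOW_CONTROL = True          # data should not be dropped when buffers fill
MAX_QUEUE_SIZE = 10000
CARBON_METRIC_INTERVAL = 1       # emit carbon's own stats every second

# aggregation-rules.conf -- aggregate a.* every 3 seconds (rule shape assumed)
agg.<n> (3) = sum a.<n>
```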
Fire up a socket to look at what comes out, fire up an aggregator, and send 30k unaggregated metrics. This fairly consistently gives me the same stats from run to run. (If I make a longer metric name, the count drops, so this looks related to a buffer size at some point in the pipeline.)
Send 80k aggregated metrics:
~/carbon$ date=1724239620; for i in {100001..180000}; do echo a.$i 1000 $date; done > test-data; nc -q0 localhost 2023 < test-data
This gives much less consistent output, but does consistently drop data anyway. The timestamps are later than in the unaggregated case, so the dropping is at aggregation time, not production time.

Send 1 aggregated metric with 80k datapoints:
~/carbon$ date=1724239620; for i in {100001..180000}; do echo a.100000 1000 $date; done > test-data; nc -q0 localhost 2023 < test-data
This never drops anything, even with very, very large counts.
There are a few problem cases:
Where the aggregator passes metrics through verbatim (under FORWARD_ALL), acting as a relay: when MAX_QUEUE_SIZE is hit, the cacheFull event is called. However, this apparently only pauses reading from the socket; it doesn't pause processing of data that has already been read. This means it turns on flow control but immediately drops a bunch of metrics on the floor. This bug probably affects the relay service, too.
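To illustrate the failure mode, here is a paraphrased sketch. It is not carbon's actual code; only MAX_QUEUE_SIZE and the cacheFull event correspond to real carbon names, the rest is assumed.

```python
# Paraphrased sketch of the drop-on-full pattern -- illustrative, not carbon source.
class MetricQueue:
    def __init__(self, max_queue_size, events):
        self.max_queue_size = max_queue_size
        self.events = events            # events.cacheFull() pauses the receivers
        self.queue = []

    def push(self, metric, datapoint):
        if len(self.queue) >= self.max_queue_size:
            self.events.cacheFull()     # flow control kicks in here...
            return False                # ...but this datapoint is silently discarded
        self.queue.append((metric, datapoint))
        return True

def process_batch(queue, parsed_lines):
    # The receiver has already read and parsed a chunk of line-protocol data.
    # Pausing the socket does nothing for these: once the queue fills mid-batch,
    # every remaining datapoint in the batch is lost.
    for metric, datapoint in parsed_lines:
        queue.push(metric, datapoint)
```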
Where the aggregator acts as an aggregator, it will fill its buffers with as many IntervalBuffer pairs as it can, then aggregate out the intervals when their LoopingCalls tick over. The aggregation process is not paused at all when the buffer fills, which means, again, metrics are just dropped on the floor when the output becomes full. Realistically, the problem here is that the aggregator's state isn't properly limiting its count of IntervalBuffer pairs to MAX_QUEUE_SIZE -- if it ever exceeds this value, it is likely to drop data on the next aggregation tick, so should be considered independently full.
Curiously, there is no bound on the amount of data inside a given buffer, nor on the number of buffers, so in the case where the test generates tens of thousands of datapoints which all aggregate into one value, the aggregator dutifully generates a single list with all of the datapoints in RAM and then spits them out in one go, later. There's no failure in my test, then, but the size of the aggregator is in theory unbounded in this case!
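A very rough sketch of the shape of this, for illustration: IntervalBuffer and LoopingCall are real names from carbon and Twisted, everything else is assumed, and output_queue is the drop-on-full queue sketched above.

```python
# Rough, paraphrased sketch of the aggregator's buffering -- illustrative only.
from collections import defaultdict

class IntervalBuffer:
    def __init__(self):
        self.values = []                # no cap on datapoints per buffer

    def add(self, value):
        self.values.append(value)

class Aggregator:
    def __init__(self, output_queue, aggregate_fn):
        # One buffer per (metric, interval); nothing limits how many exist,
        # and nothing ties this count back to MAX_QUEUE_SIZE.
        self.buffers = defaultdict(IntervalBuffer)
        self.output_queue = output_queue
        self.aggregate_fn = aggregate_fn

    def receive(self, metric, interval, value):
        self.buffers[(metric, interval)].add(value)

    def flush(self):
        # Driven by a LoopingCall on every aggregation tick. Flushing is never
        # paused by flow control, so if output_queue is full right now, the
        # aggregated values are simply lost.
        for (metric, interval), buf in list(self.buffers.items()):
            if buf.values:
                self.output_queue.push(metric, self.aggregate_fn(buf.values))
                buf.values.clear()
```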
Inspecting the code, I see that the same logic exists in _MetricCache.store -- it relies on events.cacheFull() to prevent any further calls, and will drop any datapoints that arrive after this until the queue drains somewhat.
Expected behavior
With USE_FLOW_CONTROL = True, no metrics should ever be dropped.
Suggestion
I propose that:
USE_FLOW_CONTROL should be properly implemented for daemons other than the cache. Fixing this reduces the number of drops in my example from 14500 to 6332 but not to zero. It's also very simple to do.
MAX_QUEUE_SIZE should be (or should be allowed to be) treated as a soft limit which causes the input streams to pause, with the daemon allowed to exceed it by some factor once the inputs are paused -- or it should soft-cap below MAX_QUEUE_SIZE. Perhaps QUEUE_LOW_WATERMARK_PCT should be lowered, and a QUEUE_HIGH_WATERMARK_PCT added, above which the queue is considered full but data is not actually dropped. Experimentation suggests QUEUE_LOW_WATERMARK_PCT = 0.6 and QUEUE_HIGH_WATERMARK_PCT = 0.7 are sufficient to avoid metric dropping with the default MAX_QUEUE_SIZE = 10000; 0.6/0.8 was not sufficient. Alternatively, HARD_MAX_QUEUE_SIZE_PCT = 1.25 (taking MAX_QUEUE_SIZE as the high watermark) with the default QUEUE_LOW_WATERMARK_PCT = 0.8 was fine, too. A sketch of this watermark scheme appears after these proposals.
The same, or similar, massaging should be applied to MAX_CACHE_SIZE.
These will fix the relay behaviour.
The aggregator's IntervalBuffer count should be capped to some function of MAX_QUEUE_SIZE or some new setting, and behave like a full queue when it exceeds this. Flow control might not be the correct solution here; the aggregation buffer count can never be "caught up" but is instead a function of the number of active metrics.
There should be some limit on the number of datapoints per IntervalBuffer and/or the total number of datapoints shared between all of these.
The fix for the aggregation case is difficult, as the code currently only understands when its sole sink (the relay output, in this case) is blocked; the logic needs to understand that it can be paused for multiple reasons. These aggregator changes might best be moved to a distinct ticket.
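To make the watermark idea concrete, here is a hedged sketch of what the queue-full check could look like. QUEUE_LOW_WATERMARK_PCT already exists in carbon; QUEUE_HIGH_WATERMARK_PCT and HARD_MAX_QUEUE_SIZE_PCT are the hypothetical names proposed above, and the class is illustrative rather than a patch.

```python
# Hedged sketch of the proposed watermark scheme -- not carbon code.
from collections import deque

class WatermarkedQueue:
    def __init__(self, events, max_queue_size=10000,
                 low_watermark_pct=0.6,   # resume inputs once drained below this
                 high_watermark_pct=0.7,  # pause inputs above this ("soft full")
                 hard_max_pct=1.25):      # only beyond this is data ever dropped
        self.events = events              # carbon-style cacheFull/cacheSpaceAvailable
        self.resume_at = max_queue_size * low_watermark_pct
        self.pause_at = max_queue_size * high_watermark_pct
        self.hard_cap = max_queue_size * hard_max_pct
        self.queue = deque()
        self.paused = False

    def push(self, metric, datapoint):
        if len(self.queue) >= self.hard_cap:
            return False                  # dropped, but only well past MAX_QUEUE_SIZE
        self.queue.append((metric, datapoint))
        if not self.paused and len(self.queue) >= self.pause_at:
            self.paused = True
            self.events.cacheFull()       # pause inputs long before the hard cap
        return True

    def pop(self):
        item = self.queue.popleft() if self.queue else None
        if self.paused and len(self.queue) <= self.resume_at:
            self.paused = False
            self.events.cacheSpaceAvailable()   # let inputs resume
        return item
```

The point is that datapoints already in flight when cacheFull() fires land in the headroom between the high watermark and the hard cap, instead of being dropped on the floor.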