Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
There is a config setting, USE_FLOW_CONTROL, which defaults to enabled. The stated behaviour is that, when enabled, carbon will pause reading metrics when its sink is full. Indeed, carbon does pause receiving metrics, but it also drops any "in-flight" metrics.
Test setup
Set up Carbon to not relay aggregator misses, and to use line protocol (so it's easy to test from bash). Also set up a.* to aggregate every 3 seconds, and carbon to spit out its stats every 1 second, just to avoid waiting around. Finally, verify that USE_FLOW_CONTROL is True, so that data shouldn't be dropped when buffers fill.
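For reference, the configuration I have in mind looks roughly like this. This is a reconstruction rather than a copy of my config: the carbon.conf setting names are standard, but the exact values and the shape of the aggregation rule are assumptions.

```
# carbon.conf, [aggregator] section -- reconstruction, exact values assumed
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023        # line protocol, so nc from bash works
FORWARD_ALL = False              # don't relay aggregator misses
USE_FLOW_CONTROL = True          # data should not be dropped when buffers fill
MAX_QUEUE_SIZE = 10000
CARBON_METRIC_INTERVAL = 1       # emit carbon's own stats every second

# aggregation-rules.conf -- aggregate a.* every 3 seconds (rule shape assumed)
agg.<n> (3) = sum a.<n>
```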
Fire up a socket to look at what comes out, fire up an aggregator, and send 30k unaggregated metrics. This fairly consistently gives me the same stats from run to run. (If I make a longer metric name, the count drops, so this looks related to a buffer size at some point in the pipeline.)
Send 80k aggregated metrics:
~/carbon$ date=1724239620; for i in {100001..180000}; do echo a.$i 1000 $date; done > test-data; nc -q0 localhost 2023 < test-data
This gives much less consistent output, but does consistently drop data anyway. The timestamps are later than in the unaggregated case, so the dropping is at aggregation time, not production time.

Send 1 aggregated metric with 80k datapoints:
~/carbon$ date=1724239620; for i in {100001..180000}; do echo a.100000 1000 $date; done > test-data; nc -q0 localhost 2023 < test-data
This never drops anything, even with very, very large counts.
There are a few problem cases:
Where the aggregator passes metrics through verbatim (under FORWARD_ALL), acting as a relay: when MAX_QUEUE_SIZE is hit, the cacheFull event is called. However, this apparently only pauses reading from the socket; it doesn't pause processing of data that has already been read. This means it turns on flow control but immediately drops a bunch of metrics on the floor. This bug probably affects the relay service, too.
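To illustrate the failure mode, here is a paraphrased sketch. It is not carbon's actual code; only MAX_QUEUE_SIZE and the cacheFull event correspond to real carbon names, the rest is assumed.

```python
# Paraphrased sketch of the drop-on-full pattern -- illustrative, not carbon source.
class MetricQueue:
    def __init__(self, max_queue_size, events):
        self.max_queue_size = max_queue_size
        self.events = events            # events.cacheFull() pauses the receivers
        self.queue = []

    def push(self, metric, datapoint):
        if len(self.queue) >= self.max_queue_size:
            self.events.cacheFull()     # flow control kicks in here...
            return False                # ...but this datapoint is silently discarded
        self.queue.append((metric, datapoint))
        return True

def process_batch(queue, parsed_lines):
    # The receiver has already read and parsed a chunk of line-protocol data.
    # Pausing the socket does nothing for these: once the queue fills mid-batch,
    # every remaining datapoint in the batch is lost.
    for metric, datapoint in parsed_lines:
        queue.push(metric, datapoint)
```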
Where the aggregator acts as an aggregator, it will fill its buffers with as many IntervalBuffer pairs as it can, then aggregate out the intervals when their LoopingCalls tick over. The aggregation process is not paused at all when the buffer fills, which means, again, metrics are just dropped on the floor when the output becomes full. Realistically, the problem here is that the aggregator's state isn't properly limiting its count of IntervalBuffer pairs to MAX_QUEUE_SIZE -- if it ever exceeds this value, it is likely to drop data on the next aggregation tick, so should be considered independently full.
Curiously, there is no bound on the amount of data inside a given buffer, nor on the number of buffers, so in the case where the test generates tens of thousands of datapoints which all aggregate into one value, the aggregator dutifully generates a single list with all of the datapoints in RAM and then spits them out in one go, later. There's no failure in my test, then, but the size of the aggregator is in theory unbounded in this case!
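A very rough sketch of the shape of this, for illustration: IntervalBuffer and LoopingCall are real names from carbon and Twisted, everything else is assumed, and output_queue is the drop-on-full queue sketched above.

```python
# Rough, paraphrased sketch of the aggregator's buffering -- illustrative only.
from collections import defaultdict

class IntervalBuffer:
    def __init__(self):
        self.values = []                # no cap on datapoints per buffer

    def add(self, value):
        self.values.append(value)

class Aggregator:
    def __init__(self, output_queue, aggregate_fn):
        # One buffer per (metric, interval); nothing limits how many exist,
        # and nothing ties this count back to MAX_QUEUE_SIZE.
        self.buffers = defaultdict(IntervalBuffer)
        self.output_queue = output_queue
        self.aggregate_fn = aggregate_fn

    def receive(self, metric, interval, value):
        self.buffers[(metric, interval)].add(value)

    def flush(self):
        # Driven by a LoopingCall on every aggregation tick. Flushing is never
        # paused by flow control, so if output_queue is full right now, the
        # aggregated values are simply lost.
        for (metric, interval), buf in list(self.buffers.items()):
            if buf.values:
                self.output_queue.push(metric, self.aggregate_fn(buf.values))
                buf.values.clear()
```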
Inspecting the code, I see that the same logic exists in _MetricCache.store -- it relies on events.cacheFull() to prevent any further calls, and will drop any datapoints that arrive after this until the queue drains somewhat.
Expected behavior
With USE_FLOW_CONTROL = True, no metrics should ever be dropped.
Suggestion
I propose that:
USE_FLOW_CONTROL should be properly implemented for daemons other than the cache. Fixing this reduces the number of drops in my example from 14500 to 6332 but not to zero. It's also very simple to do.
MAX_QUEUE_SIZE should be (or should be allowed to be) treated as a soft limit which causes the input streams to pause, with the daemon allowed to exceed it by some factor once the inputs are paused -- or it should soft-cap below MAX_QUEUE_SIZE. Perhaps QUEUE_LOW_WATERMARK_PCT should be lowered, and a QUEUE_HIGH_WATERMARK_PCT added, above which the queue is considered full but data is not actually dropped. Experimentation suggests QUEUE_LOW_WATERMARK_PCT = 0.6 and QUEUE_HIGH_WATERMARK_PCT = 0.7 are sufficient to avoid metric dropping with the default MAX_QUEUE_SIZE = 10000; 0.6/0.8 was not sufficient. Alternatively, HARD_MAX_QUEUE_SIZE_PCT = 1.25 (taking MAX_QUEUE_SIZE as the high watermark) with the default QUEUE_LOW_WATERMARK_PCT = 0.8 was fine, too. A sketch of this watermark scheme appears after these proposals.
The same, or similar, massaging should be applied to MAX_CACHE_SIZE.
These will fix the relay behaviour.
The aggregator's IntervalBuffer count should be capped to some function of MAX_QUEUE_SIZE or some new setting, and behave like a full queue when it exceeds this. Flow control might not be the correct solution here; the aggregation buffer count can never be "caught up" but is instead a function of the number of active metrics.
There should be some limit on the number of datapoints per IntervalBuffer and/or the total number of datapoints shared between all of these.
The fix for the aggregation case is difficult, as the code currently only understands when its sole sink (the relay output, in this case) is blocked; the logic needs to understand that it can be paused for multiple reasons. These aggregator changes might best be moved to a distinct ticket.
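To make the watermark idea concrete, here is a hedged sketch of what the queue-full check could look like. QUEUE_LOW_WATERMARK_PCT already exists in carbon; QUEUE_HIGH_WATERMARK_PCT and HARD_MAX_QUEUE_SIZE_PCT are the hypothetical names proposed above, and the class is illustrative rather than a patch.

```python
# Hedged sketch of the proposed watermark scheme -- not carbon code.
from collections import deque

class WatermarkedQueue:
    def __init__(self, events, max_queue_size=10000,
                 low_watermark_pct=0.6,   # resume inputs once drained below this
                 high_watermark_pct=0.7,  # pause inputs above this ("soft full")
                 hard_max_pct=1.25):      # only beyond this is data ever dropped
        self.events = events              # carbon-style cacheFull/cacheSpaceAvailable
        self.resume_at = max_queue_size * low_watermark_pct
        self.pause_at = max_queue_size * high_watermark_pct
        self.hard_cap = max_queue_size * hard_max_pct
        self.queue = deque()
        self.paused = False

    def push(self, metric, datapoint):
        if len(self.queue) >= self.hard_cap:
            return False                  # dropped, but only well past MAX_QUEUE_SIZE
        self.queue.append((metric, datapoint))
        if not self.paused and len(self.queue) >= self.pause_at:
            self.paused = True
            self.events.cacheFull()       # pause inputs long before the hard cap
        return True

    def pop(self):
        item = self.queue.popleft() if self.queue else None
        if self.paused and len(self.queue) <= self.resume_at:
            self.paused = False
            self.events.cacheSpaceAvailable()   # let inputs resume
        return item
```

The point is that datapoints already in flight when cacheFull() fires land in the headroom between the high watermark and the hard cap, instead of being dropped on the floor.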