graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.

[BUG] Even with flow control enabled, carbon aggregator drops incoming metrics when output queues are full #956

Open bucko909 opened 3 weeks ago

bucko909 commented 3 weeks ago

There is a config setting, USE_FLOW_CONTROL, which defaults to enabled. The stated behaviour is that, when enabled, carbon will pause reading metrics when its sink is full. Indeed, carbon does pause receiving metrics, but it also drops any "in-flight" metrics.
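
As an illustration, here is a minimal sketch of the two behaviours (hypothetical code, not carbon's implementation; OutputQueue, pause_producer and resume_producer are made-up names). With flow control working as documented, a full sink should only pause the producer; the observed behaviour also discards the datapoint that arrived while the sink was full.

from collections import deque

# Hypothetical sketch, not carbon's actual code.
class OutputQueue:
    def __init__(self, max_size, pause_producer, resume_producer):
        self.queue = deque()
        self.max_size = max_size
        self.pause_producer = pause_producer
        self.resume_producer = resume_producer
        self.full_queue_drops = 0

    def put_observed(self, datapoint):
        # Observed behaviour: the producer is paused, but the datapoint
        # that arrived while the queue was full is also discarded.
        if len(self.queue) >= self.max_size:
            self.pause_producer()
            self.full_queue_drops += 1  # surfaces as fullQueueDrops
            return
        self.queue.append(datapoint)

    def put_expected(self, datapoint):
        # Documented behaviour: keep the in-flight datapoint and only
        # pause the producer until the queue drains.
        self.queue.append(datapoint)
        if len(self.queue) >= self.max_size:
            self.pause_producer()

    def drain(self, count):
        # Called when the downstream connection accepts more data.
        for _ in range(min(count, len(self.queue))):
            self.queue.popleft()
        if len(self.queue) < self.max_size:
            self.resume_producer()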

Test setup

Set up Carbon to not relay aggregator misses, and to use the line protocol (so it's easy to test from bash). Also set up a.* to aggregate every 3 seconds, and carbon to spit out stats every 1 second, just to avoid waiting around. Finally, verify that USE_FLOW_CONTROL is True, so that data shouldn't be dropped when buffers fill.

~/carbon$ (cd test-root/conf; for i in *.conf; do diff -u0 ../../conf/$i.example $i; done)
--- ../../conf/aggregation-rules.conf.example   2021-10-08 18:21:23.195248379 +0100
+++ aggregation-rules.conf      2024-08-21 13:31:47.789993319 +0100
@@ -59,0 +60,2 @@
+
+a.<a> (3) = sum a.<a>
--- ../../conf/carbon.conf.example      2024-07-08 17:43:30.486176214 +0100
+++ carbon.conf 2024-08-21 14:35:32.558819082 +0100
@@ -436 +436 @@
-DESTINATIONS = 127.0.0.1:2004
+DESTINATIONS = 127.0.0.1:2003
@@ -441 +441 @@
-# DESTINATION_PROTOCOL = pickle
+DESTINATION_PROTOCOL = line
@@ -614 +614,2 @@
-DESTINATIONS = 127.0.0.1:2004
+DESTINATIONS = 127.0.0.1:2003
+DESTINATION_PROTOCOL = line
@@ -668 +669 @@
-# CARBON_METRIC_INTERVAL = 60
+CARBON_METRIC_INTERVAL = 1
@@ -680,0 +682 @@
+LOG_AGGREGATOR_MISSES = False
~/carbon$ grep ^USE_FLOW_CONTROL test-root/conf/carbon.conf 
USE_FLOW_CONTROL = True
USE_FLOW_CONTROL = True
USE_FLOW_CONTROL = True

Fire up a socket to look at what comes out:

~/carbon$ nc -l 2003 | grep -i drop

Fire up an aggregator:

~/carbon$ GRAPHITE_ROOT=$HOME/carbon/test-root/ PYTHONPATH=lib test-env/bin/twistd --nodaemon --reactor=epoll --no_save carbon-aggregator

Send 30k unaggregated metrics:

~/carbon$ date=1724239620; for i in {100001..130000}; do echo b.$i 1000 $date; done > test-data; nc -q0 localhost 2023 < test-data
1724239620

This fairly consistently gives me stats like:

carbon.aggregator.hostname-a.destinations.127_0_0_1:2003:None.fullQueueDrops 14500 1724248203

(If I use a longer metric name, the drop count goes down, so this looks related to a buffer size somewhere in the pipeline.)

Send 80k aggregated metrics:

~/carbon$ date=1724239620; for i in {100001..180000}; do echo a.$i 1000 $date; done > test-data; nc -q0 localhost 2023 < test-data

This gives much less consistent output, but still consistently drops data. The timestamps are later than in the unaggregated case, so the dropping happens at aggregation time, not when the input is received:

carbon.aggregator.hostname-a.destinations.127_0_0_1:2003:None.fullQueueDrops 4369 1724248638
carbon.aggregator.hostname-a.destinations.127_0_0_1:2003:None.fullQueueDrops 1503 1724248639
carbon.aggregator.hostname-a.destinations.127_0_0_1:2003:None.fullQueueDrops 228 1724248640

Send 1 aggregated metric with 80k datapoints:

~/carbon$ date=1724239620; for i in {100001..180000}; do echo a.100000 1000 $date; done > test-data; nc -q0 localhost 2023 < test-data

This never drops anything, even with very, very large counts.

There are a few problem cases: datapoints that arrive while the output queue is already full are dropped outright (the unaggregated test above), and the aggregator's own computed values are dropped at flush time if the queue is still full (the aggregated test above).

Inspecting the code, the same logic exists in _MetricCache.store -- it relies on events.cacheFull() to prevent any further calls, and will drop any datapoints that do arrive until the queue drains somewhat.
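
Roughly, that pattern looks like the following paraphrased sketch (not the actual carbon source; only _MetricCache.store and events.cacheFull are names taken from the code discussed above):

# Paraphrased sketch of the pattern described above; not the actual
# carbon source.
class _MetricCache(dict):
    def __init__(self, max_size, cache_full_event):
        super().__init__()
        self.max_size = max_size
        self.size = 0
        self.cache_full_event = cache_full_event  # stands in for events.cacheFull

    def is_full(self):
        return self.size >= self.max_size

    def store(self, metric, datapoint):
        if self.is_full():
            # cacheFull() has already paused the receivers, but any
            # datapoint that still reaches store() is silently lost
            # until the cache drains.
            return
        self.setdefault(metric, []).append(datapoint)
        self.size += 1
        if self.is_full():
            self.cache_full_event()  # pause receivers from here on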

Expected behavior

With USE_FLOW_CONTROL = True, no metrics should ever be dropped.

Suggestion

I propose that, when USE_FLOW_CONTROL is enabled, the output queue and cache paths hold on to in-flight datapoints and pause their producers rather than dropping anything. These changes will fix the relay behaviour.

The fix for the aggregation case is difficult, as the code currently only understands when its sole sink (the relay output, in this case) is blocked. The logic needs to understand that it can be paused for more than one reason at a time. That fix might best be moved to a distinct ticket.
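
One possible shape for that multi-reason logic, as a rough sketch (PauseTracker, block and unblock are hypothetical names, not existing carbon APIs): the daemon records every reason it is currently blocked and only resumes reading once the set is empty.

# Hypothetical sketch of pausing for multiple reasons; none of these
# names exist in carbon today.
class PauseTracker:
    def __init__(self, pause_receivers, resume_receivers):
        self.reasons = set()
        self.pause_receivers = pause_receivers
        self.resume_receivers = resume_receivers

    def block(self, reason):
        # e.g. "relay-queue-full" or "aggregation-buffers-full"
        if not self.reasons:
            self.pause_receivers()
        self.reasons.add(reason)

    def unblock(self, reason):
        self.reasons.discard(reason)
        if not self.reasons:
            self.resume_receivers()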

bucko909 commented 3 weeks ago

I'll submit patches for the non-aggregator fixes.

deniszh commented 3 weeks ago

Wow, that's a really impressive investigation and fix, much appreciated! Let me check and merge the patches.