graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

memory leak in relay and aggregator daemons #185

Closed dkulikovsky closed 8 years ago

dkulikovsky commented 10 years ago

I've got leaking megacarbon instances. The load is low, 50-100k metrics per minute. The config is simple: a pipeline of 2 relays with aggregated-consistent-hashing -> 1 aggregator with consistent-hashing -> 2 writers. The instances running the relay or aggregation role consume tons of memory:

top - 16:03:22 up 6 days,  1:09,  7 users,  load average: 143.34, 76.86, 48.83
Tasks: 414 total,   4 running, 410 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.2%us,  0.5%sy,  0.0%ni, 90.0%id,  1.2%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  65949272k total, 65443728k used,   505544k free,  1215012k buffers
Swap:        0k total,        0k used,        0k free,   269768k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  WCHAN     COMMAND                                                                                                                        
 2824 www-data  20   0 23.7g  23g 1816 R 99.4 37.5  34:38.02 -         /usr/bin/python /var/lib/graphite/bin/carbon-daemon.py carbon-relay-st01 start                                                  
 2897 www-data  20   0 11.9g  11g 1760 R 99.4 18.8   9:28.29 -         /usr/bin/python /var/lib/graphite/bin/carbon-daemon.py carbon-relay-st02 start                                                  
15538 www-data  20   0 23.7g  23g 1796 S 44.7 37.5  14:47.56 ep_poll   /usr/bin/python /var/lib/graphite/bin/carbon-daemon.py carbon-aggregator-st01 start                                             
 1125 www-data  20   0  299m 121m 1860 S  7.9  0.2  31:58.85 poll_sche /usr/bin/python /var/lib/graphite/bin/carbon-daemon.py carbon-writer-st02 start                                                 
 1120 www-data  20   0  310m 146m 1860 S  7.4  0.2  37:09.29 poll_sche /usr/bin/python /var/lib/graphite/bin/carbon-daemon.py carbon-writer-st01 start                                                 
 1118 www-data  20   0  310m 146m 1860 S  3.2  0.2  20:11.00 ep_poll   /usr/bin/python /var/lib/graphite/bin/carbon-daemon.py carbon-writer-st01 start                                                 
 1124 www-data  20   0  299m 121m 1860 S  3.2  0.2  17:23.44 ep_poll   /usr/bin/python /var/lib/graphite/bin/carbon-daemon.py carbon-writer-st02 start               

This happens really fast. Just after restarting the megacarbon daemons and starting the perf test tool, the relay and aggregator consume about 720MB of memory each in the first minute; in the second minute memory usage grows to 3GB each. After 20-30 minutes the processes consume 6-11GB each, and so on. The top output above was taken 10 hours after restarting the carbon daemons (the perf test lasted 30 minutes, after which the daemons worked only on the data they generate themselves). If RELAY_METHOD is set to relay-rules for the relay instances, everything is fine and memory usage is about 30-50MB, but for the aggregator this doesn't help at all.

I've tried to catch this memory leak with different tools, but I didn't succeed.

The current config for the relay is:

REPLICATION_FACTOR = 1
MAX_DATAPOINTS_PER_MESSAGE = 5000
MAX_QUEUE_SIZE = 1000
USE_FLOW_CONTROL = True
DESTINATIONS = 127.0.0.1:12301:st01
RELAY_METHOD = aggregated-consistent-hashing

And for the aggregator:

REPLICATION_FACTOR = 1
MAX_DATAPOINTS_PER_MESSAGE = 5000
MAX_QUEUE_SIZE = 1000000
USE_FLOW_CONTROL = True
DESTINATIONS = 127.0.0.1:2301:st01, 127.0.0.1:2302:st02
RELAY_METHOD = consistent-hashing
marco-hoyer commented 10 years ago

Sounds interesting. What is the size of your relay's destination queues? Is it growing while the memory usage rises? I see something similar with carbon-relay (master branch) not freeing memory after it was used for queueing.

dkulikovsky commented 10 years ago

The number of queued datapoints grows while memory usage rises. I added some debug logging and saw that the daemons do add new points to the queue. I set the queue limit for the daemons to 1000 points and this doesn't help: the daemon hits the limit and drops datapoints, but still uses GBs of memory after a while.


dkulikovsky commented 10 years ago

I reduced the number of aggregation-rules to 500 (was 1000) and memory usage dropped by half, so the bug is somewhere in there.

mleinart commented 10 years ago

Only a wild guess: could it be a case of catastrophic backtracking in one of your regular expressions in aggregation-rules causing things to back up?
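
To make that failure mode concrete, here is a minimal standalone illustration of catastrophic backtracking; the pattern and input below are purely hypothetical and not taken from this issue's rules:

import re
import time

# Nested quantifiers such as (a+)+ force the regex engine to try exponentially
# many ways to split the input when the overall match fails.
pattern = re.compile(r'^(a+)+$')
subject = 'a' * 25 + 'b'   # a long run of 'a' ending in a character that breaks the match

start = time.time()
pattern.match(subject)     # returns None, but only after excessive backtracking
print('match attempt took %.2f seconds' % (time.time() - start))
# Each additional 'a' roughly doubles the time; a non-nested pattern would fail
# in microseconds.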


dkulikovsky commented 10 years ago

The rules are pretty simple:

one_min.agg.test995 (60) = avg one_min.agg_test.test995.m*
one_min.agg.test996 (60) = avg one_min.agg_test.test996.m*
one_min.agg.test997 (60) = avg one_min.agg_test.test997.m*
one_min.agg.test998 (60) = avg one_min.agg_test.test998.m*
one_min.agg.test999 (60) = avg one_min.agg_test.test999.m*
one_min.agg.test1000 (60) = avg one_min.agg_test.test1000.m*

I think I have found the leak: the cache in rules.py can grow unattended, and disabling it fixes the problem:

--- lib/carbon/aggregator/rules.py      2013-11-19 23:05:48.632684971 +0400
+++ /var/lib/graphite/lib/carbon/aggregator/rules.py    2013-11-19 23:00:44.158481765 +0400
@@ -93,7 +93,7 @@
       except:
         log.err("Failed to interpolate template %s with fields %s" % (self.output_template, extracted_fields))

-    self.cache[metric_path] = result
+    #self.cache[metric_path] = result
     return result

   def build_regex(self):
frlinux commented 10 years ago

We have a similar and constant memory leak in the relay when using consistent hashing. On a relay receiving about 200k metrics per minute and taking a bit of time to write to disk (network storage with a bit of delay on IOPS), we have to restart the relays every 12h to free the memory, and when I say free, we are talking about 20GB leaked in a day.

sebito91 commented 10 years ago

We also have an issue when using consistent hashing like FRLinux. Our environment is sending ~150k metrics every 10 seconds, which is a relatively light load for the node.

The relay executes the hash without much queuing, but memory consumption spikes through the roof (~5GB) within 1 hour of a relay restart. We overrun memory and OOM well before the 12h that FRLinux is seeing.

jamesjuran commented 10 years ago

@dkulikovsky, would the ideal fix for this be a configuration item to limit the size of the cache, and logic to purge LRU items from the cache?

jamesjuran commented 10 years ago

It seems like the rate of memory growth would depend on how often "new" metrics are seen by the aggregator. I'm getting millions of metrics every 5 minutes, and very slow (but non-zero) growth in relay and aggregator memory usage. @frlinux, are you often seeing "new" metrics?

frlinux commented 10 years ago

Hey @jamesjuran, we don't have any really new metrics, if by new you mean metrics that did not exist before and had to be created. The growth of carbon-relay in consistent-hashing mode is steady and linear. For instance, at about 280,000 metrics per minute, carbon-relay in hashing mode will grow to 10GB of memory in 6 hours. Restarting the relay frees the memory again. We see that behavior on all relays using hashing. Our setup is all 0.9.10 though, so I'm going to test that theory with 0.9.12, which I grabbed today.

jamesjuran commented 10 years ago

I'm using 0.9.12 + a few patches (none of which should have anything to do with memory consumption). Hopefully your upgrade helps.

frlinux commented 10 years ago

Thanks, will report once I am done. Hoping to test this tomorrow; I just need to look at the new way to configure the different carbon daemons.

frlinux commented 10 years ago

[image: graphite_0912_memoryleak]

Including a graph of my latest test, pushing 500k metrics per minute. The first graph on the left is carbon-cache memory usage, the middle one is the relay memory usage (consistently growing, as you can see), and the last is the number of metrics pushed.

Over an hour, carbon-relay is consistently hogging memory in hashing mode. Here is my configuration for a given cache and the relay that talks to all caches:

[cache:1]
USER = apache
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7201
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = inf
WHISPER_AUTOFLUSH = True
LOG_UPDATES = False

[relay:1]
RELAY_METHOD = consistent-hashing
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
MAX_QUEUE_SIZE = 10000
MAX_DATAPOINTS_PER_MESSAGE = 500
USE_FLOW_CONTROL = True
DESTINATIONS = localhost:2004:1,localhost:2104:2,localhost:2204:3,localhost:2304:4,localhost:2404:5,localhost:2504:6,localhost:2604:7,localhost:2704:8

drawks commented 10 years ago

And if you switch to a rule-based relay and just configure the single cache, what do your graphs look like?

frlinux commented 10 years ago

@drawks we have that in production; rule-based does not leak, only the relay in hashing mode. It has been there for a while: 0.9.10, and today I tested 0.9.12 (latest pull from GitHub) to see if that fixed it. Both code bases have the issue.

drawks commented 10 years ago

Are your tests running against a system without any metrics on it yet? If you are blocking on file creates it'll balloon up a bit. I'm not seeing any unbounded growth using CH or aggregation-aware CH here, FWIW.

frlinux commented 10 years ago

So yes, all metrics were created on the spot, but this is on RAID SSD and it took only a few minutes to create them all. From then on, each minute sees an update to the 500k metrics, not new metric creation. As you can see in my config above, there is no throttling of any sort.

drawks commented 10 years ago

What version of Python are you on? I'm similarly on fast SSD without any throttling configured. I haven't seen evidence of a memory leak, but I have seen problems where a large number of connections seems to prevent the relay/aggregator from draining. What does the

carbon.relays.*.destinations.*.sent

graph look like?

frlinux commented 10 years ago

Here is a graph of the different relay destinations:

[image: carbon_relay_destinations]

Python 2.7 on CentOS 6.4 across the board.

drawks commented 10 years ago

So does the sum of sent track the number received? I'm confused as to what your four graphs are; 0.9.x should only have "sent", "queueUntilReady", "fullQueueDrops" and "attemptedRelays".

dkulikovsky commented 10 years ago

This patch fixes the problem:

--- a/lib/carbon/aggregator/rules.py
+++ b/lib/carbon/aggregator/rules.py
@@ -93,7 +93,7 @@ class AggregationRule:
       except:
         log.err("Failed to interpolate template %s with fields %s" % (self.output_template, extracted_fields))

-    self.cache[metric_path] = result
+    if result: self.cache[metric_path] = result
     return result

   def build_regex(self):

Story: first of all, I decided that disabling the cache entirely is not a good idea, so I tried to find the correct way to solve this problem. My first attempt was with Redis: I moved the whole cache into a Redis database and set up a connection pool for the carbon daemons. This obviously didn't help, since storing and fetching data in Redis is much more expensive than working with memory. But while debugging the values that were being stored in Redis, I found out that most of them were None: the metric name as the key, and None as the value. In my second attempt I tried to fix the cache usage itself. I returned to the native code, added some debugging for the cache, and the result was the same as with Redis: tons of None values stored in the cache. The patch above fixes the cache usage by only storing values that are not None. I profiled the code with and without my patch, and it seems that storing None values in the cache (to make this function skip non-matching metrics really fast) doesn't reduce CPU usage, so my patch shouldn't reduce capacity.
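
For readers following along, here is a simplified sketch of the caching pattern the patch changes (not the actual carbon source; the class and attribute names are condensed). Every incoming metric is matched against every rule, and in the unpatched code a metric that does not match still gets a cache entry with a None value, so the cache grows with the number of distinct metric names rather than with the number of aggregated ones:

class AggregationRuleSketch(object):
  def __init__(self, regex, output_template):
    self.regex = regex                    # compiled pattern built from the rule's input pattern
    self.output_template = output_template
    self.cache = {}                       # metric_path -> aggregate metric name (or None)

  def get_aggregate_metric(self, metric_path):
    if metric_path in self.cache:
      return self.cache[metric_path]

    result = None
    match = self.regex.match(metric_path)
    if match:
      # interpolate the matched fields into the output template
      result = self.output_template % match.groupdict()

    # The unpatched code cached unconditionally (self.cache[metric_path] = result),
    # which also stores None for every non-matching metric. The patch above
    # only caches real results:
    if result:
      self.cache[metric_path] = result
    return result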

frlinux commented 10 years ago

@dkulikovsky nice. Am I right to think that a similar patch is needed for relay though?

dkulikovsky commented 10 years ago

In the megacarbon branch this patch is only needed where an aggregator, or a relay with aggregated-consistent-hashing, is used; both of them use rules.py to calculate destinations for metrics.

frlinux commented 10 years ago

I have applied that fix but it does not solve it for me; I restarted a few hours ago and memory still grows as rapidly as before (see graph).

[image: carbon_relay_memusage]

dkulikovsky commented 10 years ago

This patch only works in the megacarbon branch.

zstyblik commented 10 years ago

Thanks for the patch, although I'm not sure if it fixed anything for us. But I've applied it anyway :)

punnie commented 10 years ago

This patch greatly improves memory usage when using relays and aggregators with aggregated-consistent-hashing in 0.9.12, but it still doesn't solve the issue for me.

After processing about 1M metrics, 2 relays + 2 aggregators grew until they had eaten about 2.4GiB of RAM in total, which is not that problematic on its own. The problem is that the whole system isn't receiving any metrics right now, and the memory hasn't been freed.

Any pointers to what causes this?

bruce-lyft commented 9 years ago

I believe the root cause is an unbounded dict that is used as a cache for the mapping between a metric name and the aggregation metric it belongs to. The PR above adds an LRU cache for this instead of a naked Python dict, with a new setting to tune it.

This change bounds the memory use at the expense of possibly more CPU (when there's a cache miss, you have to evaluate the aggregation rules).
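
Roughly, the idea is to bound the metric-name-to-aggregate cache and evict the least recently used entries. A minimal sketch, not the PR's actual code (the max_size default here stands in for the new tunable setting):

from collections import OrderedDict

class BoundedLRUCache(object):
  """Dict-like cache that holds at most max_size entries, evicting LRU first."""

  def __init__(self, max_size=100000):
    self.max_size = max_size
    self._data = OrderedDict()

  def get(self, key, default=None):
    try:
      value = self._data.pop(key)       # pop and re-insert to mark as most recently used
    except KeyError:
      return default
    self._data[key] = value
    return value

  def set(self, key, value):
    if key in self._data:
      self._data.pop(key)
    elif len(self._data) >= self.max_size:
      self._data.popitem(last=False)    # drop the least recently used entry
    self._data[key] = value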

jamesjuran commented 9 years ago

I can confirm that this is indeed the root cause. I implemented a similar LRU cache using lru-dict and it solved the unbounded memory growth, but I haven't gotten it working well enough to submit.
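
For reference, usage of the lru-dict package is essentially drop-in for a plain dict (illustrative only; the size below is an arbitrary assumption, not the value used in that implementation):

from lru import LRU   # pip install lru-dict

cache = LRU(100000)   # keeps at most 100000 entries, evicting the least recently used
cache['one_min.agg_test.test995.m01'] = 'one_min.agg.test995'
if 'one_min.agg_test.test995.m01' in cache:
  aggregate = cache['one_min.agg_test.test995.m01']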

bruce-lyft commented 9 years ago

@jamesjuran We've been running with this fix for 2 days with pretty good results:

[image]

klynch commented 9 years ago

To keep Python 2.6 compatibility, you can add a requirement for the ordereddict package.
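
The compatibility shim would just be a guarded import (illustrative):

try:
  from collections import OrderedDict   # Python 2.7+
except ImportError:
  from ordereddict import OrderedDict   # backport package for Python 2.6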

tristanbes commented 9 years ago

I can confirm that too.

The server is crashing because 100% of memory is used by Python. What did you do to fix this, please?

[image: 2015-03-30_1136]

obfuscurity commented 9 years ago

@jamesjuran @bruce-lyft Anyone have a patch suitable for a PR?

mleinart commented 9 years ago

Has be2e4f0d919b4cbbaf6743eebde4e41e3c775e73 improved things for anyone?

I'm not sure a bounded LRU cache makes a lot of sense in this case. Say we have a relay with this problem:

If we have an LRU cache that can hold 100,000 metrics, we cache the first 100k results, the next 100k boots those last 100k from the cache, the next 100k boots the prior... Basically we just eat memory and CPU and never get the benefit of the cache.

I can see some scenarios where a bounded cache might be of some benefit, e.g. where there is a wide distribution of metric frequencies being sent through (some at 1s, 10s, 60s, 10m, whatever), but that doesn't seem like the typical case the carbon relay/aggregator sees.

I also wonder how much help this cache actually is - perhaps someone with a rather large aggregator rule set could try removing it to see what happens to CPU usage?
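
A quick standalone simulation of the thrashing scenario described above, using Python 3's functools.lru_cache as a stand-in for the proposed bounded cache (the sizes are illustrative):

from functools import lru_cache

@lru_cache(maxsize=100000)
def resolve(metric_name):
  # stand-in for evaluating the aggregation rules against one metric name
  return None

metrics = ['metric.%d' % i for i in range(300000)]
for _ in range(3):            # three "intervals", each sending the same 300k metrics in order
  for m in metrics:
    resolve(m)

# hits stays at 0: every new lookup evicts an entry that would be needed again
# before it is reused, so the cache burns memory and CPU without ever helping.
print(resolve.cache_info())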

frlinux commented 9 years ago

I never responded to this thread, apologies. The issue we had in our setup was with relays using hashing, and it was a bug in the Python packages we used, which had a memory leak in hashlib (https://lists.fedoraproject.org/pipermail/scm-commits/2012-July/832476.html).

After patching this, we never had a problem with relays again; they have been rock stable since, as long as you do not feed them too many metrics.

I have never used carbon-aggregator, so I cannot comment on that one.

bruce-lyft commented 9 years ago

@mleinart The problem with the fix in be2e4f0 is that memory use is still unbounded. In our case, we have new metrics coming in on a regular basis, with a short time horizon (collectd/statsd metrics that are per-host, for ephemeral hosts, but which are aggregated; for example, CPU average for a group of hosts). With only your change we see carbon-aggregator memory use grow without bound and eventually crash. I believe that a change like the PR I submitted (https://github.com/graphite-project/carbon/pull/335) is still required.

obfuscurity commented 9 years ago

@mleinart @bruce-lyft Would it make sense to offer something like #335 as an optional strategy?

obfuscurity commented 8 years ago

I believe the aggregator leak in #185 is almost certainly the result of feeding aggregated metrics back into the aggregator loop as described in #455 and #560. There's a proposed fix in #439 but it needs some TLC before we can merge it. There's also an external fix applied to https://github.com/criteo-forks/carbon/commit/825693c97e57c7cdc36d0b0d758694010ce2be53 but I'd like to get feedback from others before we consider either of these.

Note that I'm not seeing any memory leaks in relays running at moderate volume (100k points/sec).

/cc @mleinart @drawks @iain-buclaw-sociomantic

obfuscurity commented 8 years ago

Adding to my previous comment, I believe that any previous relay memory leaks have already been fixed in master. The aggregator "leak" that @bruce-lyft describes is a consequence of the buffer dictionary memory footprint as it grows to accommodate the maximum concurrent number of buffers. Any existing memory should be reused, and my tests corroborate this hypothesis.

I agree with @mleinart's sentiment about the LRU cache not being the right approach. This would short-circuit Carbon's intended design and effectively trade memory for CPU consumption.

I have some thoughts about expiring metric buffers but I'm not sure how to tackle that yet.

Unless anyone has a reproducible case for memory leaks that aren't a natural byproduct of datapoints being stored in Python dictionaries, I propose that we close this issue as resolved. I acknowledge that there are times when Carbon memory could grow unbounded (e.g. MAX_CACHE_SIZE = inf), but these are tunable options (e.g. MAX_CACHE_SIZE, MAX_QUEUE_SIZE) and Carbon provides us with the internal statistics needed to identify these conditions before they become fatal.
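
For anyone landing here from a search, a sketch of bounding those settings in carbon.conf (the values are illustrative only, not recommendations):

[cache]
# Cap the number of datapoints held in memory by the cache
MAX_CACHE_SIZE = 2000000
# Throttle disk writes and whisper file creation rather than queueing without limit
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 50

[relay]
# Cap the per-destination queue and push back on senders when it fills
MAX_QUEUE_SIZE = 100000
USE_FLOW_CONTROL = True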