grafana / carbon-relay-ng

Fast carbon relay+aggregator with admin interfaces for making changes online - production ready

Carbon-relay-ng never releases RAM when there is a lot of traffic, memory leak #221

Closed obersoldat closed 6 years ago

obersoldat commented 7 years ago

So in my current setup, metrics are sent to a keepalived server, which forwards them to a haproxy server, which then balances them (layer 4) across the two carbon-relay-ng servers in an active-active setup. When I had no balancing protocol set up, one relay would have issues as it received most of the traffic: RAM usage would pile up on the first relay, while the other relay was fine, with no memory leaks.

Both relays have 4 cores and 16 GB of RAM each. What I have noticed is that up to 4 GB the usage is fine and RAM is released, but after that it simply hoards all the RAM it can. Once RAM is consumed, it is never released, up until OOM kills the process and starts it again. But then I lose some metrics that were queued at that time. I was hoping this was fixed in version carbon-relay-ng-0.9.2_2_g295c204-1.x86_64, but no luck.

```
# grep -i oom /var/log/messages
Sep 18 08:02:27 carbon-relay-ng-1 kernel: kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
Sep 18 08:02:27 carbon-relay-ng-1 kernel: [<ffffffff81184cfe>] oom_kill_process+0x24e/0x3c0
Sep 18 08:02:27 carbon-relay-ng-1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
```

Dieterbe commented 7 years ago

there are 2 known memory leaks:

furthermore, you should know that the go runtime will always hold on to memory: once it's requested from the OS, it will not be returned. for better or worse, this is a deliberate design characteristic of the go runtime, though the go team is discussing changing it. this measurement is thus more incidental and not that telling in terms of how much the go runtime has actually allocated, whether it's failing to properly release objects, etc. there are some good articles out there that explain this in more detail.

I have just committed c33700e, which adds measurement of actual RAM usage by the heap / allocated objects. Can you run the latest code and import the latest dashboard from this repo? You'll then see both the memory obtained from the system and the memory used by the heap, which will clarify what is going on.

finally, the big question is of course why it is allocating memory in the first place. this will depend on your configuration, in particular the bufSize settings, and on whether it needs to use the buffer.
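For context, a hedged sketch of where bufSize comes into play in `config.ini`; the destination host and the buffer value below are made up, and the exact option syntax should be checked against this repo's README:

```ini
# config.ini (fragment) -- illustrative values only
init = [
    # bufSize caps how many metrics may be buffered in memory for a
    # destination while it is down or slow. Each buffered metric costs
    # RAM, so a large bufSize combined with a lagging destination can
    # account for multi-GB memory growth under backpressure.
    'addRoute sendAllMatch carbon-default  graphite-a:2003 spool=true pickle=false bufSize=1000000',
]
```

If the destination keeps up, the buffer stays mostly empty; memory growth like the one reported suggests the buffer is actually being used.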

can you reproduce this with the latest code and post a snapshot using the latest dashboard out of this repo?

bzed commented 6 years ago

@obersoldat You could check #222 and the pull request #248 and see if they fix your issue. There is a memory leak on each reconnection to a destination.

obersoldat commented 6 years ago

@bzed I can no longer report on this. I have a Graphite cluster that monitors production and can't afford issues such as these, so I switched to carbon-c-relay a couple of weeks after starting this thread and have had absolutely no issues with it, for pennies of resources in comparison.

Dieterbe commented 6 years ago

nothing actionable here; we needed more info.