grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Memory eating when replicating metrics or maybe due to SSL usage #439

Closed berrfred closed 2 years ago

berrfred commented 2 years ago

Hi, it seems that Carbon C Relay is eating up memory linearly when using a configuration that duplicates each received metric to two backend servers. There is no issue when sharding metrics between Carbon Cache instances; in that case memory usage remains constant.

Is this a real bug or a bad configuration?

Here is the configuration in use:

# Receive external metrics (e.g. tgds) on encrypted connections
listen
  type linemode transport plain ssl /etc/pki/tls/private/tgds_2048_cert_key.pem
    8443 proto tcp
  ;

# Receive internal metrics (e.g. collectd) on clear connection
listen
  type linemode transport plain
    2443 proto tcp
  ;

# Replicate metrics towards Graphite backends (SANS-V1 and SANS-V2)
cluster graphite
  forward
    172.16.33.1:2443
    172.16.33.2:2443
  ;

# Relay own metrics
match ^carbon\.
  send to graphite
  stop
  ;

# Relay collectd metrics
match ^collectd\.
  send to graphite
  stop
  ;

# Relay tgds metrics
match ^tgds\.
  send to graphite
  stop
  ;

Below is a Grafana graph of the memory usage:

[Grafana screenshot: memory usage growing over time]

grobian commented 2 years ago

I think the problem might be SSL, or does your sharding configuration also use an SSL listener?

grobian commented 2 years ago

(in any case, it shouldn't happen, so this is a bug)

berrfred commented 2 years ago

You are right, this could be due to SSL usage...

When memory use is increasing we are not only forwarding to two backend servers but also receiving the metrics over an SSL connection. Carbon C Relay on the backend servers has no such issue; there it shards the metrics across local Carbon Cache instances and receives them in the clear (no SSL).
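
For reference, a simplified sketch of the kind of configuration running on the backend servers (the ports, instance names and catch-all match rule here are illustrative placeholders, not our exact settings):

# Receive metrics from the frontend relays in clear text
listen
  type linemode transport plain
    2443 proto tcp
  ;

# Shard metrics across the local Carbon Cache instances
cluster local_caches
  carbon_ch
    127.0.0.1:2103=a
    127.0.0.1:2203=b
  ;

match *
  send to local_caches
  stop
  ;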

berrfred commented 2 years ago

Forgot to mention: we are using release 3.7.2 on CentOS 7.

grobian commented 2 years ago

So I've been trying to reproduce this scenario, but I don't see leaks.

Perhaps an odd question, but would it be possible to (briefly) run the relay under valgrind to confirm whether there are (unreachable) leaks? valgrind --leak-check=full carbon-c-relay -f ... should do the trick.
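
Something along these lines should work; the config and log paths below are just placeholders, and you would stop the relay after it has handled some traffic so valgrind prints its leak summary (the "definitely lost" blocks are the interesting ones):

# paths are placeholders, adjust to your setup
valgrind --leak-check=full --show-leak-kinds=all \
  --log-file=/tmp/carbon-c-relay-valgrind.log \
  carbon-c-relay -f /etc/carbon-c-relay.conf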

berrfred commented 2 years ago

> So I've been trying to reproduce this scenario, but I don't see leaks.
>
> Perhaps an odd question, but would it be possible to (briefly) run the relay under valgrind to confirm whether there are (unreachable) leaks? valgrind --leak-check=full carbon-c-relay -f ... should do the trick.

Thank you for your time and the suggestion, I'll try to run it under valgrind in our test plant ASAP. In the meantime I upgraded from 3.7.2 to 3.7.3 and things look better, but we'll see in a couple of days.

[Grafana screenshot: memory usage after the upgrade to 3.7.3]

berrfred commented 2 years ago

It looks like the memory issue has been solved in release 3.7.3: memory usage is not growing any longer. I have upgraded both the test and production environments, with the same good behaviour.

[Grafana screenshot: stable memory usage on 3.7.3]

berrfred commented 2 years ago

I propose closing this issue, as it is solved by release 3.7.3.

grobian commented 2 years ago

Thanks for the feedback. It's somewhat disturbing that I can't recall making a fix that would explain this, but if 3.7.3 really is holding up stable under TLS use, then so be it!