influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

memory leak #15509

Closed by phemmer 2 months ago

phemmer commented 3 months ago

Relevant telegraf.conf

[[inputs.socket_listener]]
service_address = "tcp://1.2.3.4:8085"
tls_cert = "/etc/ssl/host/cert.crt"
tls_key = "/etc/ssl/host/cert.key"
tls_allowed_cacerts = ["/usr/local/share/ca-certificates/edge-ca.crt"]
content_encoding = "gzip"
keep_alive_period = "2m"

[[inputs.socket_listener]]
service_address = "tcp://:8185"
tls_cert = "/etc/ssl/host/cert.crt"
tls_key = "/etc/ssl/host/cert.key"
tls_allowed_cacerts = ["/usr/local/share/ca-certificates/edge-ca.crt"]
content_encoding = "gzip"
keep_alive_period = "2m"

Logs from Telegraf

.

System info

telegraf 1.30.2

Docker

No response

Steps to reproduce

  1. run the above config for a while
  2. ...

Expected behavior

no infinite memory growth

Actual behavior

infinite memory growth

Additional info

A few days ago we encountered an OOM on one of our hosts, caused by telegraf. On another host we also noticed elevated memory usage from telegraf, which was consuming approximately 41GB. After enabling pprof, heap profiles were captured a few days apart (by which point telegraf had grown to a few GB of memory). It's clear from the profiles that there are several memory leaks going on.

heap profile (attached image)

We did not start experiencing this issue until upgrading from version 1.26.0 to 1.30.2.

powersj commented 3 months ago

I narrowed it down to this PR: https://github.com/influxdata/telegraf/pull/13056, at least for the TCPListener's growth. @srebhan, can you pick this up on Monday?

Before that PR, the behavior was lots of "connection not a TCP connection (*tls.Conn)" errors logged after every message.
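
A rough sketch of why that error appeared, assuming the pre-#13056 logic set the keep-alive period by type-asserting the connection directly to *net.TCPConn (the function name and shape here are illustrative, not the actual Telegraf code):

package sketch

import (
	"fmt"
	"net"
	"time"
)

// setKeepAlive is a hypothetical stand-in for the pre-#13056 behavior: a
// *tls.Conn does not type-assert to *net.TCPConn, so every TLS connection
// would fail here with "connection not a TCP connection (*tls.Conn)".
func setKeepAlive(conn net.Conn, period time.Duration) error {
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return fmt.Errorf("connection not a TCP connection (%T)", conn)
	}
	if err := tcpConn.SetKeepAlive(true); err != nil {
		return err
	}
	return tcpConn.SetKeepAlivePeriod(period)
}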

How to reproduce

I have been using the following config:

[agent]
  debug = true
  omit_hostname = true
  flush_interval = "1s"

[[inputs.socket_listener]]
  service_address = "tcp://:8001"
  tls_cert = "testutil/pki/servercert.pem"
  tls_key = "testutil/pki/serverkey.pem"
  tls_allowed_cacerts = ["testutil/pki/server.pem", "testutil/pki/client.pem"]
  keep_alive_period = "5s"

[[outputs.file]]

And sending metrics like:

echo "metric value=$i $(date +%s%N)" | openssl s_client -connect localhost:8001 \
    -cert /home/powersj/telegraf/testutil/pki/client.pem \
    -key /home/powersj/telegraf/testutil/pki/clientkey.pem \
    -CAfile /home/powersj/telegraf/testutil/pki/cacert.pem \
    -verify 1 -tls1_2

After 10-20k metrics you can collect a profile (go tool pprof -png http://localhost:6060/debug/pprof/heap) and start to see the issue appear; it is clearly visible by 30k.

Profiles

These profiles are from checking out 121b6c8 and running the reproduction above:

10k messages ![image](https://github.com/influxdata/telegraf/assets/6453401/1bfdf071-57b2-4777-9bcd-5959aeba30a6)
20k messages ![image](https://github.com/influxdata/telegraf/assets/6453401/a85180ac-4d5e-4521-998a-1692e3cfb8db)
30k messages ![image](https://github.com/influxdata/telegraf/assets/6453401/ecf6e24f-acd5-4c8d-83a1-d94b9cae95d4)
30k messages with parent commit (110287f) ![debug_110287f-30k-1718395635](https://github.com/influxdata/telegraf/assets/6453401/89d314e9-59fd-4b37-90cc-7dd35b7af8ef)

phemmer commented 3 months ago

Ok, I think I see what's going on. It's an issue with TLS connections.

https://github.com/influxdata/telegraf/blob/df78bc23f003d9f2aebc76c3275d90db334cb625/plugins/common/socket/stream.go#L125-L127

This code translates conn to its underlying TCP connection, which is then stored: https://github.com/influxdata/telegraf/blob/df78bc23f003d9f2aebc76c3275d90db334cb625/plugins/common/socket/stream.go#L137

However, when the connection is torn down, the code uses the *tls.Conn, which is not what was stored: https://github.com/influxdata/telegraf/blob/df78bc23f003d9f2aebc76c3275d90db334cb625/plugins/common/socket/stream.go#L238

The likely solution is to store the *tls.Conn rather than the underlying conn. That keeps the lookup consistent, and teardown then starts from the top-most object in the hierarchy and propagates down on its own.
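
To make the mismatch concrete, here is a minimal sketch of the leak pattern described above, with illustrative names rather than the actual stream.go code: the connection registry is keyed by the unwrapped TCP connection, but teardown deletes the *tls.Conn, so entries are never removed.

package sketch

import (
	"crypto/tls"
	"net"
	"sync"
)

type listener struct {
	mu    sync.Mutex
	conns map[net.Conn]bool // hypothetical registry of open connections
}

func (l *listener) handle(conn net.Conn) {
	stored := conn
	if tlsConn, ok := conn.(*tls.Conn); ok {
		// Buggy: register the underlying TCP connection instead of the
		// *tls.Conn wrapper that teardown will later try to delete.
		stored = tlsConn.NetConn()
	}
	l.mu.Lock()
	l.conns[stored] = true
	l.mu.Unlock()

	// ... read metrics from conn ...

	// Teardown uses the original conn (the *tls.Conn). That key was never
	// stored, so the registered entry stays in the map forever and the map
	// grows with every TLS connection.
	l.mu.Lock()
	delete(l.conns, conn)
	l.mu.Unlock()
	conn.Close()
}

Under that reading, the fix is to key the registry on the *tls.Conn itself (the top-most wrapper), unwrap to the TCP connection only where TCP-specific options such as the keep-alive period are needed, and close the *tls.Conn on teardown so the shutdown propagates down to the underlying socket.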

srebhan commented 3 months ago

@phemmer could you provide a PR fixing this?

phemmer commented 3 months ago

I haven't had much time the last few days, but by the end of this week, or maybe next, I very likely will.

srebhan commented 2 months ago

@phemmer, please test PR #15589 and let me know if it fixes the issue! Thanks for the inspiration for that PR!