Closed: phemmer closed this issue 2 months ago
I narrowed it down to this PR: https://github.com/influxdata/telegraf/pull/13056 at least for the TCPListener's growth. @srebhan can you pick this up on Monday?
The behavior before that PR was lots of log messages, after every metric, about the connection not being a TCP connection (*tls.Conn).
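For context, here is a minimal sketch of the kind of keep-alive handling involved (the setKeepAlive helper and package name are hypothetical, not the actual stream.go code): a plain type assertion on a TLS-wrapped connection fails, which is what produces that kind of "not a TCP connection" message, and unwrapping via (*tls.Conn).NetConn (Go 1.18+) is what exposes the underlying TCP connection.

package example

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// setKeepAlive is an illustrative helper: it shows why asserting directly on a
// *tls.Conn fails and why the listener unwraps it before enabling TCP keep-alive.
func setKeepAlive(conn net.Conn, period time.Duration) error {
	// A *tls.Conn is not a *net.TCPConn, so this assertion fails for TLS
	// connections and the caller ends up logging an error per message.
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		if err := tcpConn.SetKeepAlive(true); err != nil {
			return err
		}
		return tcpConn.SetKeepAlivePeriod(period)
	}

	// Unwrapping the TLS connection exposes the underlying TCP connection.
	if tlsConn, ok := conn.(*tls.Conn); ok {
		if tcpConn, ok := tlsConn.NetConn().(*net.TCPConn); ok {
			if err := tcpConn.SetKeepAlive(true); err != nil {
				return err
			}
			return tcpConn.SetKeepAlivePeriod(period)
		}
	}

	return fmt.Errorf("connection is not a TCP connection (%T)", conn)
}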
I have been using the following config:
[agent]
debug = true
omit_hostname = true
flush_interval = "1s"
[[inputs.socket_listener]]
service_address = "tcp://:8001"
tls_cert = "testutil/pki/servercert.pem"
tls_key = "testutil/pki/serverkey.pem"
tls_allowed_cacerts = ["testutil/pki/server.pem", "testutil/pki/client.pem"]
keep_alive_period = "5s"
[[outputs.file]]
And sending metrics like:
echo "metric value=$i $(date +%s%N)" | openssl s_client -connect localhost:8001 \
-cert /home/powersj/telegraf/testutil/pki/client.pem \
-key /home/powersj/telegraf/testutil/pki/clientkey.pem \
-CAfile /home/powersj/telegraf/testutil/pki/cacert.pem \
-verify 1 -tls1_2
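For anyone who prefers to drive this from Go instead of a shell loop, here is a rough sketch that sends one metric per TLS connection, assuming the same certificate paths and listener address as the config above (and that the test certificates are valid for localhost):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	// Client certificate and CA from the telegraf testutil pki directory.
	cert, err := tls.LoadX509KeyPair("testutil/pki/client.pem", "testutil/pki/clientkey.pem")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("testutil/pki/cacert.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cfg := &tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: pool}

	// Each metric goes over a fresh TLS connection, so the listener has to
	// set up and tear down tens of thousands of connections.
	for i := 0; i < 30000; i++ {
		conn, err := tls.Dial("tcp", "localhost:8001", cfg)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Fprintf(conn, "metric value=%d %d\n", i, time.Now().UnixNano())
		conn.Close()
	}
}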
After 10-20k metrics you can collect a profile (go tool pprof -png http://localhost:6060/debug/pprof/heap) and start to see the issue appear; it is certainly visible by 30k.
These profiles are from checking out 121b6c8 and running:
Ok, I think I see what's going on. It's an issue with TLS connections.
This code is translating conn to its underlying TCP connection, which is then stored:
https://github.com/influxdata/telegraf/blob/df78bc23f003d9f2aebc76c3275d90db334cb625/plugins/common/socket/stream.go#L137
However, when the connection is torn down, it uses the *tls.Conn, which isn't what got stored:
https://github.com/influxdata/telegraf/blob/df78bc23f003d9f2aebc76c3275d90db334cb625/plugins/common/socket/stream.go#L238
The solution is likely to store the *tls.Conn, not the underlying conn, so that the two paths are consistent, and also so that teardown starts from the top-most object in the hierarchy and lets the close propagate down on its own.
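To illustrate the mismatch (the tracker type and method names below are hypothetical, not the actual stream.go code): if the map of open connections is keyed by the unwrapped TCP connection, but teardown deletes by the *tls.Conn wrapper, the delete is a no-op and every connection entry leaks, which matches the unbounded growth in the heap profiles.

package example

import (
	"crypto/tls"
	"net"
	"sync"
)

type tracker struct {
	mu    sync.Mutex
	conns map[net.Conn]bool
}

// addConn mimics the buggy path: it stores the unwrapped TCP connection.
func (t *tracker) addConn(conn net.Conn) {
	key := conn
	if tlsConn, ok := conn.(*tls.Conn); ok {
		key = tlsConn.NetConn() // the unwrapped connection becomes the map key
	}
	t.mu.Lock()
	t.conns[key] = true
	t.mu.Unlock()
}

// removeConn mimics the teardown path: it is called with the *tls.Conn, which
// was never a key in the map, so the entry is never removed.
func (t *tracker) removeConn(conn net.Conn) {
	t.mu.Lock()
	delete(t.conns, conn) // no-op when conn is the *tls.Conn wrapper
	t.mu.Unlock()
}

Keying the map by the connection as accepted (the *tls.Conn) and letting Close() propagate to the underlying TCP connection keeps the add and remove paths consistent.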
@phemmer could you provide a PR fixing this?
I haven't had much time the last few days. But end of this week, or maybe next I very likely will.
@phemmer please test PR #15589 and let me know if this fixes the issue! Thanks for the inspiration for that PR!
Relevant telegraf.conf
Logs from Telegraf
System info
telegraf 1.30.2
Docker
No response
Steps to reproduce
...
Expected behavior
no infinite memory growth
Actual behavior
infinite memory growth
Additional info
So a few days ago we encountered an OOM on one of our hosts, which was due to telegraf. On another host we also noticed elevated memory usage from telegraf: it was consuming approximately 41 GB. After enabling pprof, heap profiles were captured a few days apart (by which point telegraf had grown to a few GB of memory). It is clear from the profiles that there are several memory leaks going on.
We did not start experiencing this issue until upgrading from version 1.26.0 to 1.30.2.