Open · nicacioliveira opened 6 days ago
Pinging code owners:
exporter/loadbalancing: @jpkrohling
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Please consider taking a heap dump with pprof so we can investigate what is happening.
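For reference, one way to capture such a profile is the collector's pprof extension. A minimal sketch, assuming the extension's default port (the host placeholder is illustrative):

```yaml
extensions:
  pprof:
    endpoint: 0.0.0.0:1777   # the pprof extension listens on port 1777 by default

service:
  extensions: [pprof]

# Then, from a host that can reach the collector:
#   go tool pprof http://<collector-host>:1777/debug/pprof/heap
```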
I'll take some time to do this, but the cause seems clear: with streamID-based routing, we will eventually run out of memory because streams are kept in memory forever. The balancing here needs a "refresh", like the one in the deltatocumulative processor.
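For context, the deltatocumulative processor already exposes the staleness knob being referenced here; a minimal sketch of that existing option:

```yaml
processors:
  deltatocumulative:
    # Streams that receive no new samples within this window are
    # evicted from the processor's in-memory tracking state.
    max_stale: 5m
```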
Component(s)
exporter/loadbalancing
What happened?
Description
I'm facing an issue with high cardinality, and I've noticed that the loadbalancing exporter needs a max_stale mechanism, similar to the one in the deltatocumulative processor. Metrics with new streamIDs keep arriving over time, and because stale streams are never evicted, load-balancer instances consume memory indefinitely (see the sketch below).
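For illustration only, here is what such an option might look like. The `max_stale` key below does not exist on the loadbalancing exporter today and is purely hypothetical; `routing_key: streamID` and the `resolver` block are existing options, shown with placeholder values:

```yaml
exporters:
  loadbalancing:
    routing_key: streamID   # existing option; creates per-stream routing state
    # Hypothetical knob mirroring the deltatocumulative processor:
    # evict routing state for streams idle longer than this duration.
    max_stale: 5m
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-backend.observability   # placeholder service name
```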
Steps to Reproduce
I don’t have a specific way to reproduce this issue in a controlled environment, as it occurs in production. To manage it, I have to constantly restart the load-balancing pods to prevent memory exhaustion.
Evidence: To mitigate the issue, I’ve set a minimum of 25 pods, but after a few hours, memory becomes exhausted due to the lack of a max_stale mechanism. After several days, I’m forced to perform a full rollout to reset all the pods.
Collector version
v0.110.0
Environment information
Environment
Kubernetes cluster on EKS
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response