open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

LoadBalancer Exporter Does Not Release Memory When Using StreamIDs for Metrics #35810

Open nicacioliveira opened 6 days ago

nicacioliveira commented 6 days ago

Component(s)

exporter/loadbalancing

What happened?

Description

I'm facing a high-cardinality issue, and I've noticed that we need a max_stale mechanism similar to the one in the delta-to-cumulative processor. Metrics with new streamIDs keep accumulating over time, so LoadBalancer instances consume memory indefinitely.

Steps to Reproduce

I don’t have a specific way to reproduce this issue in a controlled environment, as it occurs in production. To manage it, I have to constantly restart the load-balancing pods to prevent memory exhaustion.

Evidence: To mitigate the issue, I’ve set a minimum of 25 pods, but after a few hours, memory becomes exhausted due to the lack of a max_stale mechanism. After several days, I’m forced to perform a full rollout to reset all the pods.

[image attachment: memory usage graph]

Collector version

v0.110.0

Environment information

Environment

Kubernetes cluster on EKS

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

github-actions[bot] commented 6 days ago

Pinging code owners:

atoulme commented 5 days ago

Please consider taking a heap dump with pprof so we can investigate what is happening.
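For reference, a heap profile can be exposed via the Collector's pprof extension; the endpoint below is the extension's conventional default, shown here as an illustrative config fragment:

```yaml
extensions:
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [pprof]
```

With the extension enabled, a heap dump can then be captured from a running pod with `go tool pprof http://localhost:1777/debug/pprof/heap`.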

nicacioliveira commented 4 days ago

> Please consider taking a heap dump with pprof so we can investigate what is happening.

I'll take some time to do this, but the cause seems clear: with streamID-based routing we will eventually run out of memory, because streams are kept in memory forever. The load balancer needs a "refresh" mechanism like the one in the delta-to-cumulative processor.
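The mechanism being referenced is the `max_stale` setting of the deltatocumulative processor, which drops tracked streams after a period of inactivity; a configuration fragment along these lines (the 5m value is illustrative):

```yaml
processors:
  deltatocumulative:
    # Streams not seen for this long are forgotten, bounding memory.
    max_stale: 5m
```

The request in this issue is for the loadbalancing exporter to gain an equivalent expiry knob for its streamID routing state.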