open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[Exporter/LoadBalancer] Increased Memory Utilization after bumping from 0.94.0 to 0.99.0 #33435

Open NickAnge opened 3 months ago

NickAnge commented 3 months ago

Component(s)

exporter/loadbalancing

What happened?

Description

Hello team.

We recently upgraded our internal collectors from version 0.94.0 to 0.99.0 and observed a rise in memory usage in the collectors of our load-balancing deployment, as depicted in the image below. The increase persisted even after updating to the latest version, 0.101.0.

Screenshot 2024-06-07 at 19 04 31

We enabled profiling on our collectors (via the pprof extension) and looked at inuse_memory and inuse_objects. I split my investigation across 3 pods with low, medium, and high memory usage.

Inuse Memory - Top

Low Memory Usage Pod

Screenshot 2024-06-07 at 19 08 07

Medium Memory Usage Pod

Screenshot 2024-06-07 at 19 08 40

High Memory Usage Pod

Screenshot 2024-06-07 at 19 08 48

Inuse Objects - Top

Low Memory Usage Pod

Screenshot 2024-06-07 at 19 10 19

Medium Memory Usage Pod

Screenshot 2024-06-07 at 19 10 02

High Memory Usage Pod

Screenshot 2024-06-07 at 19 10 12

Steps to Reproduce

  1. Deploy the collector as a load balancer (loadbalancing exporter) with version 0.94.0
  2. Bump the version to 0.101.0

Expected Result

We expected memory usage to remain the same over time after the version bump.

Actual Result

High memory usage after bumping the version

Collector version

0.101.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04") Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 20

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 95
    spike_limit_percentage: 15
  k8sattributes:
    passthrough: true

exporters:
  loadbalancing/spans:
    protocol:
      otlp:
        sending_queue:
          enabled: true
          num_consumers: 100
          queue_size: 500
        retry_on_failure:
          enabled: true
          initial_interval: 2s
          max_interval: 2s
          max_elapsed_time: 10s
        tls:
          insecure: true
        timeout: 1
    resolver:
      k8s:
        service: service
  loadbalancing/metrics:
    routing_key: metric
    protocol:
      otlp:
        sending_queue:
          enabled: true
          num_consumers: 50
          queue_size: 500
        retry_on_failure:
          enabled: true
          initial_interval: 2s
          max_interval: 2s
          max_elapsed_time: 10s
        tls:
          insecure: true
        timeout: 1
    resolver:
      k8s:
        service: service

extensions:
  health_check:
  pprof:
    endpoint: :1777

service:
  extensions: [ health_check, pprof ]
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter ]
      exporters: [ loadbalancing/spans ]
    logs:
      receivers: [ otlp ]
      processors: [ memory_limiter ]
      exporters: [ loadbalancing/spans ]
    metrics:
      receivers: [ otlp ]
      processors: [ memory_limiter, k8sattributes ]
      exporters: [ loadbalancing/metrics ]

Log output

No response

Additional context

No response

github-actions[bot] commented 3 months ago

Pinging code owners:

jpkrohling commented 3 months ago

Thank you for the detailed report, I'll take a look and try to reproduce it. In the meantime, can you try switching to the DNS resolver instead of the k8s resolver? I'm not 100% sure yet it would show a difference, but the DNS resolver is known to consume fewer resources in other situations.

    resolver:
      k8s:
        service: service
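
A DNS-based variant of that resolver block might look like the sketch below, shown for the `loadbalancing/spans` exporter from the report. The hostname is a placeholder (it assumes the backing collectors sit behind a headless Kubernetes Service), and the `port`, `interval`, and `timeout` values are illustrative rather than taken from this thread. The other `otlp` settings from the original config would stay as they are.

```yaml
exporters:
  loadbalancing/spans:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Placeholder hostname: assumed to be a headless Service that
        # resolves to the IPs of the backing collector pods.
        hostname: service.namespace.svc.cluster.local
        # Illustrative values: port of the backends, how often the
        # hostname is re-resolved, and the per-lookup timeout.
        port: 4317
        interval: 5s
        timeout: 1s
```

Running this temporarily on a single pod and comparing its heap profile against a pod still using the k8s resolver would show whether the extra memory is tied to the resolver.
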

NickAnge commented 3 months ago

Thanks @jpkrohling. We discussed replacing the K8s resolver with the DNS resolver internally. The conclusion was to stay with the K8s resolver, as it is faster at resolving the endpoints of the backing collectors in case of a rollout or outage.

Let me know if you need more information about the issue, and thanks a lot for taking a look.

jpkrohling commented 3 months ago

Can you temporarily replace it, and see if the memory profile is different? If we can isolate this behavior to this resolver specifically, it's easier to find a solution.

NickAnge commented 3 months ago

This memory issue happened only in our production environments (probably because of higher traffic), so I am not sure we can change it there, even temporarily :/. Did you manage to reproduce it in your setup?

jpkrohling commented 3 months ago

I wasn't able to try it out. I might be able to find some time later this week, but next week I'm AFK again. If anyone is interested in this issue, it would help me a lot to get confirmation that this is isolated to the k8s resolver.

github-actions[bot] commented 1 month ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dmedinag commented 2 weeks ago

Just pinging the owner of exporter/loadbalancing, @jpkrohling, to keep this issue from going stale.