newSizedChannel does not properly close on exporter shutdown #11401

[Open] Tarmander opened this issue 1 month ago

Tarmander commented 1 month ago

Describe the bug
This is related to an issue with the exporter/loadbalancingexporter. The k8s resolver would continuously call Shutdown() and create two new boundedMemoryQueues every time the endpoints were "updated" (roughly every 3 minutes). This behavior went unnoticed until the Memory Limiter Processor started to drop spans.

After investigating with the pprof extension, we realized we had an unbounded memory leak: each time an exporter and its queue were shut down, the underlying channel was never garbage collected. A new channel was allocated on every update until we ran OOM.

[pprof heap profile screenshot: memory allocated by newSizedChannel growing over time]
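
To make the retention mechanism concrete, here is a minimal standalone Go sketch (not collector code; the queue, startExporter, and callbacks names are invented for illustration) of the pattern discussed later in this thread: a channel's buffer is only freed once nothing references it, so a never-unregistered callback closure that captures the queue keeps every "shut down" channel alive.

    // Illustrative only: each startExporter call registers a callback that
    // captures the queue, so even after shutdown() the channel buffers stay
    // reachable and are never garbage collected.
    package main

    import (
    	"fmt"
    	"runtime"
    )

    type queue struct {
    	ch chan []byte // stands in for the sized channel backing a boundedMemoryQueue
    }

    var callbacks []func() int // stands in for callbacks registered with a meter

    func startExporter(size int) *queue {
    	q := &queue{ch: make(chan []byte, size)}
    	// The closure captures q, so q.ch stays reachable for as long as the
    	// callback stays registered.
    	callbacks = append(callbacks, func() int { return cap(q.ch) })
    	return q
    }

    func (q *queue) shutdown() {
    	close(q.ch) // closing does not free the buffer while q is still referenced
    }

    func main() {
    	for i := 0; i < 5; i++ { // pretend the k8s resolver updated endpoints 5 times
    		q := startExporter(100_000)
    		q.shutdown()
    	}
    	runtime.GC()
    	var m runtime.MemStats
    	runtime.ReadMemStats(&m)
    	// All five channel buffers are still live because the callbacks slice
    	// still references the closures that capture them.
    	fmt.Printf("registered callbacks: %d, heap in use: %d bytes\n", len(callbacks), m.HeapInuse)
    }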

Steps to reproduce
Configure the k8s resolver to point to a service with many endpoints (the more endpoints, the faster memory grows). Run with the pprof extension enabled to watch the memory held by newSizedChannel increase over time.

What did you expect to see? All exporters/queues/channels to be properly Shutdown() and GC'd.

What did you see instead? Channels in existing exporter queues were not released, and they eventually consumed all available memory in the pod.

What version did you use? v0.105.0

What config did you use?

receivers:
  otlp:
    protocols:
      grpc: { }
      http: { }
processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        sending_queue:
          queue_size: 100000
          num_consumers: 25
    resolver:
      k8s:
        service: opentelemetry-global-gateway-collector-headless.opentelemetry-global-collector
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: localhost:1777
service:
  extensions: [ health_check, zpages, pprof ]
  telemetry:
    logs:
      level: info
      encoding: json
    metrics:
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, batch ]
      exporters: [ loadbalancing ]

Environment
OS: Ubuntu 22.04
Compiler: go1.22.6
Kubernetes version: apiVersion: opentelemetry.io/v1beta1

madaraszg-tulip commented 5 days ago

I have noticed a very similar issue using the loadbalancing exporter in Grafana Alloy, which uses this component. Here are some Pyroscope screenshots:

[Pyroscope screenshots showing the same memory growth pattern]

madaraszg-tulip commented 11 hours ago

https://github.com/open-telemetry/opentelemetry-collector/blob/v0.114.0/exporter/exporterhelper/internal/metadata/generated_telemetry.go#L58-L92

    _, err = builder.meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
        o.ObserveInt64(builder.ExporterQueueCapacity, cb(), opts...)
        return nil
    }, builder.ExporterQueueCapacity)

The metric.Registration values returned by these RegisterCallback calls are discarded. They are the handles needed to unregister the callbacks when the exporter is shut down; until then, the callback closures keep referencing the queue. I assume they should be returned to the caller, exporter/exporterhelper/internal/queue_sender's Start(), so they can be unregistered in Shutdown().
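
A rough sketch of that idea, using the OpenTelemetry Go metric API; the Builder type, the InitExporterQueueCapacity helper, and the Shutdown method here are illustrative, not the actual generated code:

    // Illustrative sketch, not the actual generated_telemetry.go code: keep the
    // metric.Registration returned by RegisterCallback so Shutdown can unregister
    // the callback and release the references its closure captures.
    package telemetry

    import (
    	"context"
    	"errors"

    	"go.opentelemetry.io/otel/metric"
    )

    type Builder struct {
    	meter                 metric.Meter
    	ExporterQueueCapacity metric.Int64ObservableGauge

    	// registrations holds the handles that the current generated code discards.
    	registrations []metric.Registration
    }

    // InitExporterQueueCapacity (hypothetical helper) registers the capacity
    // callback and remembers its registration for later cleanup.
    func (b *Builder) InitExporterQueueCapacity(cb func() int64, opts ...metric.ObserveOption) error {
    	reg, err := b.meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
    		o.ObserveInt64(b.ExporterQueueCapacity, cb(), opts...)
    		return nil
    	}, b.ExporterQueueCapacity)
    	if err != nil {
    		return err
    	}
    	b.registrations = append(b.registrations, reg)
    	return nil
    }

    // Shutdown unregisters every callback so the closures (and the queue state
    // they capture) become unreachable and can be garbage collected.
    func (b *Builder) Shutdown() error {
    	var errs []error
    	for _, reg := range b.registrations {
    		errs = append(errs, reg.Unregister())
    	}
    	b.registrations = nil
    	return errors.Join(errs...)
    }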