open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Throttle exporting from persistence queue to reduce memory consumption #11018

Open Nav-Kpbor opened 1 month ago

Nav-Kpbor commented 1 month ago

Is your feature request related to a problem? Please describe.
My team and I have encountered an issue where our collectors consume a large amount of memory when re-ingesting telemetry from a file storage queue after a disruption event with our backend. In these tests we simulated an hour of connection failures to the backend to let the file storage queue grow. After the hour passed, we restored the connection and saw a spike in both exported telemetry and memory usage.

[screenshots: exported telemetry and collector memory usage during the test]

Here is an example of the behavior we see from the persistent sending queue during the test period. Notice how the sending queue size immediately drops to zero after reconnecting to the backend.

[screenshot: sending queue size over the test period]

It seems that on reconnecting to the backend, everything in the file storage queue gets consumed into an in-memory queue. We are hoping to control this memory spike so we can ensure memory won't pass a certain threshold when running on Windows VMs.

Describe the solution you'd like
Is there a feature we could add that throttles how quickly the consumers pull from a file storage queue and send to the backend endpoint? Something that lets us configure how many batches are pulled from the queue over a specified time frame.
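
A purely hypothetical sketch of what such a knob could look like on the sending queue is below; the read_throttle, batches_per_interval, and read_interval fields do not exist in the collector today and are only meant to illustrate the requested behavior.

exporters:
  otlp:
    sending_queue:
      storage: file_storage/backup
      num_consumers: 10
      # Hypothetical settings -- not part of the current collector config.
      # At most `batches_per_interval` batches would be read from the
      # persistent queue every `read_interval`.
      read_throttle:
        batches_per_interval: 50
        read_interval: 1s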

Describe alternatives you've considered
We have tried utilizing the memory limiter and the GOMEMLIMIT environment variable, but neither has been successful. My guess is the garbage collector won't reclaim the memory since the telemetry is still being actively sent. We have also tried reducing the number of consumers and the size of batches, but we are still seeing the spikes.
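
For reference, a minimal sketch of how the two can be combined, assuming the memory_limiter settings from the config below; the spike_limit_mib and GOMEMLIMIT values shown here are only illustrative and were not part of the original setup.

# memory_limiter should be the first processor in each pipeline.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024        # hard limit enforced by the processor
    spike_limit_mib: 256   # headroom kept below the hard limit

# In the environment of the collector service (Go runtime soft memory
# limit), e.g.:
#   GOMEMLIMIT=900MiB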

Additional context
Collector version: v0.99 (contrib)
Tested on: Windows Server 2016

Here is the config we used for testing in case there are any config changes we could make to improve memory usage with the current version of the collector.

extensions:
  health_check:
    endpoint: localhost:4313
  file_storage/backup:
    directory: {Directory of Collector on the machine}
    compaction:
      on_rebound: true
      directory: {Directory of Collector on the machine}
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 10s

exporters:
  otlp:
    endpoint: "http://{IP of backend server}:4317"
    retry_on_failure:
      max_elapsed_time: 0
    sending_queue:
      queue_size: 1000
      storage: file_storage/backup
      num_consumers: 10
    tls:
      insecure: true

service:
  extensions: [health_check, file_storage/backup]
  telemetry:
    metrics:
      address: "localhost:4315"

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

jmacd commented 3 weeks ago

The num_consumers setting on the persistent queue is capable of throttling the export path; consider lowering it to 1 and working back up if recovery is too slow.
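
Against the config above, that would look something like the sketch below (only the sending_queue section shown; queue_size and storage unchanged):

exporters:
  otlp:
    sending_queue:
      storage: file_storage/backup
      queue_size: 1000
      # Start with a single consumer draining the persistent queue, then
      # raise this gradually if the backlog drains too slowly after recovery.
      num_consumers: 1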