open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

OpenTelemetry Collector does not gracefully shut down, losing metrics on spot instance termination #33441

Open Rommmmm opened 3 months ago

Rommmmm commented 3 months ago

Component(s)

datadogexporter

What happened?

Description

We are currently experiencing an issue with the OpenTelemetry Collector running in our Kubernetes cluster, which is managed by Karpenter. Our setup involves spot instances, and we've noticed that when Karpenter terminates these instances, the OpenTelemetry Collector does not seem to shut down gracefully. Consequently, we are losing metrics and traces that are presumably still being processed or exported.

Steps to Reproduce

  1. Deploy the OpenTelemetry Collector on a Kubernetes cluster with Karpenter managing spot instances.
  2. Simulate a spot instance termination (or simply terminate a node in the cluster).
  3. Observe that the metrics and traces during the termination period are lost.

Expected Result

The OpenTelemetry Collector should flush all pending metrics and traces before shutting down to ensure no data is lost during spot instance termination.

Actual Result

During a spot termination event triggered by Karpenter, the OpenTelemetry Collector shuts down without flushing all the data, causing loss of metrics and traces.

Collector version

0.95.0

Environment information

Environment

Kubernetes Version: 1.27
Karpenter Version: 0.35.2
Cloud Provider: AWS

OpenTelemetry Collector configuration

connectors:
  datadog/connector: null
exporters:
  datadog:
    api:
      fail_on_invalid_key: true
      key: <KEY>
      site: <SITE>
    host_metadata:
      enabled: false
    metrics:
      histograms:
        mode: distributions
        send_count_sum_metrics: true
      instrumentation_scope_metadata_as_tags: true
      resource_attributes_as_tags: true
      sums:
        cumulative_monotonic_mode: raw_value
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 600s
      max_interval: 20s
    sending_queue:
      enabled: true
      num_consumers: 100
      queue_size: 3000
    traces:
      trace_buffer: 30
  debug: {}
  logging: {}
extensions:
  health_check:
    endpoint: <HEALTHCHECK>
processors:
  batch:
    send_batch_max_size: 3000
    send_batch_size: 2000
    timeout: 3s
  memory_limiter:
    check_interval: 5s
    limit_mib: 1800
    spike_limit_mib: 750
receivers:
  carbon:
    endpoint: <CARBON>
  otlp:
    protocols:
      grpc:
        endpoint: <ENDPOINT>
      http:
        endpoint: <ENDPOINT>
  prometheus:
    config:
      scrape_configs:
      - job_name: <JOB_NAME>
        scrape_interval: 30s
        static_configs:
        - targets:
          - <ENDPOINT>
  statsd:
    aggregation_interval: 60s
    endpoint: <ENDPOINT>
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
    metrics:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
      - carbon
      - statsd
      - prometheus
      - datadog/connector
    traces:
      exporters:
      - datadog
      - datadog/connector
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
  telemetry:
    metrics:
      address: <ENDPOINT>

Log output

No response

Additional context

I noticed that there is a terminationGracePeriodSeconds setting in the Kubernetes Deployment spec that can give workloads more time to shut down. However, this option does not seem to be exposed in the OpenTelemetry Collector Helm chart.

I would like to suggest the following enhancements:

  1. Expose the terminationGracePeriodSeconds parameter in the Helm chart to allow users to specify a custom grace period (a rough sketch of what this could look like follows below).
  2. Review the shutdown procedure of the OpenTelemetry Collector to ensure that it attempts to flush all buffered data before exiting.
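
A minimal sketch of what (1) could look like, assuming a hypothetical terminationGracePeriodSeconds key in the chart's values.yaml that is passed straight through to the rendered pod spec; the pod spec field itself is standard Kubernetes and defaults to 30 seconds when unset:

# values.yaml -- hypothetical key, not currently exposed by the chart per this issue
terminationGracePeriodSeconds: 120

# What the chart would render into the Deployment's pod template
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
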
github-actions[bot] commented 3 months ago

Pinging code owners for exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96. See Adding Labels via Comments if you do not have permissions to add labels yourself.

songy23 commented 3 months ago

@Rommmmm could you try upgrading to v0.102.0 and see if the issue persists? This should have been fixed in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33291.

kevinh-canva commented 3 months ago

Hi, I'm seeing the same issue. Updating to v0.102 doesn't help; we are still losing metrics.

Rommmmm commented 3 months ago

@songy23 sorry for taking so long, but unfortunately upgrading didn't help.

ancostas commented 3 months ago

@Rommmmm Does the collector not gracefully shut down at all, or is it being killed before it can shut down gracefully?

The mention of terminationGracePeriodSeconds makes it sound like the latter, which may be user error (i.e. a process can't finish its exit routine if it is forcefully interrupted and killed in the middle of it).

Rommmmm commented 2 months ago

> @Rommmmm Does the collector not gracefully shut down at all, or is it being killed before it can shut down gracefully?
>
> The mention of terminationGracePeriodSeconds makes it sound like the latter, which may be user error (i.e. a process can't finish its exit routine if it is forcefully interrupted and killed in the middle of it).

It's not shutting down gracefully.

ancostas commented 1 month ago

@Rommmmm is it being killed or terminated? Processes being killed is not a graceful shutdown scenario AFAIK.

What I'm guessing is happening is that your terminationGracePeriodSeconds is too short, so while the process is shutting down gracefully (e.g. flushing queued data to a vendor backend), the control plane simply kills it since it took too long.
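
If that's the case, one rough way to size the grace period is against the exporter settings in the configuration above: with retry_on_failure.max_elapsed_time set to 600s, a final flush that hits retries can in principle run for up to 10 minutes, so the Kubernetes default of 30 seconds would almost certainly cut it off. A minimal pod spec sketch, with an illustrative value rather than a recommendation:

# terminationGracePeriodSeconds should cover the collector's shutdown work:
# draining the batch processor (timeout: 3s in the config above) plus any
# in-flight retries (retry_on_failure.max_elapsed_time: 600s).
# 660 here is only an illustrative value, not a recommendation.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 660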