open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

OTel Collector 0.104.0+ issues when using linkerd-proxy side car container #34565

Open Tyrion85 opened 2 months ago

Tyrion85 commented 2 months ago

Component(s)

No response

What happened?

Description

When using OpenTelemetry Collector 0.104.0+ (up to 0.106.1), linkerd-proxy logs an enormous number of "HTTP service in fail-fast" messages and its CPU usage is roughly 100x normal.

(Two screenshots, 2024-08-09: linkerd-proxy log volume and CPU usage.)

This issue could just as well be posted in the linkerd community, but linkerd is a generic proxy, and OpenTelemetry Collector 0.103.1 does not cause these problems.
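One collector change that, if I recall correctly, landed around 0.104.0 is the component.UseLocalHostAsDefaultHost feature gate being enabled by default, which switches default server endpoints from 0.0.0.0 to localhost. Every endpoint in the config below is bound explicitly to ${env:MY_POD_IP}, so this is only a guess at a suspect, but disabling the gate is a cheap way to rule it out. A minimal sketch, assuming the opentelemetry-collector Helm chart (the relay key in the ConfigMap suggests it) and that its command.extraArgs value is available:

# values.yaml sketch (assumed chart keys) - pass an extra flag to the collector binary
command:
  extraArgs:
  # revert to the pre-0.104.0 default-host behaviour for comparison
  - --feature-gates=-component.UseLocalHostAsDefaultHost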

Relevant collector configuration:

apiVersion: v1
data:
  relay: |
    connectors:
      spanmetrics:
        aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE
        dimensions:
        - default: GET
          name: http.method
        - name: http.status_code
        dimensions_cache_size: 50
        events:
          dimensions:
          - name: exception.type
          - name: exception.message
          enabled: true
        exclude_dimensions:
        - status.code
        exemplars:
          enabled: true
        histogram:
          explicit:
            buckets:
            - 10ms
            - 100ms
            - 250ms
            - 500ms
            - 750ms
            - 1s
            - 1500ms
            - 2s
            - 5s
        metrics_expiration: 0
        metrics_flush_interval: 15s
        resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    exporters:
      debug: {}
      otlp/quickwit:
        endpoint: ...
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://prometheus-operated.monitoring:9090/api/v1/write
        target_info:
          enabled: true
    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
      span/to_attributes:
        name:
          to_attributes:
            rules:
            .....
    receivers:
      jaeger:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:14250
          thrift_compact:
            endpoint: ${env:MY_POD_IP}:6831
          thrift_http:
            endpoint: ${env:MY_POD_IP}:14268
      opencensus: null
      otlp:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            cors:
              allowed_origins:
              - ....
            endpoint: ${env:MY_POD_IP}:4318
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${env:MY_POD_IP}:8888
      zipkin:
        endpoint: ${env:MY_POD_IP}:9411
    service:
      extensions:
      - health_check
      pipelines:
        logs:
          exporters:
          - debug
          processors:
          - memory_limiter
          - batch
          receivers:
          - otlp
        metrics:
          exporters:
          - prometheusremotewrite
          processors:
          - memory_limiter
          - batch
          receivers:
          - spanmetrics
        traces:
          exporters:
          - otlp/quickwit
          - spanmetrics
          processors:
          - batch
          - span/to_attributes
          receivers:
          - otlp
          - opencensus
          - zipkin
          - jaeger
      telemetry:
        metrics:
          address: ${env:MY_POD_IP}:8888
kind: ConfigMap
....
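If the fail-fast messages concern the collector's outbound connections rather than its receivers, taking one destination out of the mesh path helps attribute them. The prometheusremotewrite exporter above writes to Prometheus on port 9090; a sketch using linkerd's config.linkerd.io/skip-outbound-ports annotation, set through the chart's podAnnotations value (both of which I am assuming are available here):

# values.yaml sketch (assumed chart key) - bypass the linkerd proxy for remote-write traffic
podAnnotations:
  config.linkerd.io/skip-outbound-ports: "9090"

If the log volume and CPU drop with this in place, the regression is more likely in an exporter's connection handling than in the receivers.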

Steps to Reproduce

- OpenTelemetry Collector 0.104.0 or 0.106.1
- Linkerd 2.12.2 (though I suspect linkerd is only surfacing some other issue; that is how it usually goes with this service mesh)
- Storage backend does not seem to matter

A stripped-down configuration for narrowing this down is sketched below.
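To narrow down which component upsets the proxy, a stripped-down relay config (an illustrative sketch, not the config I actually run) that keeps only the OTLP receiver and the debug exporter removes all of the collector's outbound traffic through linkerd:

# minimal collector config sketch for bisecting the regression
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
processors:
  batch: {}
exporters:
  debug: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

If linkerd-proxy still goes into fail-fast with this on 0.104.0+, the change is on the receiver/server side; if not, adding otlp/quickwit, prometheusremotewrite and the spanmetrics pipeline back one at a time should point at the culprit.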

Expected Result

It looks like some sort of regression, as 0.103.1 works fine.

Actual Result

Collector version

0.104.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

Same as the relay configuration shown in the description above.

Log output

No response

Additional context

No response

github-actions[bot] commented 1 week ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.