open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Collector randomly stops sending spans #31758

Closed MarcinGinszt closed 1 week ago

MarcinGinszt commented 8 months ago

Component(s)

No response

What happened?

Description

The OpenTelemetry Collector randomly stops sending spans. We encountered this twice this week. It happens to just one of the collector pods; the rest work correctly. We are notified by an alert about the sending queue being full. After inspecting the pod metrics, it turns out that this is caused by otelcol_exporter_sent_spans dropping to 0. [image]

There is nothing in the logs before the error about the sending queue being full.

Are there any additional ways to diagnose the issue before resorting to pprof?
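
One metrics-only check, as a minimal sketch: compare two scrapes of the collector's own Prometheus metrics and flag a pod whose sent-spans counter has stopped increasing while its sending queue is non-empty. `otelcol_exporter_sent_spans` is the metric named above; `otelcol_exporter_queue_size` is assumed to be exposed alongside it, and some collector versions may add a `_total` suffix to counters, so the names are parameters. The helper names are hypothetical, and the parser handles only simple exposition lines (no `}` inside label values).

```python
import re

# Matches simple exposition lines such as:
#   name{label="x"} 123.0    or    name 123
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def sum_metric(exposition_text, name):
    """Sum every sample of one metric (all label sets) in Prometheus text format."""
    total = 0.0
    for line in exposition_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(2))
    return total


def exporter_stalled(before, after,
                     sent="otelcol_exporter_sent_spans",
                     queue="otelcol_exporter_queue_size"):
    """True if no spans were sent between two scrapes while the queue is non-empty."""
    sent_delta = sum_metric(after, sent) - sum_metric(before, sent)
    return sent_delta <= 0 and sum_metric(after, queue) > 0
```

Feeding it two scrapes of the same pod taken a minute or so apart separates an idle pod (empty queue) from a stuck one (queue filling, nothing sent).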

Collector version

0.95.0

Environment information

Environment

https://github.com/utilitywarehouse/opentelemetry-manifests

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        keepalive:
          server_parameters:
            max_connection_age: 5m
            max_connection_age_grace: 1m
            max_connection_idle: 10m
        # Accept up to 4MB message
        max_recv_msg_size_mib: 4
      http:

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    # Kafka is limited to a 128MB payload, so we keep in mind
    # that as we can receive up to 4MB messages, we need to
    # keep the batching size low enough to not exceed Kafka's.
    timeout: 10ms
    send_batch_size: 30
    send_batch_max_size: 30
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: insert
      # (... some other attributes)

extensions:
  health_check: {}
  zpages: {}

exporters:
  kafka:
    protocol_version: 2.6.0
    client_id: "otel-collector"
    timeout: 2s
    partition_traces_by_id: true
    brokers:
      - kafka1.svc.cluster:9092
      - kafka2.svc.cluster:9092
      - kafka3.svc.cluster:9092
    topic: "otel.otlp_spans"
    auth:
      tls:
        ca_file: /kafka-client-certificate/ca.crt
        cert_file: /kafka-client-certificate/tls.crt
        key_file: /kafka-client-certificate/tls.key
        reload_interval: 1h
    retry_on_failure:
      initial_interval: 2s
      max_interval: 10s
      max_elapsed_time: 60s
    sending_queue:
      num_consumers: 20
      queue_size: 12000 # 200 req/s * 60s
    producer:
      max_message_bytes: 125829120 # 120MB
      compression: zstd

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [kafka]
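
The sizing comments in this configuration can be checked with a few lines of arithmetic. This is a sketch of the reasoning in the comments only: it assumes, as the comments do, that a batch holds at most 30 payloads of up to 4 MiB each. The batch processor's `send_batch_size` actually counts spans rather than received messages, so treat the result as a rough bound. The function names are hypothetical.

```python
MIB = 1024 * 1024


def max_batch_bytes(max_recv_msg_size_mib, send_batch_max_size):
    """Worst-case batch size if every batched item were a full-size payload."""
    return max_recv_msg_size_mib * MIB * send_batch_max_size


def queue_size_for(requests_per_second, buffer_seconds):
    """Number of queued requests needed to absorb an outage of the given length."""
    return requests_per_second * buffer_seconds


batch_bytes = max_batch_bytes(4, 30)
print(batch_bytes)               # 125829120, the producer max_message_bytes value
print(batch_bytes <= 128 * MIB)  # True: below the 128MB Kafka limit mentioned above
print(queue_size_for(200, 60))   # 12000, the sending_queue queue_size value
```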

Log output

No response

Additional context

No response

github-actions[bot] commented 8 months ago

Pinging code owners for exporter/kafka: @pavolloffay @MovieStoreGuy. See Adding Labels via Comments if you do not have permissions to add labels yourself.

MarcinGinszt commented 7 months ago

We continue to encounter this issue once or twice daily. It happens only in our environment with the highest traffic. It is not a matter of resource consumption: resource usage is around 50% of the Kubernetes limits. We have three collector pods running, and the issue can affect any number of them (1, 2 or 3) simultaneously; for example, two pods stop producing at the exact same moment while the third works as usual. Restarting the pods fixes the situation.

There is nothing in the logs, even at debug level.

We analyzed the pprof profiles and goroutine graphs for a working and a broken collector. The broken collector is not running the sarama producer.

I'm attaching the profile and goroutine graphs here, for the OK and the broken collector, for comparison:

OK profile: [image: ok]
Broken profile: [image: broken]

OK goroutines: [image: ok-goroutine]
Broken goroutines: [image: broken-goroutine]
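
The same comparison can be made on text goroutine dumps instead of graphs. As a minimal sketch, assuming the dump is in Go's plain-text format (blocks that start with `goroutine N [state]:` and are separated by blank lines, as served by `/debug/pprof/goroutine?debug=2`), the hypothetical helper below returns the goroutines whose stack mentions a given substring such as `sarama`. The sample dump is illustrative and was not taken from the affected pods.

```python
def goroutines_matching(dump_text, needle):
    """Return the goroutine blocks whose stack trace contains `needle`."""
    blocks = [
        block for block in dump_text.strip().split("\n\n")
        if block.startswith("goroutine ")
    ]
    return [block for block in blocks if needle in block]


# Illustrative dump only; frame names and paths are placeholders.
SAMPLE_DUMP = """goroutine 1 [running]:
main.main()
\t/app/main.go:10 +0x1d

goroutine 42 [select]:
github.com/IBM/sarama.(*asyncProducer).dispatcher(0xc000123456)
\t/go/pkg/mod/sarama/async_producer.go:100 +0x2a
"""

producer_goroutines = goroutines_matching(SAMPLE_DUMP, "sarama")
print(len(producer_goroutines))  # 1
```

A count of zero on a broken pod, against a non-zero count on a healthy one, would match what the graphs show.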

MarcinGinszt commented 7 months ago

EDIT: this doesn't happen for every broken pod; most of them don't record any error.

We found an error via the debug/tracez endpoint: [image]

It looks like the publishing process silently terminates because of a "context deadline exceeded" error in opentelemetry.proto.collector.trace.v1.traceservice/export.

github-actions[bot] commented 5 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

rkargMsft commented 4 months ago

We're encountering this same issue using:

otel/opentelemetry-collector-k8s:0.102.1

No errors logged other than the send queue being full.

rkargMsft commented 4 months ago

This may be related to https://github.com/open-telemetry/opentelemetry-collector/pull/10315

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 week ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.