open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Spans keep getting dropped after Kafka Nodes are rotated. #12208

Closed. Harnoor-se7en closed this issue 1 year ago.

Harnoor-se7en commented 2 years ago

Hi Folks.

Describe the bug
We are using the Kafka exporter in the OTel Collector. We have observed that whenever the Kafka nodes are rotated, the collector keeps dropping spans even after the rotation has completed successfully. We have to restart the collector pods, after which spans are processed successfully again. The logs we see at the same time are:

    Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "name": "kafka", "dropped_items": 12}
    batchprocessor/batch_processor.go:185 Sender failed {"kind": "processor", "name": "batch", "pipeline": "traces/2", "error": "sending_queue is full"}

(Screenshot 2022-07-09 at 11:06 PM: graph of the metric sum(rate(otelcol_receiver_refused_spans[5m])))
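
For reference, the queue mentioned in that log is the exporter's sending_queue (the generic exporterhelper queue). A rough sketch of how its size could be raised on the kafka exporter; the values below are illustrative, not our actual config:

    exporters:
      kafka:
        # exporterhelper queue settings shared by most exporters (values are examples only)
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 10000  # raise this if the broker rotation outlasts the default queue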

Steps to reproduce: Rotate the Kafka nodes during ingestion of spans.

What did you expect to see? We expect the OTel Collector to successfully reconnect to the Kafka brokers after they are rotated, and if it cannot connect, to print relevant logs. We don't want to have to restart the collector every time this happens.

What did you see instead? The OTel Collector is unable to ingest spans and keeps refusing them even after the Kafka nodes have been rotated successfully.

What version did you use? otel/opentelemetry-collector-contrib:0.48.0

What config did you use?

otel-collector-config: |
    receivers:
      jaeger:
        protocols:
          grpc:
          thrift_http:
      zipkin:
    processors:
      batch:
      memory_limiter:
        limit_mib: 5000
        # 25% of limit up to 2G
        spike_limit_mib: 1048
        check_interval: 1s
    extensions:
      health_check: {}
      zpages: {}
      memory_ballast:
        size_mib: 2048
    exporters:
      kafka:
        brokers:
        {{- range $endpoints := .Values.kafka_endpoint_list }}
          - {{ $endpoints.item }}
        {{- end }}
        protocol_version: {{ default "2.6.0" .Values.kafka_protocol_version}}
        topic: {{ .Values.kafka_topic}}
        encoding: {{ default "jaeger_proto" .Values.kafka_encoding}}
        auth:
          tls:
            ca_file: {{  }}
            cert_file: {{ }}
            key_file: {{  }}
    service:
      extensions: [health_check, zpages]

Environment OS: Linux

Additional context We know that while Kafka nodes are being rotated, the sending queue will fill up quickly if an appropriate size is not configured. But at the very least there should be relevant logs when the OTel Collector is unable to establish a connection with the Kafka brokers. It would also help to know more about the collector's retry mechanism while Kafka nodes are rotated; see the sketch below. NOTE: by 'rotating brokers' I mean a rolling deployment of the brokers' K8s pods.
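
As far as I understand, the kafka exporter also honours the collector's generic retry_on_failure settings, and once max_elapsed_time is exhausted the data is dropped, which would match the behaviour above. A hedged sketch with illustrative values:

    exporters:
      kafka:
        retry_on_failure:
          enabled: true
          initial_interval: 5s    # first backoff after a failed send
          max_interval: 30s       # cap on the backoff between retries
          max_elapsed_time: 300s  # give up (and drop the batch) after this long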

jkowall commented 2 years ago

What is "rotated" are you taking them down?

Harnoor-se7en commented 2 years ago

What is "rotated" are you taking them down?

By 'rotating' I meant 'rolling deployment of K8s pods'. will add this in the description too.

github-actions[bot] commented 2 years ago

Pinging code owners: @pavolloffay @MovieStoreGuy. See Adding Labels via Comments if you do not have permissions to add labels yourself.

pavolloffay commented 2 years ago

How long does the rotation last?

Did you try to experiment with https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/kafkaexporter/config.go#L65?

Do you see any kafka-related logs in the collector?
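
For context, that part of the config appears to cover the exporter's Kafka metadata retry settings, which control how long the client keeps retrying metadata fetches while brokers are unreachable. An illustrative example (the values shown are, I believe, the defaults):

    exporters:
      kafka:
        metadata:
          full: true
          retry:
            max: 3          # number of metadata retries before the exporter errors out
            backoff: 250ms  # wait between metadata retries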

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.