open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

Kafka exporter is not able to keep up with the load being streamed to the OpenTelemetry Collector. #35208

Open Om7771 opened 2 months ago

Om7771 commented 2 months ago

Component(s)

exporter/kafka

What happened?

Description

We initiated a load of 1700 TPS towards the OpenTelemetry Collector and observed that the Kafka exporter was not able to consume the whole load and stream it to the Kafka topic at the same rate at which we were generating the load.

To debug this, we set the collector's telemetry metrics level to detailed, expecting a detailed set of internal metrics, but observed that very few metrics were shown.

The metrics shown were fewer than what we would expect even for metrics level basic.

The Kafka exporter we use comes from this repository: https://gitlabce.tools.aws.vodafone.com/IOT/dsip-opentelemetry.git

The documentation we referred to for the metrics is https://opentelemetry.io/docs/collector/internal-telemetry/; the relevant telemetry block is sketched below.
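
For reference, the telemetry block we enabled follows the shape in that documentation (it also appears in the full configuration below); among the metrics we expected to see at this level are the per-exporter queue gauges such as otelcol_exporter_queue_size and otelcol_exporter_queue_capacity:

service:
  telemetry:
    metrics:
      level: detailed   # most verbose internal-metrics level
      address: ":9404"  # endpoint exposing the collector's own metrics (Prometheus format)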

Steps to Reproduce

  1. Start a load of 1700 TPS towards the OpenTelemetry Collector.
  2. Check the lag of the topic to which the Kafka exporter is writing data.

Expected Result

  1. The Kafka exporter consumes the load streamed to the OpenTelemetry Collector without any errors.
  2. The offsets of the topic confirm that the Kafka exporter writes data to the topic at the same speed at which data is pushed to the OpenTelemetry Collector.

Actual Result

  1. The Kafka exporter was able to consume the load streamed to the OpenTelemetry Collector without any errors.
  2. From the offsets of the topic it was observed that the Kafka exporter was writing to the topic at a reduced speed relative to the rate at which data was being pushed to the OpenTelemetry Collector (see the arithmetic note below).
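
For scale, a back-of-the-envelope note (our own arithmetic, assuming roughly one span per transaction): at 1700 TPS with send_batch_size: 5000 and timeout: 0s, the batch processor emits a batch only about every 5000 / 1700 ≈ 3 seconds, and the sending queue of 2000 batches can absorb roughly 2000 × 5000 / 1700 ≈ 98 minutes of backlog before data is refused, which is consistent with seeing growing topic lag rather than errors.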

Collector version

0.96.0

Environment information

Environment

OS: alpine (Running as a containerized image on EKS)

OpenTelemetry Collector configuration

#Following is the config for Kafka exporter in our environment
receivers:
  otlp:
    protocols:
      grpc:
        auth:
          authenticator: basicauth/server
        tls:
          cert_file: ***
          key_file: ***
          ca_file: ***
      http:
        auth:
          authenticator: basicauth/server
        tls:
          cert_file: ***
          key_file: ***
          ca_file: ***
        # TODO - CORS is not configured yet
exporters:
  debug:
    verbosity: detailed
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200
  kafka:
    brokers: [ '*****' ]
    topic: ****
    auth:
      sasl:
        username: ****
        password: ***
        mechanism: SCRAM-SHA-512
      # TODO - appropriate certs must be set
      tls:
        insecure: true
    encoding: otlp_json
    protocol_version: 2.6.2
    metadata:
      retry:
        max: 3
        backoff: 250ms
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 120s
    sending_queue:
      enabled: true
      num_consumers: 20   # parallel workers draining the queue to the Kafka producer
      queue_size: 2000    # capacity in queued requests (batches), not individual spans
    producer:
      # Set to 5 MB; compression should be enough to keep messages below 1 MB (Kafka's default limit)
      max_message_bytes: 5000000
      required_acks: 1        # wait only for the leader broker's ack
      compression: 'lz4'
      flush_max_messages: 0   # 0 = no cap on messages per producer flush

processors:
  batch:
    send_batch_size: 5000      # flush a batch once it reaches 5000 items
    send_batch_max_size: 8000  # upper bound; larger batches are split
    timeout: 0s                # no time-based flush; batches wait for the size threshold

extensions:
  basicauth/server:
    htpasswd:
      file: ***
  health_check:
    path: "/health"
    tls:
      cert_file: ***
      key_file: ***
      ca_file: ***

service:
  telemetry:
    logs:
      level: "debug"
    metrics:
      level: detailed
      address: ":9404"
  extensions: [basicauth/server, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [kafka, debug]

Log output

No response

Additional context

No response

github-actions[bot] commented 2 months ago

Pinging code owners:

VihasMakwana commented 2 months ago

Thanks for filing this @Om7771.

Om7771 commented 2 months ago

@VihasMakwana Thanks a lot for the quick response.

  1. I cannot confirm whether this is a regression, as we have been using version 0.96.0 from the beginning.
  2. There are no error logs printed. The only logs generated were the traces printed on stdout.

Regarding exporting to Kafka using SyncProducer: we have set exporter.kafka.producer.required_acks=1. As per https://pkg.go.dev/github.com/Shopify/sarama#RequiredAcks, this means the exporter waits only for the leader broker to acknowledge each message. When we set this to 0 (unreliable delivery), producing effectively becomes asynchronous. Do you mean that mode? The mapping is summarized below.
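
For reference, required_acks maps onto sarama's RequiredAcks constants as follows (our summary of the linked godoc, expressed against the exporter's producer block):

producer:
  # required_acks mirrors sarama's RequiredAcks:
  #    0 -> NoResponse: fire-and-forget; highest throughput, no delivery guarantee
  #    1 -> WaitForLocal: wait only for the leader's local commit (our current setting)
  #   -1 -> WaitForAll: wait for all in-sync replicas; safest but slowest
  required_acks: 1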