open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

OpenTelemetry Collector causes log duplication in Fluent Bit over TLS due to timeout errors #11212

Open Morgan-Li opened 1 day ago

Morgan-Li commented 1 day ago

Describe the bug

When sending logs from the OpenTelemetry Collector to Fluent Bit over TLS, the logs initially flow without issues: Fluent Bit receives them and writes them to standard output. After a brief period, however, the following error appears repeatedly in the Collector logs until the retries are exhausted and the data is dropped:

2024-09-19T20:05:57.369Z info exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlphttp", "error": "failed to make an HTTP request: Post \"https://fluent-bit.morgan-certs.svc.cluster.local:4318/v1/logs\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "23.256322892s"}
2024-09-19T20:06:50.628Z info exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlphttp", "error": "failed to make an HTTP request: Post \"https://fluent-bit.morgan-certs.svc.cluster.local:4318/v1/logs\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "42.008956141s"}
2024-09-19T20:08:02.640Z error exporterhelper/queue_sender.go:92 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "logs", "name": "otlphttp", "error": "no more retries left: failed to make an HTTP request: Post \"https://fluent-bit.morgan-certs.svc.cluster.local:4318/v1/logs\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "dropped_items": 10}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
go.opentelemetry.io/collector/exporter@v0.109.0/exporterhelper/queue_sender.go:92
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
go.opentelemetry.io/collector/exporter@v0.109.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
go.opentelemetry.io/collector/exporter@v0.109.0/internal/queue/consumers.go:43

It seems like these retries are resending the same logs to Fluent Bit, as duplicated logs appear in Fluent Bit’s stdout. This issue only occurs when using HTTPS/TLS. The same configuration works without any errors when using plain HTTP.

Steps to reproduce

  1. Set up Fluent Bit with the opentelemetry input plugin and enable TLS. fluent-bit.conf:

    [SERVICE]
        # Flush
        # =====
        # set an interval of seconds before to flush records to a destination
        flush        3
    
        # Daemon
        # ======
        # instruct Fluent Bit to run in foreground or background mode.
        daemon       Off
    
        # Log_Level
        # =========
        # Set the verbosity level of the service, values can be:
        #
        # - error
        # - warning
        # - info
        # - debug
        # - trace
        #
        # by default 'info' is set, that means it includes 'error' and 'warning'.
        log_level    info
    
        # Parsers File
        # ============
        # specify an optional 'Parsers' configuration file
        parsers_file parsers.conf
    
        # Plugins File
        # ============
        # specify an optional 'Plugins' configuration file to load external plugins.
        plugins_file plugins.conf
    
        # HTTP Server
        # ===========
        # Enable/Disable the built-in HTTP Server for metrics
        http_server  Off
        http_listen  0.0.0.0
        http_port    2020
    
        # Storage
        # =======
        # Fluent Bit can use memory and filesystem buffering based mechanisms
        #
        # - https://docs.fluentbit.io/manual/administration/buffering-and-storage
        #
        # storage metrics
        # ---------------
        # publish storage pipeline metrics in '/api/v1/storage'. The metrics are
        # exported only if the 'http_server' option is enabled.
        #
        storage.metrics on
    
        # storage.path
        # ------------
        # absolute file system path to store filesystem data buffers (chunks).
        #
        # storage.path /tmp/storage
    
        # storage.sync
        # ------------
        # configure the synchronization mode used to store the data into the
        # filesystem. It can take the values normal or full.
        #
        # storage.sync normal
    
        # storage.checksum
        # ----------------
        # enable the data integrity check when writing and reading data from the
        # filesystem. The storage layer uses the CRC32 algorithm.
        #
        # storage.checksum off
    
        # storage.backlog.mem_limit
        # -------------------------
        # if storage.path is set, Fluent Bit will look for data chunks that were
        # not delivered and are still in the storage layer, these are called
        # backlog data. This option configure a hint of maximum value of memory
        # to use when processing these records.
        #
        # storage.backlog.mem_limit 5M
    
    [INPUT]
        Name          opentelemetry
        Listen        0.0.0.0
        Port          4318
        tls           On
        tls.verify    On
        tls.ca_file   /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        tls.crt_file  /fluent-bit/tls/tls.crt
        tls.key_file  /fluent-bit/tls/tls.key
    [OUTPUT]
        Name   stdout
        Match  *
  2. Configure OpenTelemetry Collector to send logs over HTTPS to Fluent Bit using the OTLP HTTP exporter. config.yaml:

    extensions:
    
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            tls:
              cert_file: /opt/certs/tls.crt
              key_file: /opt/certs/tls.key
              ca_file: /opt/certs/ca.crt
    
    processors:
      batch:
    
    exporters:
      otlphttp:
        endpoint: https://fluent-bit.morgan-certs.svc.cluster.local:4318
        tls:
          insecure: false
          insecure_skip_verify: true
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    
      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200
    
    service:
      extensions: []
      pipelines:
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug, otlphttp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]
  3. Send logs from Open Liberty to the OpenTelemetry Collector, using the configuration files below:

    jvm.options: 
    -javaagent:/opt/ol/wlp/lib/opentelemetry-javaagent.jar
    server.env: 
    OTEL_SERVICE_NAME=testCustomOtel
    OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.morgan-certs.svc.cluster.local:4317
    OTEL_EXPORTER_OTLP_PROTOCOL=grpc
    OTEL_TRACES_EXPORTER=none
    OTEL_METRICS_EXPORTER=none
    OTEL_EXPORTER_OTLP_CERTIFICATE=/opt/certs/ca.crt
    OTEL_METRIC_EXPORT_INTERVAL=10000
    server.xml: 
    <?xml version="1.0" encoding="UTF-8"?>
    <server description="new server">
        <!-- Enable features -->
        <featureManager>
            <feature>mpMetrics-3.0</feature>
            <feature>mpHealth-3.0</feature>
        </featureManager>
        <mpMetrics authentication="false" />
        <httpEndpoint id="defaultHttpEndpoint"
                      httpPort="9080"
                      httpsPort="9443"
                      host="*"/>
        <ssl id="defaultSSLConfig" trustDefaultCerts="true" />
    </server>

What did you expect to see?

Logs should be exported from the OpenTelemetry Collector to Fluent Bit without any "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" HTTP errors and without duplicate logs being sent, even when using TLS.

What did you see instead?

The following error repeatedly appears in the OpenTelemetry Collector logs:

exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlphttp", "error": "failed to make an HTTP request: Post \"https://fluent-bit.morgan-certs.svc.cluster.local:4318/v1/logs\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

As a result, the same logs are resent multiple times to Fluent Bit, leading to duplicates in Fluent Bit’s stdout.

What version did you use?

OpenTelemetry Collector v0.105.0, Fluent Bit v3.1.8

What config did you use?

    extensions:

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            tls:
              cert_file: /opt/certs/tls.crt
              key_file: /opt/certs/tls.key
              ca_file: /opt/certs/ca.crt

    processors:
      batch:

    exporters:
      otlphttp:
        endpoint: https://fluent-bit.morgan-certs.svc.cluster.local:4318
        tls:
          insecure: false
          insecure_skip_verify: true
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

      debug:
        verbosity: detailed
        sampling_initial: 5
        sampling_thereafter: 200

    service:
      extensions: []
      pipelines:
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug, otlphttp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]

Environment

OpenShift (Kubernetes-based) v4.16.7

Additional context

The OpenTelemetry Collector continues functioning after this error. The source of the logs is an Open Liberty web server.

Morgan-Li commented 2 hours ago

I was able to send logs to Fluent Bit without errors by turning HTTP/2 off on the Fluent Bit side (http2 Off in the opentelemetry input). I suspect a mismatch between the HTTP versions used by the Collector and Fluent Bit, since everything works once the connection falls back to HTTP/1.1:

    [INPUT]
        Name          opentelemetry
        Listen        0.0.0.0
        Port          4318
        tls           On
        tls.verify    On
        tls.ca_file   /certs/fluent/tls/ca.crt
        tls.crt_file  /certs/fluent/tls/tls.crt
        tls.key_file  /certs/fluent/tls/tls.key
        tls.debug     4
        http2         Off
    [OUTPUT]
        Name   stdout
        Match  *
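To check the HTTP version theory from the client side, a small Go program can show which protocol a TLS connection ends up using. The sketch below is only an illustration using the standard library: it talks to a local test server that offers HTTP/2, not to Fluent Bit, and the function name negotiatedProto is made up. It relies on the documented way to disable HTTP/2 on a Go http.Transport, which is setting TLSNextProto to a non-nil, empty map.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// negotiatedProto starts a local TLS test server that offers HTTP/2 and
// returns the protocol the client actually used for one GET request.
func negotiatedProto(disableHTTP2 bool) (string, error) {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			w.WriteHeader(http.StatusOK)
		}))
	srv.EnableHTTP2 = true
	srv.StartTLS()
	defer srv.Close()

	client := srv.Client() // preconfigured to trust the test certificate
	if disableHTTP2 {
		tr := client.Transport.(*http.Transport)
		tr.ForceAttemptHTTP2 = false
		// A non-nil, empty map disables HTTP/2 on this transport.
		tr.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{}
	}

	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.Proto, nil
}

func main() {
	for _, disable := range []bool{false, true} {
		proto, err := negotiatedProto(disable)
		fmt.Printf("HTTP/2 disabled on client: %v, negotiated: %q, err: %v\n",
			disable, proto, err)
	}
}
```

Pointing a client like this at the real Fluent Bit endpoint, with its CA added to the TLS configuration, could help confirm whether the timeouts only occur when HTTP/2 is negotiated.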