open-telemetry / otel-arrow

Protocol and libraries for sending and receiving OpenTelemetry data using Apache Arrow
Apache License 2.0

stream terminated by RST_STREAM with error code: PROTOCOL_ERROR #268

Closed. igorestevanjasinski closed this issue 2 weeks ago

igorestevanjasinski commented 3 weeks ago

I'm using otel-arrow to send logs from EKS to on-premises via an ingress. I have one collector running as an agent in EKS with the otelarrow exporter, and another collector running as a gateway on-premises with the otelarrow receiver. The logs are being sent, but I'm getting a lot of errors on the first collector.

Error messages:

```
2024-10-29T21:23:55.068Z error arrow/stream.go:154 arrow stream error {"kind": "exporter", "data_type": "logs", "name": "otelarrow", "code": 14, "message": "closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR", "where": "reader"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter/internal/arrow.(*Stream).logStreamError
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter@v0.110.0/internal/arrow/stream.go:154
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter/internal/arrow.(*Stream).run
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter@v0.110.0/internal/arrow/stream.go:223
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter/internal/arrow.(*Exporter).runArrowStream
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter@v0.110.0/internal/arrow/exporter.go:250

2024-10-29T21:24:30.146Z error arrow/stream.go:154 arrow stream error {"kind": "exporter", "data_type": "logs", "name": "otelarrow", "code": 13, "message": "stream terminated by RST_STREAM with error code: PROTOCOL_ERROR", "where": "reader"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter/internal/arrow.(*Stream).logStreamError
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter@v0.110.0/internal/arrow/stream.go:154
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter/internal/arrow.(*Stream).run
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter@v0.110.0/internal/arrow/stream.go:223
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter/internal/arrow.(*Exporter).runArrowStream
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/otelarrowexporter@v0.110.0/internal/arrow/exporter.go:250
```

Agent collector config:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    send_batch_size: 2000
    send_batch_max_size: 2500

exporters:
  debug:
    verbosity: detailed
  otelarrow:
    endpoint: https://otel-collector-logs-arrow.dev.sicredi.cloud
    tls:
      insecure: true
    wait_for_ready: true

extensions:
  zpages:
  health_check:

service:
  extensions: [zpages, health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otelarrow, debug]
```

Gateway collector config:

```yaml
receivers:
  otelarrow:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4320
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
  memory_limiter:
    limit_percentage: 80
    spike_limit_percentage: 25
    check_interval: 15s

exporters:
  debug:
    verbosity: detailed

  loki:
    endpoint: "https://grafana-loki.dev.sicredi.cloud/loki/api/v1/push"
    headers:
    tls:
      insecure: true
    default_labels_enabled:
      exporter: true
      job: true

extensions:
  zpages:
  health_check:

service:
  extensions: [zpages, health_check]
  pipelines:
    logs:
      receivers: [otelarrow, otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, loki]
```

Any tips on how to fix this, or maybe on how to improve my config?

lquerel commented 3 weeks ago

@jmacd any idea regarding this issue?

jmacd commented 3 weeks ago

Is the endpoint https://otel-collector-logs-arrow.dev.sicredi.cloud backed by an HTTP/2 load balancer?

I've seen this error in development, and usually an intermediate proxy is responsible for breaking the connection. I had a dialog with the gRPC-Go team about this, because it seems there is no way for a streaming client to be notified "gracefully" when the connection is broken. The load balancer might have a 10-minute connection limit, for example.
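Since your data path goes through an ingress, it is worth checking the proxy's stream/connection timeouts. Purely as an illustration, and assuming an ingress-nginx controller (the thread does not say which controller or values are actually in use), the relevant knobs would look something like this:

```yaml
# Hypothetical ingress-nginx manifest; the ingress name and timeout
# values here are placeholders, not taken from this deployment.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: otel-collector-logs-arrow
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"    # keep HTTP/2 gRPC end to end
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"  # seconds before a long-lived stream is cut
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```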

The best resolution we could find was to add a voluntary lifetime parameter to the client, letting it recycle connections before a load balancer abruptly disconnects them. The setting is otelarrow::arrow::max_stream_lifetime. In the version you are running, the default for this setting was much too large, and I believe you are seeing the consequences.

In https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35478, we adjusted the default max_stream_lifetime to 30 seconds (down from 1 hour!) based on experimental results. I believe you could just upgrade to the newest release, or you could set this field to eliminate the error.
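For reference, a minimal sketch of setting the field explicitly on the agent's otelarrow exporter; the 30s value mirrors the new default mentioned above and is a starting point rather than a tuned value:

```yaml
exporters:
  otelarrow:
    endpoint: https://otel-collector-logs-arrow.dev.sicredi.cloud
    tls:
      insecure: true
    wait_for_ready: true
    arrow:
      # Recycle each stream before the load balancer can break it abruptly.
      max_stream_lifetime: 30s
```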

I'll be happy to help if this advice doesn't resolve the problem. It will help to know the details of the load balancer. Thank you!

igorestevanjasinski commented 3 weeks ago

I've updated my collector to v0.112.0, and the error is gone. Thank you, everyone.