open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[connector/routing] Outage of one endpoint blocks entire pipeline #31775

Open verejoel opened 3 months ago

verejoel commented 3 months ago

Component(s)

connector/routing

What happened?

Description

We have a use case where we want to route telemetry to different collectors based on a resource attribute. For this, we use the routing connector. We have observed that if one of the endpoints is unavailable, the entire pipeline will be blocked.

Steps to Reproduce

A very minimal example:

    connectors:
      routing:
        default_pipelines:
        - logs/foo
        error_mode: ignore
        table:
        - pipelines:
          - logs/bar
          statement: route() where attributes["service_component"] == "bar"

    exporters:
      otlp/foo:
        endpoint: otel-foo-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/bar:
        endpoint: otel-bar-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true

    service:
      pipelines:
        logs/foo:
          exporters:
          - otlp/foo
          receivers:
          - routing

        logs/bar:
          exporters:
          - otlp/bar
          receivers:
          - routing

If either the otlp/foo or otlp/bar endpoint is down, no data is received on the other endpoint. Effectively, a single endpoint outage can cause the entire pipeline to go dark.
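
For comparison (not tested here), the same exporter with the exporter helper's sending queue and retries enabled would look roughly like the sketch below; the queue sizes and intervals are illustrative, not values from this setup. With the queue enabled, the exporter hands data to an in-memory queue and retries asynchronously instead of failing the synchronous call coming from the routing connector, so a down endpoint is less likely to immediately back-pressure the shared pipeline (at least until the queue fills):

    exporters:
      otlp/foo:
        endpoint: otel-foo-collector:4317
        tls:
          insecure: true
        sending_queue:
          enabled: true        # buffer and export asynchronously instead of blocking the caller
          num_consumers: 10    # illustrative values, not taken from this issue
          queue_size: 1000
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s   # stop retrying a batch after 5 minutes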

Expected Result

I would expect the routing connector to forward data to all healthy pipelines rather than block all routing when a single pipeline is unhealthy.

Actual Result

A single unhealthy pipeline blocks delivery of all telemetry.

Collector version

0.95.0 (custom build)

Environment information

Environment

Kubernetes 1.28

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

github-actions[bot] commented 3 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

verejoel commented 3 months ago

I have tried a few different configurations, all with the same outcome: telemetry is blocked if one endpoint is down. Let me flesh out our environment.

All of our logs get shipped to Loki by default, but some of the data needs to also be shipped to additional endpoints (Kafka topics and Azure Eventhub). Call our tenants foo, bar, and baz, and our endpoints loki, kafka/foo, kafka/bar, kafka/baz, and kafka/eventhub. Note that each "endpoint" is actually an OTel collector.

Our collector setup is the following:

ingress gateway -> router -> backend specific collectors

So all our logs are shipped to one endpoint, then forwarded to a routing stage, before being split off into backend specific collectors.

The problem we have is that an outage of any single one of our backend collectors blocks the entire telemetry pipeline.

I have tried a few different concepts for the routing stage. I have tried routing to tenant-specific pipelines, and routing to backend-specific pipelines. Examples of the config for each case below:

    # backend-specific routing
    connectors:
      routing:
        default_pipelines:
        - logs/loki
        error_mode: ignore
        table:
        - pipelines:
          - logs/eventhub
          - logs/kafka/foo
          - logs/loki
          statement: route() where attributes["service_component"] == "foo"
        - pipelines:
          - logs/eventhub
          - logs/kafka/bar
          - logs/loki
          statement: route() where attributes["service_component"] == "bar"
        - pipelines:
          - logs/eventhub
          - logs/kafka/baz
          - logs/loki
          statement: route() where attributes["service_component"] == "baz"
    exporters:
      otlp/eventhub:
        endpoint: otel-eventhub-distributor-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/foo:
        endpoint: otel-kafka-foo-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/bar:
        endpoint: otel-kafka-bar-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/baz:
        endpoint: otel-kafka-baz-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/loki:
        endpoint: otel-backend-loki-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    service:
      pipelines:
        logs/incoming:
          exporters:
          - routing
          processors:
          - memory_limiter
          receivers:
          - otlp
        logs/eventhub:
          exporters:
          - otlp/eventhub
          receivers:
          - routing
        logs/kafka/foo:
          exporters:
          - otlp/kafka/foo
          receivers:
          - routing
        logs/kafka/bar:
          exporters:
          - otlp/kafka/bar
          receivers:
          - routing
        logs/kafka/baz:
          exporters:
          - otlp/kafka/baz
          receivers:
          - routing
        logs/loki:
          exporters:
          - otlp/loki
          receivers:
          - routing

And tenant-specific routing:

    # tenant-specific routing
    connectors:
      routing:
        default_pipelines:
        - logs/default
        error_mode: ignore
        table:
        - pipelines:
          - logs/foo
          statement: route() where attributes["service_component"] == "foo"
        - pipelines:
          - logs/bar
          statement: route() where attributes["service_component"] == "bar"
        - pipelines:
          - logs/baz
          statement: route() where attributes["service_component"] == "baz"
    exporters:
      otlp/eventhub:
        endpoint: otel-eventhub-distributor-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/foo:
        endpoint: otel-kafka-foo-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/bar:
        endpoint: otel-kafka-bar-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/kafka/baz:
        endpoint: otel-kafka-baz-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
      otlp/loki:
        endpoint: otel-backend-loki-collector:4317
        sending_queue:
          enabled: false
        tls:
          insecure: true
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    service:
      pipelines:
        logs/incoming:
          exporters:
          - routing
          processors:
          - memory_limiter
          receivers:
          - otlp
        logs/foo:
          exporters:
          - otlp/eventhub
          - otlp/kafka/foo
          - otlp/loki
          receivers:
          - routing
        logs/bar:
          exporters:
          - otlp/eventhub
          - otlp/kafka/bar
          - otlp/loki
          receivers:
          - routing
        logs/baz:
          exporters:
          - otlp/eventhub
          - otlp/kafka/baz
          - otlp/loki
          receivers:
          - routing
        logs/default:
          exporters:
          - otlp/loki
          receivers:
          - routing

Both situations are vulnerable in case one of the otlp exporters cannot ship data.

verejoel commented 3 months ago

As suggested by @jpkrohling, I tested using the forward connector with filtering. This didn't work either (same behaviour: one dead pipeline kills them all).

I think it's quite hard to decouple pipelines in the collector; the coupling seems to be baked in at a very low level.

The fanout consumer seems to be used whenever receivers or exporters are shared across multiple pipelines, and it runs synchronously, so one failure blocks everything. I think this is the root cause; it's not actually specific to the routing connector.

Can we use the exporter helper to specify whether a given exporter should be considered "blocking" or not? Or would making the fanoutconsumer asynchronous help?
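
For reference, the exporter helper knobs I'm aware of today are the per-attempt timeout, the retry bounds, and the sending queue; none of them is a per-pipeline "blocking" flag. A rough fragment with illustrative values (with the queue disabled, as in our configs, retries run synchronously in the caller, so bounding max_elapsed_time at least limits how long a dead endpoint can hold up the fanout):

    exporters:
      otlp/kafka/foo:
        endpoint: otel-kafka-foo-collector:4317
        tls:
          insecure: true
        timeout: 10s               # per-attempt timeout in the exporter helper
        retry_on_failure:
          enabled: true
          max_elapsed_time: 60s    # bound how long one batch can block the synchronous path
        sending_queue:
          enabled: false           # as in our configs; retries therefore run in the caller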

verejoel commented 3 months ago

Having thought some more about this, here's where I am:

So I think the work here needs to happen in the exporter helper, and we need the option to shard retries and the sending queue by incoming context.

jpkrohling commented 2 months ago

So I think the work here needs to happen in the exporter helper

I think that's what we are going towards. See https://github.com/open-telemetry/opentelemetry-collector/issues/8122

github-actions[bot] commented 5 days ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.