verejoel opened this issue 3 months ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I have tried a few different configurations, all with the same outcome -> telemetry is blocked if one endpoint is down. Let me flesh out our environment.
All of our logs get shipped to Loki by default, but some of the data needs to also be shipped to additional endpoints (Kafka topics and Azure Event Hubs). Call our tenants `foo`, `bar`, and `baz`, and our endpoints `loki`, `kafka/foo`, `kafka/bar`, `kafka/baz`, and `kafka/eventhub`. Note that each "endpoint" is actually an OTel collector.
Our collector setup is the following:
ingress gateway -> router -> backend specific collectors
So all our logs are shipped to one endpoint, then forwarded to a routing stage, before being split off into backend specific collectors.
The problem we have is that the unavailability of any single one of our backend collectors blocks the entire telemetry pipeline.
I have tried a few different concepts for the routing stage: routing to tenant-specific pipelines, and routing to backend-specific pipelines. Example configs for each case are below:
```yaml
# backend-specific routing
connectors:
  routing:
    default_pipelines:
      - logs/loki
    error_mode: ignore
    table:
      - pipelines:
          - logs/eventhub
          - logs/kafka/foo
          - logs/loki
        statement: route() where attributes["service_component"] == "foo"
      - pipelines:
          - logs/eventhub
          - logs/kafka/bar
          - logs/loki
        statement: route() where attributes["service_component"] == "bar"
      - pipelines:
          - logs/eventhub
          - logs/kafka/baz
          - logs/loki
        statement: route() where attributes["service_component"] == "baz"
exporters:
  otlp/eventhub:
    endpoint: otel-eventhub-distributor-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/foo:
    endpoint: otel-kafka-foo-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/bar:
    endpoint: otel-kafka-bar-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/baz:
    endpoint: otel-kafka-baz-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/loki:
    endpoint: otel-backend-loki-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  pipelines:
    logs/incoming:
      exporters:
        - routing
      processors:
        - memory_limiter
      receivers:
        - otlp
    logs/eventhub:
      exporters:
        - otlp/eventhub
      receivers:
        - routing
    logs/kafka/foo:
      exporters:
        - otlp/kafka/foo
      receivers:
        - routing
    logs/kafka/bar:
      exporters:
        - otlp/kafka/bar
      receivers:
        - routing
    logs/kafka/baz:
      exporters:
        - otlp/kafka/baz
      receivers:
        - routing
    logs/loki:
      exporters:
        - otlp/loki
      receivers:
        - routing
```
And tenant-specific routing:
```yaml
# tenant-specific routing
connectors:
  routing:
    default_pipelines:
      - logs/default
    error_mode: ignore
    table:
      - pipelines:
          - logs/foo
        statement: route() where attributes["service_component"] == "foo"
      - pipelines:
          - logs/bar
        statement: route() where attributes["service_component"] == "bar"
      - pipelines:
          - logs/baz
        statement: route() where attributes["service_component"] == "baz"
exporters:
  otlp/eventhub:
    endpoint: otel-eventhub-distributor-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/foo:
    endpoint: otel-kafka-foo-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/bar:
    endpoint: otel-kafka-bar-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/kafka/baz:
    endpoint: otel-kafka-baz-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
  otlp/loki:
    endpoint: otel-backend-loki-collector:4317
    sending_queue:
      enabled: false
    tls:
      insecure: true
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  pipelines:
    logs/incoming:
      exporters:
        - routing
      processors:
        - memory_limiter
      receivers:
        - otlp
    logs/foo:
      exporters:
        - otlp/eventhub
        - otlp/kafka/foo
        - otlp/loki
      receivers:
        - routing
    logs/bar:
      exporters:
        - otlp/eventhub
        - otlp/kafka/bar
        - otlp/loki
      receivers:
        - routing
    logs/baz:
      exporters:
        - otlp/eventhub
        - otlp/kafka/baz
        - otlp/loki
      receivers:
        - routing
    logs/default:
      exporters:
        - otlp/loki
      receivers:
        - routing
```
Both setups are vulnerable: if any single one of the `otlp` exporters cannot ship data, delivery through all of the others is blocked as well.
As suggested by @jpkrohling I tested using the forward connector and filtering. This didn't work either (same behaviour, one dead pipeline kills them all).
I think it’s quite hard to decouple pipelines in the collector; the coupling seems to be baked in at a very low level…
The fanout consumer seems to be used whenever a receiver or exporter is shared across multiple pipelines, and it runs synchronously, so one failure blocks everything. I think this is the root cause; it’s not actually something specific to the routing connector.
Can we use the exporter helper to specify whether a given exporter should be considered "blocking" or not? Or would making the fanoutconsumer asynchronous help?
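For what it's worth, the exporter helper does already expose per-exporter queue and retry settings. A minimal sketch (these are the standard `exporterhelper` options; the values are placeholders, not tested recommendations) of enabling the queue on one of the exporters above:

```yaml
exporters:
  otlp/kafka/foo:
    endpoint: otel-kafka-foo-collector:4317
    tls:
      insecure: true
    # Enable the exporterhelper queue so exports run asynchronously:
    # the fan-out returns as soon as the batch is enqueued, and a down
    # endpoint only backs up (and eventually drops from) its own queue.
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    # Bound how long a failing export is retried before the data is
    # dropped, so the queue drains instead of growing indefinitely.
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```

The trade-off is isolation at the cost of delivery guarantees: once `max_elapsed_time` is exceeded or the queue fills, data for the unhealthy endpoint is dropped rather than blocking the other pipelines.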
Having thought some more about this, here's where I am:
So I think the work here needs to happen in the exporter helper, and we need to optionally shard retries / sending queue by incoming context.
> So I think the work here needs to happen in the exporter helper
I think that's what we are going towards. See https://github.com/open-telemetry/opentelemetry-collector/issues/8122
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Component(s)
connector/routing
What happened?
Description
We have a use case where we want to route telemetry to different collectors based on a resource attribute. For this, we use the `routing` connector. We have observed that if one of the endpoints is unavailable, the entire pipeline will be blocked.
Steps to Reproduce
A very minimal example:
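Something along these lines (a sketch of such a setup; the `tenant` attribute and endpoint addresses are illustrative placeholders, and only the `otlp/foo` and `otlp/bar` names come from the description below):

```yaml
connectors:
  routing:
    error_mode: ignore
    table:
      # "tenant" is a hypothetical resource attribute, for illustration only
      - pipelines: [logs/foo]
        statement: route() where attributes["tenant"] == "foo"
      - pipelines: [logs/bar]
        statement: route() where attributes["tenant"] == "bar"
exporters:
  otlp/foo:
    endpoint: foo-collector:4317  # placeholder endpoint
    tls:
      insecure: true
  otlp/bar:
    endpoint: bar-collector:4317  # placeholder endpoint
    tls:
      insecure: true
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  pipelines:
    logs/incoming:
      receivers: [otlp]
      exporters: [routing]
    logs/foo:
      receivers: [routing]
      exporters: [otlp/foo]
    logs/bar:
      receivers: [routing]
      exporters: [otlp/bar]
```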
If either the `otlp/foo` or `otlp/bar` endpoint is down, no data will be received on the other endpoint. Effectively, one endpoint outage can cause the entire pipeline to go dark.
Expected Result
I would expect that the `routing` connector should forward data to all healthy pipelines, and not block all routing in case of one unhealthy pipeline.
Actual Result
A single unhealthy pipeline blocks delivery of all telemetry.
Collector version
0.95.0 (custom build)
Environment information
Environment
Kubernetes 1.28
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response