open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[exporter/clickhouse] Exporter fails to startup if Clickhouse is unavailable #34771

Open jsirianni opened 2 months ago

jsirianni commented 2 months ago

Component(s)

exporter/clickhouse

What happened?

Collector fails to start when ClickHouse is unavailable.

Description

If ClickHouse is unavailable during collector startup, the exporter logs an error and the collector exits. This means every other component also stops running, because the whole collector process has exited.

In a distributed system, ClickHouse may be briefly unavailable during collector startup. You could rely on the surrounding system to restart the collector (systemd, Docker, Kubernetes); however, some systems, such as OpAMP, will roll the collector back to its previous configuration instead of restarting it over and over until it works.

Steps to Reproduce

Create config.yaml using the configuration attached to this issue.

Build the collector main branch with make otelcontribcol.

Run the collector. Example:

bin/otelcontribcol_darwin_arm64 --config ./config.yaml

Observe that the collector process exits with code 1.

Expected Result

Due to the nature of distributed systems, the exporter should not depend on ClickHouse being available at startup. It should log the issue and retry the connection with a backoff.

I think having a new option enabling this behavior would be reasonable.
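As a rough illustration, the new option could be an opt-in flag on the exporter config that defers the connection and table-creation check instead of failing startup. The option name below is hypothetical and not part of the current exporter configuration; it is only meant to show the shape of the change:

exporters:
  clickhouse:
    endpoint: tcp://127.0.0.1:9000?dial_timeout=10s
    database: otel
    # hypothetical option: keep retrying the initial connection and
    # table creation in the background instead of failing Start()
    retry_startup_connection: true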

Actual Result

The collector logs an error and exits 1.

Collector version

main 8db9320e7b

Environment information

Environment

OS: All
Compiler(if manually compiled): Any, Go 1.23

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  clickhouse:
    endpoint: tcp://127.0.0.1:9000?dial_timeout=10s
    database: otel
    logs_table_name: otel_logs
    traces_table_name: otel_traces
    metrics_table_name: otel_metrics
    timeout: 5s

service:
  pipelines:
    metrics:
      receivers:
        - otlp
      exporters:
        - clickhouse

Log output

2024-08-20T14:05:31.426-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:114   Setting up own telemetry...
2024-08-20T14:05:31.427-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/telemetry.go:98  Serving metrics {"address": ":8888", "metrics level": "Normal"}
2024-08-20T14:05:31.427-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:193   Starting otelcontribcol...  {"Version": "0.107.0-dev", "NumCPU": 12}
2024-08-20T14:05:31.427-0400    info    extensions/extensions.go:37 Starting extensions...
2024-08-20T14:05:31.427-0400    error   graph/graph.go:430  Failed to start component   {"error": "create database: dial tcp 127.0.0.1:9000: connect: connection refused", "type": "Exporter", "id": "clickhouse"}
2024-08-20T14:05:31.427-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:256   Starting shutdown...
2024-08-20T14:05:31.428-0400    info    extensions/extensions.go:64 Stopping extensions...
2024-08-20T14:05:31.428-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:270   Shutdown complete.
Error: cannot start pipelines: create database: dial tcp 127.0.0.1:9000: connect: connection refused
2024/08/20 14:05:32 collector server run finished with error: cannot start pipelines: create database: dial tcp 127.0.0.1:9000: connect: connection refused

Additional context

The OTLP exporter already behaves the way I am describing: it does not fail startup; instead, it starts queueing and retrying.

receivers:
  hostmetrics:
    scrapers:
      load:

exporters:
  otlp:
    endpoint: otelcol2:4317

service:
  pipelines:
    metrics:
      receivers:
        - hostmetrics
      exporters:
        - otlp
2024-08-20T14:12:12.845-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:114   Setting up own telemetry...
2024-08-20T14:12:12.845-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/telemetry.go:98  Serving metrics {"address": ":8888", "metrics level": "Normal"}
2024-08-20T14:12:12.845-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:193   Starting otelcontribcol...  {"Version": "0.107.0-dev", "NumCPU": 12}
2024-08-20T14:12:12.845-0400    info    extensions/extensions.go:37 Starting extensions...
2024-08-20T14:12:12.845-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:219   Everything is ready. Begin running and processing data.
2024-08-20T14:12:12.845-0400    info    localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.   {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-08-20T14:12:13.906-0400    info    exporterhelper/retry_sender.go:118  Exporting failed. Will retry the request after interval.    {"kind": "exporter", "data_type": "metrics", "name": "otlp", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "5.602816972s"}
^C2024-08-20T14:12:15.433-0400  info    otelcol@v0.107.1-0.20240816132030-9fd84668bb02/collector.go:318 Received signal from OS {"signal": "interrupt"}
2024-08-20T14:12:15.433-0400    info    service@v0.107.1-0.20240816132030-9fd84668bb02/service.go:256   Starting shutdown...
2024-08-20T14:12:15.434-0400    error   exporterhelper/queue_sender.go:92   Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "otlp", "error": "interrupted due to shutdown: rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "dropped_items": 3}
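The queueing and retrying shown above comes from the collector's exporterhelper settings, which the OTLP exporter exposes as retry_on_failure and sending_queue. For reference, a sketch of those knobs is below; the values are illustrative and the defaults may differ between collector versions:

exporters:
  otlp:
    endpoint: otelcol2:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 1000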
github-actions[bot] commented 2 months ago

Pinging code owners:

jsirianni commented 1 month ago

@dmitryax do you have any thoughts? I can provide a contributor for the actual change.

SpencerTorres commented 1 month ago

Apologies for the late response; it appears my comment didn't get posted:

When the exporter starts, it tries to create some tables that are required for running. If this initial step fails, it makes sense that the exporter would exit.

I agree it should retry before failing, but also consider that this may lead to dropped data if the exporter never connects.

jsirianni commented 1 month ago

Other exporters retry until they succeed by using the collector's built-in retry and sending-queue capabilities. I do see how this exporter is a bit more complicated due to the table-creation requirement.

This could cause issues when the collector is configured with multiple exporters: a ClickHouse outage could take down the whole collector, even when the other pipelines would otherwise be functional.
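To make the blast radius concrete, here is an illustrative config (not from the issue, endpoints are placeholders) with two pipelines. If the clickhouse exporter fails Start(), the whole collector exits, so the metrics pipeline that only uses the otlp exporter never runs either:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  clickhouse:
    endpoint: tcp://127.0.0.1:9000?dial_timeout=10s
    database: otel
  otlp:
    endpoint: otelcol2:4317

service:
  pipelines:
    logs:
      receivers:
        - otlp
      exporters:
        - clickhouse
    metrics:
      receivers:
        - otlp
      exporters:
        - otlp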

SpencerTorres commented 1 month ago

Good point, I wouldn't want to affect the other pipelines. I'll see if I can add some retry logic 👍
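A minimal sketch of what that retry could look like around the startup table-creation step, assuming a createDatabase-style helper (the names here are illustrative stand-ins, not the exporter's actual code), using a bounded exponential backoff:

// Sketch only: retry-with-backoff around the startup schema creation.
// createDatabase is a placeholder for whatever the exporter calls today.
package main

import (
	"context"
	"fmt"
	"time"
)

func createDatabase(ctx context.Context) error {
	// placeholder for the real "create database / create tables" step
	return fmt.Errorf("dial tcp 127.0.0.1:9000: connect: connection refused")
}

// startWithRetry keeps retrying the schema creation with exponential backoff
// until it succeeds or the context is cancelled, instead of failing on the
// first connection error.
func startWithRetry(ctx context.Context) error {
	backoff := time.Second
	const maxBackoff = 30 * time.Second
	for {
		err := createDatabase(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("clickhouse not ready, retrying in %s: %v\n", backoff, err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	if err := startWithRetry(ctx); err != nil {
		fmt.Println("giving up:", err)
	}
}

An alternative would be to kick this off in a background goroutine from Start() so the other pipelines keep running immediately, at the cost of queueing or dropping data until the tables exist.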