Open jsirianni opened 2 months ago
Pinging code owners:
exporter/clickhouse: @hanjm @dmitryax @Frapschen @SpencerTorres
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@dmitryax do you have any thoughts? I can provide a contributor for the actual change.
Apologies for the late response, it appears my comment didn't get posted:
When the exporter starts it tries to create some tables that are required for running. If this initial step fails then it makes sense the exporter would exit.
I agree it should retry before failing, but also consider that this may lead to dropped data if the exporter never connects.
Other exporters will retry until they are successful by utilizing the collector's retry and sending queue ability. I do see how this exporter is a bit more complicated due to the table creation requirements.
This could cause issues in a scenario where the collector is configured with multiple exporters. A Clickhouse outage could mean a collector outage, even when the other pipelines were functional.
Good point, I wouldn't want to affect the other pipelines. I'll see if I can add some retry logic 👍
Component(s)
exporter/clickhouse
What happened?
Collector fails to start when Clickhouse is unavailable.
Description
If Clickhouse is unavailable during collector startup, the exporter will log an error and exit. This causes an issue where other components will also fail to run because the collector has exited.
In a distributed system, Clickhouse may be briefly unavailable during collector startup. You could rely on the system (systemd, Docker, Kubernetes) to restart the collector. However, some systems, such as OpAMP, will roll the collector back to its previous configuration instead of restarting it repeatedly until it works.
Steps to Reproduce
1. Create config.yaml using the configuration attached to this issue.
2. Build the collector main branch with make otelcontribcol.
3. Run the collector. Example:
4. Observe that the collector process exits with code 1.
Expected Result
Due to the nature of distributed systems, the exporter should not depend on Clickhouse being available. It should log the issue and retry the connection with a backoff.
I think having a new option enabling this behavior would be reasonable.
Actual Result
The collector logs an error and exits with code 1.
Collector version
main
8db9320e7b
Environment information
Environment
OS: All
Compiler (if manually compiled): Any, Go 1.23
OpenTelemetry Collector configuration
Log output
Additional context
The OTLP exporter behaves the way I am describing. It does not fail startup; instead, it starts queueing and retrying.