Closed SeanPMiller closed 2 weeks ago
As a workaround, I have switched to using the following function:
```go
func temporaryDryRunPatch(settings otelcol.CollectorSettings, ctxt context.Context) (err error) {
	factories, err := settings.Factories()
	if err != nil {
		return fmt.Errorf("failed to initialize factories: %w", err)
	}
	confProv, err := otelcol.NewConfigProvider(settings.ConfigProviderSettings)
	if err != nil {
		return fmt.Errorf("failed to instantiate configuration provider: %w", err)
	}
	defer func() {
		if shutdownErr := confProv.Shutdown(ctxt); shutdownErr != nil {
			if err != nil {
				err = fmt.Errorf("configuration-provider shutdown failed after encountering error: shutdown_err=%w, original_err=%w", shutdownErr, err)
			} else {
				err = fmt.Errorf("configuration-provider shutdown failed: %w", shutdownErr)
			}
		}
	}()
	cfg, err := confProv.Get(ctxt, factories)
	if err != nil {
		return fmt.Errorf("failed to get config: %w", err)
	}
	return cfg.Validate()
}
```
This gives me the behavior I want, so this is not an especially high-priority issue.
Thanks for reporting! Seems like this was caused by #9257 cc @djaglowski @ycombinator
I don't quite understand why it happens though; my expectation would be that until we call `Service.Start`,
no extension actually registers anything. @SeanPMiller, two questions for you:
I'll work on a demonstration program. Having a day job is a drag.
https://github.com/SeanPMiller/otelcol-issue-10031
This is essentially condensed from otelcorecol, except for the custom root command.
Specific answers:
```
2024-04-25T18:19:24.987Z info service@v0.98.0/telemetry.go:55 Setting up own telemetry...
2024-04-25T18:19:24.987Z info service@v0.98.0/telemetry.go:97 Serving metrics {"address": ":8888", "level": "Basic"}
2024-04-25T18:19:24.990Z info service@v0.98.0/telemetry.go:55 Setting up own telemetry...
2024-04-25T18:19:24.990Z info service@v0.98.0/telemetry.go:97 Serving metrics {"address": ":8888", "level": "Basic"}
```
Thanks for the repro! I see how this can fail with the Prometheus endpoint, I am still a bit puzzled with the healthcheck case.
I am, too. I'll see if I can reproduce, but it only seems to happen with a sufficiently complex configuration.
I think this issue in contrib around `validate` is related. The kubeletstatsreceiver's `createMetricsReceiver` function is being invoked when `validate` is used.
Perhaps we should revert https://github.com/open-telemetry/opentelemetry-collector/pull/9257 and reopen https://github.com/open-telemetry/opentelemetry-collector/issues/8007 for now, so we can take another stab at implementing validation in a different way?
> The kubeletstatsreceiver's `createMetricsReceiver` function is being invoked when `validate` is used.
That's to be expected, isn't it? The problem there IMO is that a component should not be opening any files during creation, only during startup.
> Perhaps we should revert https://github.com/open-telemetry/opentelemetry-collector/pull/9257 and reopen https://github.com/open-telemetry/opentelemetry-collector/issues/8007 for now, so we can take another stab at implementing validation in a different way?
Let's do this. It also seems like we have not set clear expectations on what components should do at creation time vs startup, so we should work on this as well.
> That's to be expected, isn't it? The problem there IMO is that a component should not be opening any files during creation, only during startup.
That may be true, and it is possible the k8s components have a bug, but the reason the user in that issue started seeing an issue when using validate is that in https://github.com/open-telemetry/opentelemetry-collector/pull/9257, instead of only calling `cfg.Validate`, we also now instantiate the components via `validatePipelineCfg`.
I agree with reverting the change for now.
I'm unable to simply click the "Revert" button on https://github.com/open-telemetry/opentelemetry-collector/pull/9257 so instead I've created https://github.com/open-telemetry/opentelemetry-collector/pull/10078 manually reverting the relevant bits of that PR.
Describe the bug
This behavior might be intended. However, if you write your own entrypoint into the collector, as I have, and you call `otelcol.Collector.DryRun()` with the goal of determining whether the configuration "looks valid" before calling `Run()`, then `service.New()` is now called as part of the dry run, and its return value is discarded. This behavior is a consequence of this commit.

Due to the above behavior, calling `DryRun()` now, for example, calls `open()` and `bind()` for the self-monitoring telemetry-service port and the health-check extension port, among other ports. When `Run()` is subsequently called, the second instance of each server component fails with a `bind()` error because its port has already been opened and bound.

This might be a misuse of the `DryRun()` function. Let me know, and thanks in advance.

Steps to reproduce
Write the above code.
What did you expect to see? The old behavior.
What did you see instead?
What version did you use? v0.98.0
Environment go version go1.22.0 linux/amd64