open-telemetry / opentelemetry-collector

OpenTelemetry Collector

otelcol.Collector.DryRun() now instantiates all components #10031

Status: Closed (closed by SeanPMiller 2 weeks ago)

SeanPMiller commented 3 weeks ago

Describe the bug
This behavior might be intended. However, if you write your own entrypoint into the collector, as I have, and you call otelcol.Collector.DryRun() to determine whether the configuration "looks valid" before calling Run(), then service.New() is now called as part of the dry run, and its return value is discarded. This behavior is a consequence of this commit.

Because of this, calling DryRun() now opens and binds, among others, the self-monitoring telemetry-service port and the health-check extension port. When Run() is subsequently called, the second instance of each server component fails with a bind() error because its port has already been opened and bound.

This might be a misuse of the DryRun() function. Let me know, and thanks in advance.
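
For illustration, a minimal sketch of the entrypoint pattern in question (simplified, assuming the standard otelcol APIs; not my exact code):

import (
    "context"
    "fmt"

    "go.opentelemetry.io/collector/otelcol"
)

func runCollector(ctx context.Context, settings otelcol.CollectorSettings) error {
    col, err := otelcol.NewCollector(settings)
    if err != nil {
        return fmt.Errorf("failed to construct collector: %w", err)
    }
    // As of v0.98.0, this call instantiates every component and, as a
    // side effect, binds the self-telemetry and health-check ports.
    if err := col.DryRun(ctx); err != nil {
        return fmt.Errorf("configuration is invalid: %w", err)
    }
    // The second instantiation here then fails with "address already in use".
    return col.Run(ctx)
}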

Steps to reproduce
Write the above code.

What did you expect to see?
The old behavior, where DryRun() validates the configuration without instantiating components.

What did you see instead?

Error: failed to start extensions: failed to bind to address 0.0.0.0:13133: listen tcp 0.0.0.0:13133: bind: address already in use; failed to shutdown pipelines: no existing monitoring routine is running
2024/04/24 22:03:22 collector server run finished with error: failed to start extensions: failed to bind to address 0.0.0.0:13133: listen tcp 0.0.0.0:13133: bind: address already in use; failed to shutdown pipelines: no existing monitoring routine is running

What version did you use?
v0.98.0

Environment
go version go1.22.0 linux/amd64

SeanPMiller commented 3 weeks ago

As a workaround, I have switched to using the following function:

import (
    "context"
    "fmt"

    "go.opentelemetry.io/collector/otelcol"
)

// temporaryDryRunPatch resolves and validates the configuration without
// instantiating any components, mirroring the pre-v0.98.0 DryRun behavior.
func temporaryDryRunPatch(settings otelcol.CollectorSettings, ctxt context.Context) (err error) {
    factories, err := settings.Factories()
    if err != nil {
        return fmt.Errorf("failed to initialize factories: %w", err)
    }

    confProv, err := otelcol.NewConfigProvider(settings.ConfigProviderSettings)
    if err != nil {
        return fmt.Errorf("failed to instantiate configuration provider: %w", err)
    }
    defer func() {
        // Fold any shutdown error into the returned error without losing either.
        if shutdownErr := confProv.Shutdown(ctxt); shutdownErr != nil {
            if err != nil {
                err = fmt.Errorf("configuration-provider shutdown failed after encountering error: shutdown_err=%w, original_err=%w", shutdownErr, err)
            } else {
                err = fmt.Errorf("configuration-provider shutdown failed: %w", shutdownErr)
            }
        }
    }()

    cfg, err := confProv.Get(ctxt, factories)
    if err != nil {
        return fmt.Errorf("failed to get config: %w", err)
    }

    return cfg.Validate()
}

This gives me the behavior I want, so this is not an especially high-priority issue.
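
For context, roughly how this slots into my entrypoint (a simplified sketch, not my exact code):

func run(ctx context.Context, settings otelcol.CollectorSettings) error {
    // Validate first, without instantiating any components.
    if err := temporaryDryRunPatch(settings, ctx); err != nil {
        return err
    }
    // Only now construct and run the collector, so each port is bound exactly once.
    col, err := otelcol.NewCollector(settings)
    if err != nil {
        return err
    }
    return col.Run(ctx)
}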

mx-psi commented 3 weeks ago

Thanks for reporting! Seems like this was caused by #9257. cc @djaglowski @ycombinator

I don't quite understand why it happens, though; my expectation would be that no extension actually registers anything until we call Service.Start. @SeanPMiller, two questions for you:

  1. Does this only happen with the healthcheck extension? Or can you see this happening with other components?
  2. Can you confirm that this does not happen in v0.97.0?
SeanPMiller commented 3 weeks ago

I'll work on a demonstration program. Having a day job is a drag.

SeanPMiller commented 3 weeks ago

https://github.com/SeanPMiller/otelcol-issue-10031

This is essentially condensed from otelcorecol, except for the custom root command.

SeanPMiller commented 3 weeks ago

Specific answers:

  1. I see this in production with both the selfmon Prometheus telemetry service and the healthcheck extension. The log shows, for example, the following:
    2024-04-25T18:19:24.987Z    info    service@v0.98.0/telemetry.go:55 Setting up own telemetry...
    2024-04-25T18:19:24.987Z    info    service@v0.98.0/telemetry.go:97 Serving metrics {"address": ":8888", "level": "Basic"}
    2024-04-25T18:19:24.990Z    info    service@v0.98.0/telemetry.go:55 Setting up own telemetry...
    2024-04-25T18:19:24.990Z    info    service@v0.98.0/telemetry.go:97 Serving metrics {"address": ":8888", "level": "Basic"}
  2. I can confirm that this does not happen in v0.97.0, as demonstrated by the reproduction above.
mx-psi commented 3 weeks ago

Thanks for the repro! I see how this can fail with the Prometheus endpoint, but I am still a bit puzzled by the healthcheck case.

SeanPMiller commented 3 weeks ago

I am, too. I'll see if I can reproduce it, but it only seems to happen with a sufficiently complex configuration.

TylerHelmuth commented 3 weeks ago

I think this issue in contrib around validate is related: the kubeletstatsreceiver's createMetricsReceiver function is being invoked when validate is used.

ycombinator commented 2 weeks ago

Perhaps we should revert https://github.com/open-telemetry/opentelemetry-collector/pull/9257 and reopen https://github.com/open-telemetry/opentelemetry-collector/issues/8007 for now, so we can take another stab at implementing validation in a different way?

mx-psi commented 2 weeks ago

The kubeletstatsreceiver's createMetricsReceiver function is being invoked when validate is used.

That's to be expected, isn't it? The problem there IMO is that a component should not be opening any files during creation, only during startup.

Perhaps we should revert https://github.com/open-telemetry/opentelemetry-collector/pull/9257 and reopen https://github.com/open-telemetry/opentelemetry-collector/issues/8007 for now, so we can take another stab at implementing validation in a different way?

Let's do this. It also seems like we have not set clear expectations about what components should do at creation time vs. at startup, so we should work on this as well.
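
To make the intended split concrete, a rough sketch of a well-behaved, purely hypothetical extension, where creation only captures configuration and all resource acquisition waits for Start:

package example // hypothetical extension, for illustration only

import (
    "context"
    "net"

    "go.opentelemetry.io/collector/component"
)

type portExtension struct {
    endpoint string
    ln       net.Listener
}

var _ component.Component = (*portExtension)(nil)

// Creation captures configuration only. Opening the socket here would make
// DryRun/validate paths observably side-effectful, as in this issue.
func newPortExtension(endpoint string) *portExtension {
    return &portExtension{endpoint: endpoint}
}

// Start is where resources are acquired.
func (e *portExtension) Start(_ context.Context, _ component.Host) error {
    ln, err := net.Listen("tcp", e.endpoint)
    if err != nil {
        return err
    }
    e.ln = ln
    return nil
}

// Shutdown releases whatever Start acquired.
func (e *portExtension) Shutdown(context.Context) error {
    if e.ln != nil {
        return e.ln.Close()
    }
    return nil
}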

TylerHelmuth commented 2 weeks ago

That's to be expected, isn't it? The problem there IMO is that a component should not be opening any files during creation, only during startup.

That may be true, and it is possible the k8s components have a bug, but the reason the user in that issue started seeing a problem when using validate is that, as of https://github.com/open-telemetry/opentelemetry-collector/pull/9257, instead of only calling cfg.Validate we now also instantiate the components via validatePipelineCfg.

TylerHelmuth commented 2 weeks ago

I agree with reverting the change for now.

ycombinator commented 2 weeks ago

I'm unable to simply click the "Revert" button on https://github.com/open-telemetry/opentelemetry-collector/pull/9257, so I've created https://github.com/open-telemetry/opentelemetry-collector/pull/10078, which manually reverts the relevant bits of that PR.