open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
4.36k stars 1.45k forks source link

Flaky metrics tests after #6297, never seen this before #6323

Closed bogdandrutu closed 1 year ago

bogdandrutu commented 2 years ago

See https://github.com/open-telemetry/opentelemetry-collector/actions/runs/3255902434/jobs/5345701291

paivagustavo commented 2 years ago

This is not related to #6297, I've tracked down the same test error a month ago: commit and ci, there should be other occurrences but stopped here.

But since I started to debug this, I managed to discover that the collector integration test always try to register zpages to the same address (127.0.0.1:55679) instead of using the helper testutil.GetAvailableLocalAddress(t) to get an available address. Occasionally, this port is not be available, possibly because of this, ending the test with the error:

failed to start extensions: listen tcp 127.0.0.1:55679: bind: An attempt was made to access a socket in a way forbidden by its access permissions.

The following tests that fails with a different view with the same name is already registered is because we don't clean up the telemetryInitializer when the collector.service.Start(ctx) errors, in this case because the zpage extension was not started properly. Similar to https://github.com/open-telemetry/opentelemetry-collector/issues/6238.

This can lead to other resource leaks because we don't clean up any resource that has been correctly started. For example, if an error happens on the NotifyPipelineReady function, all started extensions and pipelines would not be shutdown.