songy23 opened this issue 5 months ago (status: Open)
Pinging code owners for testbed: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.
Looking into this a bit. Successful runs of this test take about 3 seconds (example).
From the panic stack trace, this one in particular looks interesting:
goroutine 1004 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0xc005df65a0?)
/opt/hostedtoolcache/go/1.21.12/x64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0035140a8?)
/opt/hostedtoolcache/go/1.21.12/x64/src/sync/waitgroup.go:116 +0x48
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Stop(0xc0015300e0)
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:104 +0x4c
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*TestCase).StopAgent(0xc003c3b180)
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/test_case.go:165 +0x24
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/correctnesstests/traces.testWithTracingGoldenDataset(0x651cca0?, {0x7fe574eb2048, 0xc0017a46c0}, {0x4571c38, 0xc0066d8480}, {0x0?, 0x0?, 0xc0018f1760?, 0x498f6f?}, 0xc0016e55c0)
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/correctnesstests/traces/correctness_test.go:92 +0x712
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/correctnesstests/traces.TestTracingGoldenData.func1(0xa6f12b?)
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/correctnesstests/traces/correctness_test.go:39 +0x4a
testing.tRunner(0xc002bde820, 0xc00380f788)
/opt/hostedtoolcache/go/1.21.12/x64/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 7
/opt/hostedtoolcache/go/1.21.12/x64/src/testing/testing.go:1648 +0x3ad
It looks like the actual test functionality succeeds, but shutting down the in-process collector blocks on this Wait() call. Since the failure is intermittent, the WaitGroup usage presumably has a bug that is only hit occasionally.
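For context, the testbed's in-process collector follows roughly this Start/Stop shape. This is a simplified sketch of the pattern, not the actual testbed source; the type and field names are approximate:

```go
package sketch

import (
	"context"
	"sync"

	"go.opentelemetry.io/collector/otelcol"
)

// inProcessCollectorSketch approximates the lifecycle of testbed's inProcessCollector.
type inProcessCollectorSketch struct {
	col *otelcol.Collector
	wg  sync.WaitGroup
}

func (ipp *inProcessCollectorSketch) Start() {
	ipp.wg.Add(1)
	go func() {
		defer ipp.wg.Done()
		// Run blocks until the collector has fully shut down.
		_ = ipp.col.Run(context.Background())
	}()
}

func (ipp *inProcessCollectorSketch) Stop() {
	// Ask the collector to shut down, then wait for Run to return.
	ipp.col.Shutdown()
	// If Run never returns, this Wait() is exactly where goroutine 1004
	// sits ("semacquire, 9 minutes") in the trace above.
	ipp.wg.Wait()
}
```

So the hang on Wait() just means the goroutine running Collector.Run (goroutine 1008 below, created by Start in goroutine 1004) never finished.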
Looking further, we see the collector is still running, which blocks Done from being called. The collector process itself is attempting to shut down, but is blocked reporting its status, waiting to acquire the reporter's mutex. We learned this from this stack trace:
goroutine 1008 [sync.Mutex.Lock, 9 minutes]:
sync.runtime_SemacquireMutex(0xc00279b908?, 0x85?, 0xc0013516b0?)
/opt/hostedtoolcache/go/1.21.12/x64/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc001498810)
/opt/hostedtoolcache/go/1.21.12/x64/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
/opt/hostedtoolcache/go/1.21.12/x64/src/sync/mutex.go:90
go.opentelemetry.io/collector/service/internal/status.(*reporter).ReportStatus(0xc001498810, 0x4571c00?, 0x66d68a0?)
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:135 +0x4c
go.opentelemetry.io/collector/service/internal/graph.(*Graph).ShutdownAll(0xc001530700, {0x4571c00, 0x66d68a0}, {0x455bfb0, 0xc001498810})
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/graph/graph.go:469 +0x356
go.opentelemetry.io/collector/service.(*Service).Shutdown(0xc004030870, {0x4571c00, 0x66d68a0})
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/service.go:267 +0xdf
go.opentelemetry.io/collector/otelcol.(*Collector).shutdown(0xc000f0bc20, {0x4571c00, 0x66d68a0})
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/otelcol@v0.104.1-0.20240712081520-6227646b0146/collector.go:353 +0xc5
go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc000f0bc20, {0x4571c00, 0x66d68a0})
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/otelcol@v0.104.1-0.20240712081520-6227646b0146/collector.go:339 +0x9b0
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Start.func2()
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:80 +0x65
created by github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Start in goroutine 1004
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:78 +0x35d
We can then find the other goroutine that holds the lock: it belongs to the OpenCensus receiver. The receiver reports status during shutdown if the server was closed with an unexpected error.
goroutine 1027 [chan send, 9 minutes]:
go.opentelemetry.io/collector/service.(*serviceHost).notifyComponentStatusChange(...)
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/host.go:72
go.opentelemetry.io/collector/service/internal/status.(*reporter).componentFSM.func1(0x34eeec0?)
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:164 +0x25
go.opentelemetry.io/collector/service/internal/status.(*fsm).transition(0xc003e758c0, 0xc00422b740)
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:40 +0x8b
go.opentelemetry.io/collector/service/internal/status.(*reporter).ReportStatus(0xc001498810, 0x0?, 0xc0000b7f70?)
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:140 +0x99
go.opentelemetry.io/collector/service/internal/graph.(*Graph).buildComponents.TelemetrySettings.ToComponentTelemetrySettings.NewReportStatusFunc.func1(0x4531980?)
/home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:178 +0x24
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver.(*ocReceiver).Start.func3()
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/opencensusreceiver/opencensus.go:159 +0x119
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver.(*ocReceiver).Start in goroutine 1008
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/opencensusreceiver/opencensus.go:156 +0xa59
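Putting the two traces together, this looks like a deadlock between a mutex and an unbuffered channel: goroutine 1027 holds the reporter's mutex while blocked on a channel send, and goroutine 1008, which (as far as I can tell) is the one that would normally drain that channel from Collector.Run's event loop, has already moved on to Shutdown and is blocked waiting for the same mutex. A minimal standalone sketch of the pattern, with illustrative types and names only (not the collector's actual internals):

```go
package main

import "sync"

// reporter mimics the shape of the problem: ReportStatus holds a mutex
// while delivering an event over an unbuffered channel.
type reporter struct {
	mu     sync.Mutex
	events chan string
}

func (r *reporter) ReportStatus(ev string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events <- ev // blocks, still holding mu, until someone receives
}

func main() {
	r := &reporter{events: make(chan string)}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Plays the role of goroutine 1027 (the receiver's Serve goroutine):
		// report a "fatal" status, taking the mutex and blocking on the send.
		r.ReportStatus("fatal: serve returned an unexpected error")
	}()

	// Plays the role of goroutine 1008 (Collector.Run -> Shutdown -> ShutdownAll):
	// it would normally be the one reading r.events, but it tries to report
	// status itself first, so it blocks on the mutex and the channel is never drained.
	r.ReportStatus("component stopping")

	// Never reached; the testbed's Stop() (goroutine 1004) stays stuck on Wait().
	wg.Wait()
}
```

In this tiny program the Go runtime detects the deadlock and aborts; in the real collector other goroutines are still alive, so the test simply hangs until the test timeout.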
Reporting status during shutdown is currently broken, as filed in https://github.com/open-telemetry/opentelemetry-collector/issues/9824; this is an instance of that bug.
Pinging code owners for receiver/opencensus: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.
This is the root cause blocking call: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/7a01297fdaa8db6c60011bf487c426c2bf570e42/receiver/opencensusreceiver/opencensus.go#L159
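For context, the goroutine at that line looks roughly like this (paraphrased; the exact pre-fix condition may differ slightly from the source):

```go
go func() {
	// If Serve returns anything other than cmux.ErrServerClosed, the receiver
	// reports it as a fatal status -- this is the blocking call in goroutine 1027.
	if err := ocr.multiplexer.Serve(); err != nil && !errors.Is(err, cmux.ErrServerClosed) {
		ocr.settings.TelemetrySettings.ReportStatus(component.NewFatalErrorEvent(err))
	}
}()
```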
The next step is to investigate whether we're simply failing to treat an error returned during receiver shutdown as expected, or whether there is actually a bug in shutting the receiver down.
The referenced core issue is why this test hits timeouts, but during a successful run the status shouldn't need to be reported at all, as far as I can tell. So there may be an actual bug in the opencensus receiver that keeps it from shutting down properly, or from handling error messages that should be expected.
Agree with this, it might be missing the check for ErrListenerClosed in the condition. There might be a situation where the listener was already closed in the test. So the full if statement should look like the following:
if err := ocr.multiplexer.Serve(); !errors.Is(err, cmux.ErrServerClosed) && !errors.Is(err, cmux.ErrListenerClosed) && err != nil {
ocr.settings.TelemetrySettings.ReportStatus(component.NewFatalErrorEvent(err))
}
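As a quick sanity check of that condition (a hypothetical standalone snippet, not part of the proposed PR), a nil error and both cmux sentinels, wrapped or not, fall through without being reported as fatal:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/soheilhy/cmux"
)

// fatal has the same shape as the proposed condition above.
func fatal(err error) bool {
	return err != nil && !errors.Is(err, cmux.ErrServerClosed) && !errors.Is(err, cmux.ErrListenerClosed)
}

func main() {
	fmt.Println(fatal(nil))                                             // false
	fmt.Println(fatal(cmux.ErrServerClosed))                            // false
	fmt.Println(fatal(cmux.ErrListenerClosed))                          // false
	fmt.Println(fatal(fmt.Errorf("serve: %w", cmux.ErrListenerClosed))) // false
	fmt.Println(fatal(errors.New("some other accept error")))           // true
}
```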
Will open a PR and try a few runs until we confirm it's resolved.
Still hitting this even after #34093 was merged: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/10010459335/job/27671703924#step:7:1503
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Component(s)
testbed
Describe the issue you're reporting
Seeing a lot of occurrences of this test timeout in CI:
E.g. https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9761790955/job/26943836035?pr=33856#step:7:904
Note this is different from https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/27295: that issue is caused by the same port being bound twice, while this one seems to be due to testbed.(*inProcessCollector).Stop taking too long (a deadlock?).