open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Flaky test: TestTracingGoldenData/otlp-opencensus timed out after 10m #33865

Open songy23 opened 3 months ago

songy23 commented 3 months ago

Component(s)

testbed

Describe the issue you're reporting

Seeing a lot of occurrences of this test timeout in CI:

panic: test timed out after 10m0s
running tests:
    TestTracingGoldenData (10m0s)
    TestTracingGoldenData/opencensus-otlp (9m30s)

E.g. https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9761790955/job/26943836035?pr=33856#step:7:904

Note this is different from https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/27295: that issue is due to the same port being bound twice, while this one seems to be due to testbed.(*inProcessCollector).Stop taking too long (deadlock?).

github-actions[bot] commented 3 months ago

Pinging code owners for testbed: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 3 months ago

Pinging code owners:

crobert-1 commented 2 months ago

+1 freq: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9944862348/job/27471861996?pr=34083

crobert-1 commented 2 months ago

Looking into this a bit. Successful runs of this test take about 3 seconds (example).

From the panic stack trace, this one in particular looks interesting:

goroutine 1004 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0xc005df65a0?)
    /opt/hostedtoolcache/go/1.21.12/x64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0035140a8?)
    /opt/hostedtoolcache/go/1.21.12/x64/src/sync/waitgroup.go:116 +0x48
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Stop(0xc0015300e0)
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:104 +0x4c
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*TestCase).StopAgent(0xc003c3b180)
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/test_case.go:165 +0x24
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/correctnesstests/traces.testWithTracingGoldenDataset(0x651cca0?, {0x7fe574eb2048, 0xc0017a46c0}, {0x4571c38, 0xc0066d8480}, {0x0?, 0x0?, 0xc0018f1760?, 0x498f6f?}, 0xc0016e55c0)
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/correctnesstests/traces/correctness_test.go:92 +0x712
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/correctnesstests/traces.TestTracingGoldenData.func1(0xa6f12b?)
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/correctnesstests/traces/correctness_test.go:39 +0x4a
testing.tRunner(0xc002bde820, 0xc00380f788)
    /opt/hostedtoolcache/go/1.21.12/x64/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 7
    /opt/hostedtoolcache/go/1.21.12/x64/src/testing/testing.go:1648 +0x3ad

It looks like the actual test functionality succeeds, but shutting down the in-process collector is blocking on this Wait() call. Since the failure is intermittent, the WaitGroup usage must have some kind of bug that is only hit occasionally.
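
For context, the in-process collector used by the testbed follows roughly the Start/Stop shape sketched below. The names (runner, stop) are hypothetical and this is not the actual in_process_collector.go code; it only illustrates why a shutdown that never finishes leaves Stop parked in Wait():

package main

import (
    "fmt"
    "sync"
    "time"
)

// runner stands in for the collector wrapped by the testbed's
// inProcessCollector; all names here are hypothetical, not the real API.
type runner struct {
    wg   sync.WaitGroup
    stop chan struct{}
}

// Start launches the "collector" in a goroutine, mirroring the shape of
// inProcessCollector.Start: the WaitGroup is only released when the run
// loop returns.
func (r *runner) Start() {
    r.stop = make(chan struct{})
    r.wg.Add(1)
    go func() {
        defer r.wg.Done()
        <-r.stop // stands in for Collector.Run blocking until shutdown finishes
    }()
}

// Stop asks the run loop to exit and then waits for it, like
// inProcessCollector.Stop. If shutdown never completes (for example, because
// it deadlocks while reporting status), Wait() blocks forever, which is
// exactly what the 10m test timeout is catching.
func (r *runner) Stop() {
    close(r.stop)
    r.wg.Wait()
}

func main() {
    r := &runner{}
    r.Start()
    time.Sleep(100 * time.Millisecond)
    r.Stop()
    fmt.Println("stopped cleanly")
}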

Looking further, we see the collector is still running, which blocks Done from being called. The collector itself is attempting to shut down, but is blocked reporting its status, waiting to acquire the reporter's mutex. We can see this from this stack trace:

goroutine 1008 [sync.Mutex.Lock, 9 minutes]:
sync.runtime_SemacquireMutex(0xc00279b908?, 0x85?, 0xc0013516b0?)
    /opt/hostedtoolcache/go/1.21.12/x64/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc001498810)
    /opt/hostedtoolcache/go/1.21.12/x64/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
    /opt/hostedtoolcache/go/1.21.12/x64/src/sync/mutex.go:90
go.opentelemetry.io/collector/service/internal/status.(*reporter).ReportStatus(0xc001498810, 0x4571c00?, 0x66d68a0?)
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:135 +0x4c
go.opentelemetry.io/collector/service/internal/graph.(*Graph).ShutdownAll(0xc001530700, {0x4571c00, 0x66d68a0}, {0x455bfb0, 0xc001498810})
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/graph/graph.go:469 +0x356
go.opentelemetry.io/collector/service.(*Service).Shutdown(0xc004030870, {0x4571c00, 0x66d68a0})
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/service.go:267 +0xdf
go.opentelemetry.io/collector/otelcol.(*Collector).shutdown(0xc000f0bc20, {0x4571c00, 0x66d68a0})
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/otelcol@v0.104.1-0.20240712081520-6227646b0146/collector.go:353 +0xc5
go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc000f0bc20, {0x4571c00, 0x66d68a0})
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/otelcol@v0.104.1-0.20240712081520-6227646b0146/collector.go:339 +0x9b0
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Start.func2()
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:80 +0x65
created by github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Start in goroutine 1004
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:78 +0x35d

We can then find the other goroutine that holds the lock: it belongs to the opencensus receiver. The receiver reports status during shutdown if the server is closed with an unexpected error.

goroutine 1027 [chan send, 9 minutes]:
go.opentelemetry.io/collector/service.(*serviceHost).notifyComponentStatusChange(...)
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/host.go:72
go.opentelemetry.io/collector/service/internal/status.(*reporter).componentFSM.func1(0x34eeec0?)
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:164 +0x25
go.opentelemetry.io/collector/service/internal/status.(*fsm).transition(0xc003e758c0, 0xc00422b740)
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:40 +0x8b
go.opentelemetry.io/collector/service/internal/status.(*reporter).ReportStatus(0xc001498810, 0x0?, 0xc0000b7f70?)
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:140 +0x99
go.opentelemetry.io/collector/service/internal/graph.(*Graph).buildComponents.TelemetrySettings.ToComponentTelemetrySettings.NewReportStatusFunc.func1(0x4531980?)
    /home/runner/go/pkg/mod/go.opentelemetry.io/collector/service@v0.104.1-0.20240712081520-6227646b0146/internal/status/status.go:178 +0x24
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver.(*ocReceiver).Start.func3()
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/opencensusreceiver/opencensus.go:159 +0x119
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver.(*ocReceiver).Start in goroutine 1008
    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/opencensusreceiver/opencensus.go:156 +0xa59

Reporting status during shutdown is currently broken, as filed in https://github.com/open-telemetry/opentelemetry-collector/issues/9824; this is another occurrence of that bug.
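
To make the deadlock shape concrete, here is a minimal standalone sketch of the interaction visible in the two goroutine dumps above. reporter, events and ReportStatus are hypothetical stand-ins, not the collector's internal API; the point is only that a mutex held across an undrained channel send blocks every other ReportStatus caller:

package main

import (
    "fmt"
    "sync"
    "time"
)

// reporter, events and ReportStatus are hypothetical stand-ins for the
// collector's internal status reporter and async error channel.
type reporter struct {
    mu     sync.Mutex
    events chan string
}

// ReportStatus takes the mutex and forwards the event to the host. If nothing
// is draining the channel (as during shutdown), the send blocks while the
// mutex is still held (compare goroutine 1027: "chan send, 9 minutes").
func (r *reporter) ReportStatus(ev string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.events <- ev
}

func main() {
    r := &reporter{events: make(chan string)} // unbuffered and never drained

    // The receiver's Serve goroutine reports an error it considers fatal.
    go r.ReportStatus("opencensus receiver: server closed unexpectedly")

    time.Sleep(50 * time.Millisecond)

    // Shutdown also reports status and blocks on the same mutex
    // (compare goroutine 1008: "sync.Mutex.Lock, 9 minutes").
    done := make(chan struct{})
    go func() {
        r.ReportStatus("collector: shutting down")
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("shutdown reported status")
    case <-time.After(time.Second):
        fmt.Println("stuck: mutex held by a goroutine blocked on an undrained channel send")
    }
}

Running this prints the "stuck" branch after a second, which is the same shape the 10-minute test timeout eventually reports as goroutines parked in chan send and sync.Mutex.Lock.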

github-actions[bot] commented 2 months ago

Pinging code owners for receiver/opencensus: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 commented 2 months ago

This is the root cause blocking call: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/7a01297fdaa8db6c60011bf487c426c2bf570e42/receiver/opencensusreceiver/opencensus.go#L159

The next step is to investigate whether we're simply missing an expected error that's returned when shutting down the receiver, or whether there is actually a bug in shutting it down.

The referenced core issue is why this test hits timeouts, but during a successful test run the status shouldn't need to be reported at all, from what I can tell. So there may be an actual bug in the opencensus receiver that results in it not shutting down properly, or not properly handling errors that should be expected.

odubajDT commented 2 months ago

Agree with the analysis above; it might be missing a check for cmux.ErrListenerClosed in the condition. There might be a situation where the listener was already closed in the test. So the full if statement should look like the following:

if err := ocr.multiplexer.Serve(); !errors.Is(err, cmux.ErrServerClosed) && !errors.Is(err, cmux.ErrListenerClosed) && err != nil {
    ocr.settings.TelemetrySettings.ReportStatus(component.NewFatalErrorEvent(err))
}

I will open a PR and try a few runs until we confirm it's resolved.

crobert-1 commented 2 months ago

+1 freq: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9960091743/job/27518737021?pr=34083

crobert-1 commented 2 months ago

Still hitting this even after #34093 was merged: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/10010459335/job/27671703924#step:7:1503

crobert-1 commented 1 month ago

+1 freq: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/10269832050/job/28416207324?pr=34422