open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.07k stars 2.37k forks source link

[extension/jaegerremotesampling] Flaky test - TestStartAndShutdownRemote #9113

Closed jpkrohling closed 1 year ago

jpkrohling commented 2 years ago

Logs:

2022-04-06T19:52:18.7017926Z make -C ./extension/jaegerremotesampling test
2022-04-06T19:52:18.7035598Z make[2]: Entering directory '/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/extension/jaegerremotesampling'
2022-04-06T19:52:18.7089756Z go test -race -v -timeout 300s --tags="" ./...
2022-04-06T19:52:19.8199960Z === RUN   TestLoadConfig
2022-04-06T19:52:19.8201021Z --- PASS: TestLoadConfig (0.01s)
2022-04-06T19:52:19.8201622Z === RUN   TestValidate
2022-04-06T19:52:19.8202491Z === RUN   TestValidate/no_receiving_protocols
2022-04-06T19:52:19.8202884Z === RUN   TestValidate/no_sources
2022-04-06T19:52:19.8203215Z === RUN   TestValidate/too_many_sources
2022-04-06T19:52:19.8203777Z --- PASS: TestValidate (0.00s)
2022-04-06T19:52:19.8204379Z     --- PASS: TestValidate/no_receiving_protocols (0.00s)
2022-04-06T19:52:19.8204827Z     --- PASS: TestValidate/no_sources (0.00s)
2022-04-06T19:52:19.8205274Z     --- PASS: TestValidate/too_many_sources (0.00s)
2022-04-06T19:52:19.8205573Z === RUN   TestNewExtension
2022-04-06T19:52:19.8205908Z --- PASS: TestNewExtension (0.00s)
2022-04-06T19:52:19.8206458Z === RUN   TestStartAndShutdownLocalFile
2022-04-06T19:52:19.8206880Z --- PASS: TestStartAndShutdownLocalFile (0.00s)
2022-04-06T19:52:19.8207233Z === RUN   TestStartAndShutdownRemote
2022-04-06T19:52:19.8207674Z 2022/04/06 19:52:19 FATAL: [core] grpc: Server.RegisterService after Server.Serve for "jaeger.api_v2.SamplingManager"
2022-04-06T19:52:19.8208354Z FAIL   github.com/open-telemetry/opentelemetry-collector-contrib/extension/jaegerremotesampling    0.035s
2022-04-06T19:52:19.8208813Z === RUN   TestMissingClientConfigManager
2022-04-06T19:52:19.8209241Z --- PASS: TestMissingClientConfigManager (0.00s)
2022-04-06T19:52:19.8209560Z === RUN   TestStartAndStop
2022-04-06T19:52:19.8209898Z --- PASS: TestStartAndStop (0.00s)
2022-04-06T19:52:19.8210192Z === RUN   TestEndpointsAreWired
2022-04-06T19:52:19.8210490Z === RUN   TestEndpointsAreWired/legacy
2022-04-06T19:52:19.8210798Z === RUN   TestEndpointsAreWired/new
2022-04-06T19:52:19.8211167Z --- PASS: TestEndpointsAreWired (0.00s)
2022-04-06T19:52:19.8211626Z     --- PASS: TestEndpointsAreWired/legacy (0.00s)
2022-04-06T19:52:19.8212094Z     --- PASS: TestEndpointsAreWired/new (0.00s)
2022-04-06T19:52:19.8212416Z === RUN   TestServiceNameIsRequired
2022-04-06T19:52:19.8212804Z --- PASS: TestServiceNameIsRequired (0.00s)
2022-04-06T19:52:19.8213149Z === RUN   TestErrorFromClientConfigManager
2022-04-06T19:52:19.8213583Z --- PASS: TestErrorFromClientConfigManager (0.00s)
2022-04-06T19:52:19.8214155Z PASS
2022-04-06T19:52:19.8214712Z ok     github.com/open-telemetry/opentelemetry-collector-contrib/extension/jaegerremotesampling/internal   (cached)
2022-04-06T19:52:19.8215086Z FAIL
2022-04-06T19:52:19.8328315Z make[2]: *** [../../Makefile.Common:46: test] Error 1
2022-04-06T19:52:19.8332324Z make[1]: *** [Makefile:162: for-all-target-./extension/jaegerremotesampling] Error 2
2022-04-06T19:52:19.8333144Z make[2]: Leaving directory '/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/extension/jaegerremotesampling'
2022-04-06T19:52:19.8336794Z make[1]: Leaving directory '/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib'
2022-04-06T19:52:19.8337572Z make: *** [Makefile:72: gotest] Error 2
2022-04-06T19:52:19.8359336Z ##[error]Process completed with exit code 2.
jpkrohling commented 2 years ago

And here: https://github.com/open-telemetry/opentelemetry-collector-contrib/runs/5887191395?check_suite_focus=true

pmm-sumo commented 2 years ago

Here as well: https://github.com/open-telemetry/opentelemetry-collector-contrib/runs/5972401892?check_suite_focus=true

djaglowski commented 2 years ago

This failed again here: https://github.com/open-telemetry/opentelemetry-collector-contrib/runs/7957402089?check_suite_focus=true#step:8:940

=== RUN   TestStartAndShutdownRemote
    extension_test.go:83: 
            Error Trace:    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/extension/jaegerremotesampling/extension_test.go:83
            Error:          Received unexpected error:
                            error while starting the gRPC server: failed to listen on gRPC port: listen tcp :14250: bind: address already in use
            Test:           TestStartAndShutdownRemote
jpkrohling commented 2 years ago

cc @frzifus, I think I pinged you on the wrong issue. I wanted to ping you on this one here. Sorry about the noise.

frzifus commented 2 years ago

Sure, i can take over :)

frzifus commented 2 years ago

i wonder that it fails in line 83. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/ea97fda88d979b29b31885bd8dc9f116ef983d5e/extension/jaegerremotesampling/extension_test.go#L83

Since the port is already allocated in line 58. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/ea97fda88d979b29b31885bd8dc9f116ef983d5e/extension/jaegerremotesampling/extension_test.go#L57-L58

I changed localhost:0 to localhost:8080 while blocking that port with nc -l -p 8080. Then i see:

=== RUN   TestStartAndShutdownRemote
    extension_test.go:58: 
            Error Trace:    /home/frzifus/git/go/observability/opentelemetry-collector-contrib/extension/jaegerremotesampling/extension_test.go:58
            Error:          Received unexpected error:
                            listen tcp 127.0.0.1:8080: bind: address already in use
            Test:           TestStartAndShutdownRemote
--- FAIL: TestStartAndShutdownRemote (0.00s)
FAIL
exit status 1

If you take a closer look at the extention start method, only a connection to the server is established but no port is allocated.

What do you think about adding a retry mechanism to the test evaluation?

djaglowski commented 2 years ago

Again here: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/3197849469/jobs/5221638347#step:8:1027

github-actions[bot] commented 2 years ago

Pinging code owners: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling commented 2 years ago

What do you think about adding a retry mechanism to the test evaluation?

That might be worth trying, but this failure pattern happens elsewhere. I think there's an underlying problem we need to solve, I just don't know yet which problem this is. Without digging too much into it, I would guess this is related to how we run the tests (with -race and some parallel test executions).

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme commented 1 year ago

We just hit this again: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/3851828150/jobs/6563425908

=== RUN   TestStartAndShutdownRemote
    extension_test.go:83: 
            Error Trace:    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/extension/jaegerremotesampling/extension_test.go:83
            Error:          Received unexpected error:
                            error while starting the gRPC server: failed to listen on gRPC port: listen tcp 127.0.0.1:14250: bind: address already in use
            Test:           TestStartAndShutdownRemote
--- FAIL: TestStartAndShutdownRemote (0.00s)
atoulme commented 1 year ago

Just hit this again:

=== RUN   TestStartAndShutdownRemote
    extension_test.go:83: 
            Error Trace:    /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/extension/jaegerremotesampling/extension_test.go:83
            Error:          Received unexpected error:
                            error while starting the gRPC server: failed to listen on gRPC port: listen tcp 127.0.0.1:14250: bind: address already in use
            Test:           TestStartAndShutdownRemote

https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/4301204205/jobs/7498184609

atoulme commented 1 year ago

another occurrence: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/4378772863/jobs/7663958484

lewis262626 commented 1 year ago

I have experienced the same in https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/4387063396/jobs/7685532139#step:9:520