open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

Integration tests consistently fail on first run #34495

Closed: hughesjj closed this 2 weeks ago

hughesjj commented 2 months ago

Component(s)

ci/cd

What happened?

Description

Integration tests don't fail the parent workflow if an integration test fails

Steps to Reproduce

Example job

At least two tests in this spot check fail, but the workflow still succeeds. Search for "jmx" in the build logs; the relevant output is also included below.

Expected Result

I expect the failing test to fail the workflow

Actual Result

(screenshot attached in the original issue)

Collector version

latest (main)

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

DONE 42 tests in 6.975s
Running target 'mod-integration-test' in module 'receiver/jmxreceiver' as part of group 'receiver-1'
make --no-print-directory -C receiver/jmxreceiver mod-integration-test
go: downloading go.opentelemetry.io/collector/receiver/otlpreceiver v0.106.1
running go integration test ./... in /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/jmxreceiver
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration,""
∅  internal/metadata
✖  internal/subprocess (776ms)
✖  . (28.327s)

However, on rerun, it passes more or less immediately

∅  internal/metadata
✖  internal/subprocess (769ms)
✖  . (33.44s)

DONE 51 tests, 2 failures in 34.640s

✓  . (cached)
✓  internal/subprocess (cached)

DONE 2 runs, 51 tests in 35.369s

Another receiver:


Running target 'mod-integration-test' in module 'receiver/googlecloudspannerreceiver' as part of group 'receiver-1'
make --no-print-directory -C receiver/googlecloudspannerreceiver mod-integration-test
go: downloading cloud.google.com/go/spanner v1.65.0
go: downloading github.com/mitchellh/hashstructure v1.1.0
go: downloading github.com/ReneKroon/ttlcache/v2 v2.11.0
go: downloading github.com/GoogleCloudPlatform/grpc-gcp-go/grpcgcp v1.5.0
go: downloading github.com/envoyproxy/go-control-plane v0.12.0
go: downloading github.com/cncf/xds/go v0.0.0-20240423153145-555b57ec207b
go: downloading github.com/envoyproxy/protoc-gen-validate v1.0.4
go: downloading cel.dev/expr v0.15.0
go: downloading github.com/census-instrumentation/opencensus-proto v0.4.1
running go integration test ./... in /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration,""
✖  internal/filter (567ms)
✓  internal/filterfactory (1.069s)
✓  internal/datasource (1.183s)
✓  internal/metadata (1.129s)
✓  internal/metadataconfig (1.105s)
✓  . (3.911s)
✓  internal/metadataparser (1.064s)
✓  internal/statsreader (1.193s)

DONE 288 tests, 1 skipped, 1 failure in 23.004s


Additional context

No response
hughesjj commented 2 months ago

/label ci-cd

odubajDT commented 2 months ago

I am going to look into this issue. It seems like gotestsum, which is used to run both the unit tests and the integration tests, is not propagating the exit code of the go test process it wraps, since gotestsum is just a wrapper around it. I am going to investigate the possible options here.
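A quick local check (a sketch, not an existing make target; it reuses the gotestsum invocation from the CI logs above, with the tool path assumed relative to the module directory, run against a module with a failing integration test):

cd receiver/jmxreceiver
../../.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration
# print the exit code that make (and therefore the workflow step) would see
echo "gotestsum exit code: $?"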

hughesjj commented 2 months ago

@odubajDT I would assign this to you, but I don't have assignment permissions, sorry. Feel free to assign it to yourself, or to me if you don't have the time to take it up.

Adding some useful links for context…

Thoughts on adding a canary to the integration tests? (e.g. replace a random file in one of the above integration targets with /dev/random or a known-failing file, and see whether the whole workflow fails.)
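For illustration, a minimal sketch of what such a canary could look like (the file, package, and test names are hypothetical; it assumes the integration suites are selected via the -tags=integration build constraint visible in the gotestsum command lines above):

//go:build integration

package jmxreceiver

import "testing"

// TestIntegrationCanary fails unconditionally. If the workflow still reports
// success with this test in place, the failure exit code is not being propagated.
func TestIntegrationCanary(t *testing.T) {
	t.Fatal("integration-test canary: this failure should fail the whole workflow")
}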

hughesjj commented 2 months ago

Hey @odubajDT, I'd like to take a stab at this tomorrow if you don't have a fix or root cause for it yet. I wanted to give you the opportunity to share anything before I jump the gun.

odubajDT commented 2 months ago

> Hey @odubajDT, I'd like to take a stab at this tomorrow if you don't have a fix or root cause for it yet. I wanted to give you the opportunity to share anything before I jump the gun.

Hey @hughesjj, sorry for not responding, I was offline for a week. Sure, take a look if you are interested.

odubajDT commented 2 months ago

From what I can see, the tests actually pass, so the status is reported correctly:

running go integration test ./... in /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/jmxreceiver
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration,""
∅  internal/metadata
✖  internal/subprocess (776ms)
✖  . (28.327s)

DONE 51 tests, 2 failures in 30.608s

✓  . (1.047s)
✓  internal/subprocess (1.01s)

DONE 2 runs, 51 tests in 34.243s

As we can see, the second run of internal/subprocess passes, so the correct (passing) status is reported.

The test is re-run because of the --rerun-fails=1 flag passed to gotestsum; since the rerun succeeds, gotestsum exits with code 0 and the workflow step passes, even though the first attempt failed.

This PR showcases it: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34729

Also, one additional (unrelated) issue I spotted: in certain cases unit tests are run as part of the integration-test run (and maybe vice versa).

Note that the internal/filter tests in receiver-1 are run in the integration-tests job and in the unit-tests job as well. This is because the tests are not tagged correctly; a sketch of the tagging mechanism follows below.
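For reference, a rough sketch of the build constraint that keeps a test file out of the plain unit-test invocation (package and test names are made up; it assumes the -tags=integration convention from the gotestsum command above):

//go:build integration

// This file is only compiled when -tags=integration is passed, so its tests
// run in the integration-test job but not in the unit-test job.
package filter

import "testing"

func TestFilterIntegration(t *testing.T) {
	// integration-only assertions would go here
}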

hughesjj commented 2 months ago

No worries, I jinxed myself by saying I'd have time to do something 😅

Just sharing info from my side:

You're 100% correct. It's interesting how consistently this fails (at least for the jmx cases), and it's super annoying when using GoLand for integration tests -- they fail every time with the "default" run configuration, so working with breakpoints gets annoying.

For jmx in particular, I was also getting some complaints about a Ryuk container not terminating in GoLand.


I'll rename this issue to reflect the current understanding of things and see about removing the p1 label.

odubajDT commented 2 months ago

> No worries, I jinxed myself by saying I'd have time to do something 😅
>
> Just sharing info from my side:
>
> You're 100% correct. It's interesting how consistently this fails (at least for the jmx cases), and it's super annoying when using GoLand for integration tests -- they fail every time with the "default" run configuration, so working with breakpoints gets annoying.
>
> For jmx in particular, I was also getting some complaints about a Ryuk container not terminating in GoLand.
>
> I'll rename this issue to reflect the current understanding of things and see about removing the p1 label.

Thanks! The PR should be ready for review if you have time.