omar711 opened this issue 1 year ago
One more here: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163
This time it's not add service, but TestStartosisReplaceTestSuite:
--- FAIL: TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference
also looks random, and possibly something being torn down prematurely
suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference
looks like a plain bug: our code being wrong, nothing to do with CI.
I really, really, really think we should prioritize digging out the cause and fixing these... flaky tests erode trust in the CI across the entire team, and with merge queues, flaky tests are only going to exacerbate the problem. Our CI test suite didn't use to be flaky, and in the past ~6-9 months it's gotten increasingly so.
Cool, I found the issue in the tests themselves (excuse my Go learning adventure):
func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
    ctx := context.Background()
    runResult, _ := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)
    t := suite.T()
    require.Nil(t, runResult.InterpretationError)
    require.Empty(t, runResult.ValidationErrors)
    require.Nil(t, runResult.ExecutionError)
    expectedResult := "Replace with module in directory sample package loaded.\nVerification succeeded. Value is '\"another-dependency-loaded-from-internal-module-in-main-branch\"'.\n"
    require.Regexp(t, expectedResult, string(runResult.RunOutput))
}
The error returned by suite.RunPackageWithParams(...) is ignored, which causes the nil pointer dereference at the first require when runResult is nil. I tested this locally by overriding its return values, e.g.:
runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)
runResult = nil
err = fmt.Errorf("oh dear")
...
Then adding a require.NoError(t, err) after that call reveals the actual error that's occurring. I'll try to find other instances of this and apply the same fix in a PR.
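For reference, a minimal sketch of what the fixed test could look like. This is my reading of the fix described above, not necessarily the exact code that lands in the PR; the extra require.NotNil is just a belt-and-braces guard I've added here:

func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
    ctx := context.Background()
    runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)
    t := suite.T()
    // Check the error first: if the run failed, runResult may be nil and any
    // dereference below would panic instead of reporting the real failure.
    require.NoError(t, err)
    require.NotNil(t, runResult)
    require.Nil(t, runResult.InterpretationError)
    require.Empty(t, runResult.ValidationErrors)
    require.Nil(t, runResult.ExecutionError)
    expectedResult := "Replace with module in directory sample package loaded.\nVerification succeeded. Value is '\"another-dependency-loaded-from-internal-module-in-main-branch\"'.\n"
    require.Regexp(t, expectedResult, string(runResult.RunOutput))
}

With that in place, a failing run reports the underlying error from RunPackageWithParams rather than a nil pointer panic.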
That will only show us the real errors, though, and it only affects 1 of the 3 cases above from what I can see. There's still something weird in the test infra / CI (long-running tests, maybe things timing out, etc.).
https://github.com/kurtosis-tech/kurtosis/pull/1559 contains the fix for this one, plus similar tests
One more random failure (occurred locally with ./internal_testsuites/scripts/test.sh; re-running fixed it):
--- FAIL: TestIdentifiers (1.33s)
enclave_and_service_identifiers_test.go:56:
Error Trace: /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/testsuite/enclave_and_service_identifiers_test/enclave_and_service_identifiers_test.go:56
Error: Received unexpected error:
An error occurred adding the datastore service
--- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:220 (AddDatastoreService) ---
Caused by: An error has occurred when running Starlark to add service
--- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:162 (AddService) ---
Caused by: Error requesting Starlark Script run
--- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:119 (EnclaveContext.RunStarlarkScriptBlocking) ---
Caused by: Unexpected error happened executing Kurtosis script.
--- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:104 (EnclaveContext.RunStarlarkScript) ---
Caused by: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: http2: frame too large"
Test: TestIdentifiers
Messages: An error occurred adding the datastore service to the enclave
FAIL
FAIL github.com/kurtosis-tech/kurtosis-cli/golang_internal_testsuite/testsuite/enclave_and_service_identifiers_test 1.491s
In relation to the panic error in https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163, I've just noticed this in the gateway log:
INFO[2023-10-13T15:43:31Z] Forwarding requests to local port number '39373' to remote port with id 'grpc'
INFO[2023-10-13T15:43:31Z] Closing connection to pod: https://0.0.0.0:40395/api/v1/namespaces/kt-idle-enclave-3eed8f5bc86342ee9def0471466a9c3a/pods/kurtosis-api/portforward
WARN[2023-10-13T15:43:31Z] Expected to run api container gateway until stopped, but the server exited prematurely with a non-nil error: 'Expected to run API container gateway server until stopped, but the server exited with a non-nil error
--- at /root/project/cli/cli/kurtosis_gateway/run/api_container_gateway/run_until_stopped.go:48 (RunApiContainerGatewayUntilStopped) ---
Caused by: An error occurred creating the listener on tcp/39373
--- at /go/pkg/mod/github.com/kurtosis-tech/minimal-grpc-server/golang@v0.0.0-20230710164206-90b674acb269/server/server.go:111 (MinimalGRPCServer.RunUntilStopped) ---
Caused by: listen tcp :39373: bind: address already in use'
E1013 15:43:31.228729 7415 portforward.go:419] error closing listener: close tcp4 127.0.0.1:39373: use of closed network connection
ERRO[2023-10-13T15:43:36Z] Expected Gateway to be reachable, instead an error was returned:
Expected to be to call `GetServices` and wait for server to be ready, instead a non-nil error was returned
--- at /root/project/cli/cli/kurtosis_gateway/server/engine_gateway/engine_gateway_service_server.go:323 (waitForGatewayReady) ---
Caused by: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:39373: connect: connection refused"
INFO[2023-10-13T15:43:36Z] Forwarding requests to local port number '35083' to remote port with id 'grpc'
which happens exactly when the test itself fails:
time="2023-10-13T15:43:36Z" level=info msg="Uploading and executing package 'github.com/kurtosis-tech/sample-startosis-load/replace-subpackage'"
--- FAIL: TestStartosisReplaceTestSuite (86.00s)
--- FAIL: TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference
So I'm pretty certain the true cause is something funky in the test infra: timing, things closing or timing out, etc.
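To make the failure mode in that log concrete, here's a tiny standalone sketch using plain net from the Go standard library (not Kurtosis code): a second listener racing for the same fixed local port fails with exactly the "bind: address already in use" error the gateway hit, whereas asking the OS for port 0 always yields a free port. I'm not claiming port 0 is the right fix for the gateway, only illustrating the collision:

package main

import (
    "fmt"
    "net"
)

func main() {
    // First listener grabs a fixed port, like the gateway's forwarded local port 39373.
    first, err := net.Listen("tcp", "127.0.0.1:39373")
    if err != nil {
        fmt.Println("first listen failed:", err)
        return
    }
    defer first.Close()

    // A second listener on the same port reproduces the gateway error:
    // "listen tcp :39373: bind: address already in use"
    if _, err := net.Listen("tcp", "127.0.0.1:39373"); err != nil {
        fmt.Println("second listen failed:", err)
    }

    // Port 0 asks the OS for any free port, which sidesteps the collision.
    free, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        fmt.Println("ephemeral listen failed:", err)
        return
    }
    defer free.Close()
    fmt.Println("OS-assigned port:", free.Addr().(*net.TCPAddr).Port)
}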
@omar711 Are you still investigating/working on that?
Hi @laurentluce, not actively. I dig in when I see something I can debug, then leave it be. E.g. https://github.com/kurtosis-tech/kurtosis/pull/1559 fixed a bug that could hide the underlying failures. I don't think I've gotten near the root of it, though.
What's your CLI version?
0.84.3
Description & steps to reproduce
I've seen the TestStartosisAddServiceTestSuite test fail twice in CI checks on seemingly unrelated PRs. The failures can be fixed by re-running, after which they usually pass. Examples: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7789/workflows/26e1b319-a461-455a-b790-a27fe7bcd6ae/jobs/105839 and https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7775/workflows/b40031d3-47fd-43b9-8265-94fe5d59ea5e/jobs/105604
Interesting notes:
Which suggests there's something awkward in this test that makes it a little brittle.
Desired behavior
Tests + CI run reliably
What is the severity of this bug?
Papercut; this bug is frustrating, but I have a workaround.
What area of the product does this pertain to?
Other: anything not covered by the above