kurtosis-tech / kurtosis

A platform for packaging and launching ephemeral backend stacks with a focus on approachability for the average developer.
https://docs.kurtosistech.com/
Apache License 2.0

Intermittent test failures involving TestStartosisAddServiceTestSuite #1538

Open omar711 opened 11 months ago

omar711 commented 11 months ago

What's your CLI version?

0.84.3

Description & steps to reproduce

I've seen the TestStartosisAddServiceTestSuite test fail twice in CI checks on seemingly unrelated PRs. Re-running the failed jobs usually makes them pass.

Examples: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7789/workflows/26e1b319-a461-455a-b790-a27fe7bcd6ae/jobs/105839 and https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7775/workflows/b40031d3-47fd-43b9-8265-94fe5d59ea5e/jobs/105604

Interesting notes:

This suggests there's something awkward in this test that makes it a little brittle.

Desired behavior

Tests + CI run reliably

What is the severity of this bug?

Papercut; this bug is frustrating, but I have a workaround.

What area of the product does this pertain to?

Other: anything not covered by the above

omar711 commented 11 months ago

One more here: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163

This time it wasn't add service, but TestStartosisReplaceTestSuite:

TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

This also looks random, and possibly points to something being torn down prematurely.

mieubrisse commented 11 months ago
        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

looks like a plain bug: our code being wrong, nothing to do with CI.

I really, really, really think we should prioritize digging out the cause and fixing these... flaky tests erode trust in CI across the entire team, and with merge queues flaky tests are only going to exacerbate the problem. Our CI test suite didn't use to be flaky, and in the past ~6-9 months it's gotten increasingly so.

omar711 commented 11 months ago

Cool, I found the issue in the tests themselves (excuse my Go learning adventure):

func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
    ctx := context.Background()
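    // NOTE: the error returned by RunPackageWithParams is discarded here (the blank identifier).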
    runResult, _ := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)

    t := suite.T()

    require.Nil(t, runResult.InterpretationError)
    require.Empty(t, runResult.ValidationErrors)
    require.Nil(t, runResult.ExecutionError)
    expectedResult := "Replace with module in directory sample package loaded.\nVerification succeeded. Value is '\"another-dependency-loaded-from-internal-module-in-main-branch\"'.\n"
    require.Regexp(t, expectedResult, string(runResult.RunOutput))
}

The error returned at suite.RunPackageWithParams... is ignored, which causes the nil pointer panic at the first require when runResult is nil. I tested this locally by overriding its return values, e.g.:

    runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)
    runResult = nil
    err = fmt.Errorf("oh dear")
...

Adding a require.NoError(t, err) after that call then reveals the actual error that's occurring. I'll try to find other instances of this and apply the same fix in a PR.
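For illustration, here is a minimal sketch of what the reordered assertions could look like. It is based on the test quoted above, not the exact contents of the PR linked below:

func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
    ctx := context.Background()
    // Keep the error instead of discarding it with the blank identifier.
    runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)

    t := suite.T()

    // Fail fast with the real error rather than panicking on a nil runResult below.
    require.NoError(t, err)
    require.NotNil(t, runResult)

    require.Nil(t, runResult.InterpretationError)
    require.Empty(t, runResult.ValidationErrors)
    require.Nil(t, runResult.ExecutionError)
}

With that in place, a failing run reports the underlying error instead of panicking.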

That will only show us the real errors, and it only affects 1 of the 3 cases above, from what I can see. There's still something weird in the test infra / CI (long-running tests, maybe things timing out, etc.).

https://github.com/kurtosis-tech/kurtosis/pull/1559 contains the fix for this one, plus similar tests

omar711 commented 11 months ago

One more random failure (this occurred locally with ./internal_testsuites/scripts/test.sh; re-running fixed it):

--- FAIL: TestIdentifiers (1.33s)
    enclave_and_service_identifiers_test.go:56:
            Error Trace:    /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/testsuite/enclave_and_service_identifiers_test/enclave_and_service_identifiers_test.go:56
            Error:          Received unexpected error:
                            An error occurred adding the datastore service
                             --- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:220 (AddDatastoreService) ---
                            Caused by: An error has occurred when running Starlark to add service
                             --- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:162 (AddService) ---
                            Caused by: Error requesting Starlark Script run
                             --- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:119 (EnclaveContext.RunStarlarkScriptBlocking) ---
                            Caused by: Unexpected error happened executing Kurtosis script.
                             --- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:104 (EnclaveContext.RunStarlarkScript) ---
                            Caused by: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: http2: frame too large"
            Test:           TestIdentifiers
            Messages:       An error occurred adding the datastore service to the enclave
FAIL
FAIL    github.com/kurtosis-tech/kurtosis-cli/golang_internal_testsuite/testsuite/enclave_and_service_identifiers_test  1.491s

omar711 commented 11 months ago

In relation to the panic error https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163, I've just noticed this in the gateway log:

INFO[2023-10-13T15:43:31Z] Forwarding requests to local port number '39373' to remote port with id 'grpc' 
INFO[2023-10-13T15:43:31Z] Closing connection to pod: https://0.0.0.0:40395/api/v1/namespaces/kt-idle-enclave-3eed8f5bc86342ee9def0471466a9c3a/pods/kurtosis-api/portforward 
WARN[2023-10-13T15:43:31Z] Expected to run api container gateway until stopped, but the server exited prematurely with a non-nil error: 'Expected to run API container gateway server until stopped, but the server exited with a non-nil error
 --- at /root/project/cli/cli/kurtosis_gateway/run/api_container_gateway/run_until_stopped.go:48 (RunApiContainerGatewayUntilStopped) ---
Caused by: An error occurred creating the listener on tcp/39373
 --- at /go/pkg/mod/github.com/kurtosis-tech/minimal-grpc-server/golang@v0.0.0-20230710164206-90b674acb269/server/server.go:111 (MinimalGRPCServer.RunUntilStopped) ---
Caused by: listen tcp :39373: bind: address already in use' 
E1013 15:43:31.228729    7415 portforward.go:419] error closing listener: close tcp4 127.0.0.1:39373: use of closed network connection
ERRO[2023-10-13T15:43:36Z] Expected Gateway to be reachable, instead an error was returned:
Expected to be to call `GetServices` and wait for server to be ready, instead a non-nil error was returned
 --- at /root/project/cli/cli/kurtosis_gateway/server/engine_gateway/engine_gateway_service_server.go:323 (waitForGatewayReady) ---
Caused by: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:39373: connect: connection refused" 
INFO[2023-10-13T15:43:36Z] Forwarding requests to local port number '35083' to remote port with id 'grpc' 

which happens exactly when the test itself fails:

time="2023-10-13T15:43:36Z" level=info msg="Uploading and executing package 'github.com/kurtosis-tech/sample-startosis-load/replace-subpackage'"
--- FAIL: TestStartosisReplaceTestSuite (86.00s)
    --- FAIL: TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

So I'm pretty certain the true cause is something funky in the test infra: timing, things closing or timing out prematurely, etc.
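As a standalone illustration (not Kurtosis code) of the bind failure in that log, a second listener on a port that is already bound fails with exactly the same error; port 39373 is just the one from the log above:

package main

import (
    "fmt"
    "net"
)

func main() {
    // The first listener grabs the port, like the port-forward that was already running.
    first, err := net.Listen("tcp", "127.0.0.1:39373")
    if err != nil {
        fmt.Println("first listen failed:", err)
        return
    }
    defer first.Close()

    // A second listener on the same port fails with "bind: address already in use",
    // the same error the gateway logged just before the test panicked.
    if _, err := net.Listen("tcp", "127.0.0.1:39373"); err != nil {
        fmt.Println("second listen failed:", err)
    }
}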

laurentluce commented 11 months ago

@omar711 Are you still investigating/working on that?

omar711 commented 11 months ago

Hi @laurentluce, not actively. I dig in when I see something I can debug, then leave it be. For example, https://github.com/kurtosis-tech/kurtosis/pull/1559 fixed a bug that could hide underlying failures. I don't think I've gotten near the root of it, though.