kubernetes / dns

Kubernetes DNS service

k8s-dns e2e test suite failing with exit status 1 at HEAD #646

Open DamianSawicki opened 1 week ago

DamianSawicki commented 1 week ago

pull-kubernetes-dns-test fails at HEAD (verified for the no-op PR https://github.com/kubernetes/dns/pull/645) as below:

...
2024/10/06 16:17:58 test | 2024/10/06 16:17:53 sidecar started
2024/10/06 16:17:58 test | 2024/10/06 16:17:53 running `dig`
2024/10/06 16:17:58 test | 2024/10/06 16:17:53 Waiting for hits to be reported to be greater than 100
2024/10/06 16:17:58 test | 
2024/10/06 16:17:58 All tests passed
2024/10/06 16:17:58 docker [rmi -f k8s-dns-sidecar-e2e-test]
Running Suite: k8s-dns e2e test suite
=====================================
Random Seed: 1728231478
Will run 5 of 5 specs
2024/10/06 16:18:20 exit status 1
Ginkgo ran 1 suite in 21.764852525s
Test Suite Failed

This most probably blocks the vulnerability-fix PR https://github.com/kubernetes/dns/pull/638, which has been open since July and whose tests fail in the same way.

For the last merged PR https://github.com/kubernetes/dns/pull/635 the test pull-kubernetes-dns-test passed, so apparently the tests or test infra must have changed in the meantime. For https://github.com/kubernetes/dns/pull/638, the test failed identically on July 23rd, July 29th, and September 14th, so the issue seems to predate the August 2024 Prow migration.

DamianSawicki commented 1 week ago

I think the failing test is defined in test/e2e/e2e_test.go in the present repo. That file has not been modified since https://github.com/kubernetes/dns/pull/635, so this looks more like an infra problem.

When I tried to run the test locally, I got the message "2024/10/06 21:08:39 e2e test requires `sudo` to be active. Run `sudo -v` before running the e2e test." So perhaps it is a matter of permissions?
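For reference, here is a minimal Go sketch of how such a precondition could be checked. It is not the repo's actual check: the helper name commandSucceeds is hypothetical, and the sketch assumes that `sudo -n true` (non-interactive mode) exits non-zero when sudo would have to prompt for a password.

```go
package main

import (
	"fmt"
	"os/exec"
)

// commandSucceeds runs the given command and reports whether it exited
// with status 0. (Hypothetical helper, not taken from the dns repo.)
func commandSucceeds(name string, args ...string) bool {
	return exec.Command(name, args...).Run() == nil
}

func main() {
	// Assumption: `sudo -n true` fails instead of prompting when no
	// cached credentials are available.
	if !commandSucceeds("sudo", "-n", "true") {
		fmt.Println("sudo is not active; run `sudo -v` first")
		return
	}
	fmt.Println("sudo is active")
}
```

If the CI pod runs as root, a check of this kind would normally pass, so it may explain the local failure without explaining the failure in Prow.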

Also, in artifacts of the failed run, in the file podinfo.json, I've found the following:

                {
                    "name": "test",
                    "state": {
                        "terminated": {
                            "exitCode": 1,
                            "reason": "Error",
                            "message": " test | \n2024/10/06 16:17:58 All tests passed\n2024/10/06 16:17:58 docker [rmi -f k8s-dns-sidecar-e2e-test]\nRunning Suite: k8s-dns e2e test suite\n=====================================\nRandom Seed: \u001b[1m1728231478\u001b[0m\nWill run \u001b[1m5\u001b[0m of \u001b[1m5\u001b[0m specs\n\n2024/10/06 16:18:20 exit status 1\n\nGinkgo ran 1 suite in 21.764852525s\nTest Suite Failed\n\n\u001b[38;5;228mGinkgo 2.0 is coming soon!\u001b[0m\n\u001b[38;5;228m==========================\u001b[0m\n\u001b[1m\u001b[38;5;10mGinkgo 2.0\u001b[0m is under active development and will introduce several new features, improvements, and a small handful of breaking changes.\nA release candidate for 2.0 is now available and 2.0 should GA in Fall 2021.  \u001b[1mPlease give the RC a try and send us feedback!\u001b[0m\n  - To learn more, view the migration guide at \u001b[38;5;14m\u001b[4mhttps://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md\u001b[0m\n  - For instructions on using the Release Candidate visit \u001b[38;5;14m\u001b[4mhttps://github.com/onsi/ginkgo/blob/ver2/docs/MIGRATING_TO_V2.md#using-the-beta\u001b[0m\n  - To comment, chime in at \u001b[38;5;14m\u001b[4mhttps://github.com/onsi/ginkgo/issues/711\u001b[0m\n\nTo \u001b[1m\u001b[38;5;204msilence this notice\u001b[0m, set the environment variable: \u001b[1mACK_GINKGO_RC=true\u001b[0m\nAlternatively you can: \u001b[1mtouch $HOME/.ack-ginkgo-rc\u001b[0m\n+ EXIT_VALUE=1\n+ set +o xtrace\nCleaning up after docker in docker.\n================================================================================\nWaiting 30 seconds for pods stopped with terminationGracePeriod:30\nCleaning up after docker\nWaiting for docker to stop for 30 seconds\nStopping Docker: dockerProgram process in pidfile '/var/run/docker-ssd.pid', 1 process(es), refused to die.\n================================================================================\nDone cleaning up after docker in docker.\n{\"component\":\"entrypoint\",\"error\":\"wrapped process failed: exit status 1\",\"file\":\"sigs.k8s.io/prow/pkg/entrypoint/run.go:84\",\"func\":\"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun\",\"level\":\"error\",\"msg\":\"Error executing test process\",\"severity\":\"error\",\"time\":\"2024-10-06T16:19:10Z\"}\n",
                            "startedAt": "2024-10-06T15:55:53Z",
                            "finishedAt": "2024-10-06T16:19:10Z",
                            "containerID": "containerd://302c6068cdfb4c64dd8aafb8b56a4f61083e252a3c594e89249c2a568e443000"
                        }
                    },
                    "lastState": {},
                    "ready": false,
                    "restartCount": 0,
                    "image": "gcr.io/k8s-staging-test-infra/kubekins-e2e:v20240923-c8645c1a17-master",
                    "imageID": "gcr.io/k8s-staging-test-infra/kubekins-e2e@sha256:c5cf57a29e78a568ecf90a3b5b4df6b2afd5245c97edda91759e3e07f2330ba7",
                    "containerID": "containerd://302c6068cdfb4c64dd8aafb8b56a4f61083e252a3c594e89249c2a568e443000",
                    "started": false
                }

which mentions kubekins-e2e, which seems to be deprecated.

DamianSawicki commented 1 week ago

Hey @BenTheElder, I found you among the owners of kubekins-e2e mentioned above. Would you be able to look at the comments above and possibly share some advice?

BenTheElder commented 1 week ago

I don't work in this repo, but kubekins-e2e is an image we use currently to run some CI in the kubernetes project. It has a grab bag of tools like docker. Any other usage is best-effort.

podinfo.json is the pod in which we executed the PR tests. for more see https://docs.prow.k8s.io/docs/jobs/ and https://github.com/kubernetes/test-infra (config/)

BenTheElder commented 1 week ago

unless this project opted into it, the pod most likely ran as root, but it's hard to know without tracing the job specifics, e.g. you may have scheduled the test into the cluster under test (Which is NOT the cluster we use to run CI, that just executes the CI workloads, which then create disposable test clusters)

> seems to predate the August 2024 Prow migration.

that migration was for the control plane. migrating the workloads was done prior to this, and varies by workload.

you can find this job's definition in the test-infra repo and see the git history there.

we're currently approaching KEP Freeze, and I will be out for a few days after that, so time is tight this week 😅

DamianSawicki commented 1 week ago

Ben, thank you very much for your responses!

@VikashLNU @zhangguanzhang You can have a look at the comments above to try to unblock the PR https://github.com/kubernetes/dns/pull/638 you're interested in.

zhangguanzhang commented 1 week ago

> Ben, thank you very much for your responses!
>
> @VikashLNU @zhangguanzhang You can have a look at the comments above to try to unblock the PR #638 you're interested in.

I don't see how to resolve the issue, but once someone fixes the CI build problem, I can rebase my code onto the master branch and push it.