mimowo opened 3 months ago
/cc @mbobrovskyi @trasc
/kind flake
/assign
It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.
That timeout is part of the test image and the failure at that point is already ignored.
The issue is indeed related to kind delete
however, since there is very little we can do about it and it has nothing to do with the e2e suites, we should just ignore it as we do with other cleanup steps.
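For context, a minimal sketch (not the actual test-image code) of what "ignore it as we do with other cleanup steps" can look like: bound `kind delete cluster` with a timeout and only log a failure instead of failing the run. The cluster name `kind-manager` and the 30s value come from this issue; the helper name is made up.

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

// deleteClusterBestEffort runs "kind delete cluster" with an upper bound on
// how long it may take and treats any failure as non-fatal.
func deleteClusterBestEffort(name string, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "kind", "delete", "cluster", "--name", name)
	if out, err := cmd.CombinedOutput(); err != nil {
		// Cleanup is best-effort: log and move on, do not fail the suite.
		log.Printf("ignoring failure to delete cluster %q: %v\n%s", name, err, out)
	}
}

func main() {
	deleteClusterBestEffort("kind-manager", 30*time.Second)
}
```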
This shouldn't be happening and as far as I know isn't in Kubernetes's e2e tests.
In the future when you see issues like this please go ahead and reach out to the kind project.
cc @aojea
I'm fairly occupied today but can probably dig into this by sometime Monday.
This shouldn't be happening and as far as I know isn't in Kubernetes's e2e tests.
Right, I've never seen this in core k8s, and this is the first time I see it in Kueue too, so maybe this is some very rare one-off.
refused to die.
😮💨
ERROR: failed to delete cluster "kind-manager": failed to delete nodes: command "docker rm -f -v kind-manager-control-plane" failed with error: exit status 1 Command Output: Error response from daemon: cannot remove container "/kind-manager-control-plane": could not kill: tried to kill container, but did not receive an exit event
what are these e2e doing with the network @mimowo ?
/home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:463
STEP: wait for check active @ 07/31/24 18:01:33.731
STEP: Disconnecting worker1 container from the kind network @ 07/31/24 18:01:34.06
STEP: Waiting for the cluster to become inactive @ 07/31/24 18:01:34.54
STEP: Reconnecting worker1 container to the kind network @ 07/31/24 18:02:19.212
STEP: Waiting for the cluster do become active @ 07/31/24 18:02:49.147
• [77.390 seconds]
refused to die.
😮💨
This is something that we observe on every build, including successful ones, but I also see it on JobSet e2e tests and core k8s e2e tests, example link.
what are these e2e doing with the network @mimowo ?
These are tests for MultiKueue. We run 3 kind clusters (one manager and 2 workers). We disconnect the network between the manager and a worker using this command: docker network disconnect kind kind-worker1-control-plane. Later we re-connect the clusters.
It is done to simulate transient connectivity issues between the clusters.
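A minimal sketch, not the actual Kueue e2e helper, of how a test can toggle connectivity this way by shelling out to docker. The container name `kind-worker1-control-plane` comes from the command quoted above; the fixed sleep only stands in for the "wait for the cluster to become inactive/active" steps shown in the log.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// setWorkerConnected attaches or detaches a kind node container to/from the
// "kind" docker network to simulate transient connectivity issues.
func setWorkerConnected(container string, connected bool) error {
	action := "disconnect"
	if connected {
		action = "connect"
	}
	out, err := exec.Command("docker", "network", action, "kind", container).CombinedOutput()
	if err != nil {
		return fmt.Errorf("docker network %s failed: %v: %s", action, err, out)
	}
	return nil
}

func main() {
	const worker = "kind-worker1-control-plane"

	// Disconnect the worker from the kind network to simulate an outage.
	if err := setWorkerConnected(worker, false); err != nil {
		panic(err)
	}

	// In the real test this is where we wait for the MultiKueue cluster to
	// be reported as inactive; a fixed sleep stands in for that here.
	time.Sleep(45 * time.Second)

	// Reconnect and let the cluster become active again.
	if err := setWorkerConnected(worker, true); err != nil {
		panic(err)
	}
}
```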
It seems the other bug reported in https://github.com/kubernetes/kubernetes/issues/123313 with the same symptom was fixed by https://github.com/kubernetes/test-infra/pull/32245
This is something that we observe on every build, including successful ones, but I also see it on JobSet e2e tests and core k8s e2e tests, example link.
Yeah, that's different. The process cleanup of the docker daemon is less concerning when we're successfully deleting the node containers (which we did in that link, prior to the issue turning down the docker daemon). I don't think that's related, but it should also be tracked (https://github.com/kubernetes/test-infra/issues/33227).
In that link we can see kind delete cluster successfully deleting the nodes without timeout issues.
so maybe this is some very rare one-off.
I think they may also run on different clusters (k8s-infra-prow-build vs the k8s infra EKS cluster) with different OS, machine type, etc. There may be some quirk of the environment that differs between them.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What happened:
The periodic e2e test failed on deleting the kind cluster: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160.
This may happen even when the entire test suite is green.
It looks like a rare flake (https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-multikueue-e2e-main):
What you expected to happen:
No random failures.
How to reproduce it (as minimally and precisely as possible):
Repeat the build; it happened on a periodic build.
Anything else we need to know?:
The logs from the failure:
It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.