kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

[Flaky test] e2e tests occasionally fail when deleting kind cluster #2738

Open mimowo opened 3 months ago

mimowo commented 3 months ago

What happened:

The periodic e2e test failed on deleting the kind cluster: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160.

This can happen even when the entire test suite is green.

It looks like a rare flake (https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-multikueue-e2e-main).

What you expected to happen:

No random failures.

How to reproduce it (as minimally and precisely as possible):

Repeat the build; it happened on a periodic build.

Anything else we need to know?:

The logs from the failure:

Ginkgo ran 1 suite in 3m16.816345076s
Test Suite Passed
Switched to context "kind-kind-manager".
Exporting logs for cluster "kind-manager" to:
/logs/artifacts/run-test-multikueue-e2e-1.30.0
No resources found in default namespace.
Deleting cluster "kind-manager" ...
ERROR: failed to delete cluster "kind-manager": failed to delete nodes: command "docker rm -f -v kind-manager-control-plane" failed with error: exit status 1

Command Output: Error response from daemon: cannot remove container "/kind-manager-control-plane": could not kill: tried to kill container, but did not receive an exit event
make: *** [Makefile-test.mk:100: run-test-multikueue-e2e-1.30.0] Error 1
+ EXIT_VALUE=2
+ set +o xtrace
Cleaning up after docker in docker.
================================================================================
Waiting 30 seconds for pods stopped with terminationGracePeriod:30
Cleaning up after docker
ed68da3fb667
6beb571d417e
bf442dfcffc1
Waiting for docker to stop for 30 seconds
Stopping Docker: dockerProgram process in pidfile '/var/run/docker-ssd.pid', 1 process(es), refused to die.

It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.
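One way to buy the teardown more time without a fixed larger timeout would be a retry wrapper around the flaky removal command. This is only a sketch under the assumptions in this thread (the retry helper, attempt count, and delay are made up and are not part of Kueue's actual scripts or the test image):

```shell
# retry: run a command up to N times with a delay between attempts.
# Hypothetical helper; the command is passed as arguments so the logic
# can be exercised without a docker daemon.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0                       # success: stop retrying
    echo "attempt $i/$attempts failed: $*" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1                                 # all attempts failed
}

# Intended use against the command from the failing log:
#   retry 3 15 docker rm -f -v kind-manager-control-plane
```

Whether retries would actually help depends on why dockerd never delivered the exit event; if the daemon itself is wedged, a retry just fails three times instead of once.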

mimowo commented 3 months ago

/cc @mbobrovskyi @trasc

mimowo commented 3 months ago

/kind flake

trasc commented 3 months ago

/assign

trasc commented 3 months ago

It looks like we give 30s to tear down a kind cluster. I'm wondering if increasing the timeout to 45s could help.

That timeout is part of the test image and the failure at that point is already ignored.

The issue is indeed related to kind delete. However, since there is very little we can do about it, and it has nothing to do with the e2e suites themselves, we should just ignore it, as we do with other cleanup steps.

BenTheElder commented 3 months ago

This shouldn't be happening and as far as I know isn't in Kubernetes's e2e tests.

In the future when you see issues like this please go ahead and reach out to the kind project.

cc @aojea

I'm fairly occupied today but can probably dig into this by sometime Monday.

mimowo commented 3 months ago

This shouldn't be happening and as far as I know isn't in Kubernetes's e2e tests.

Right, I've never seen this in core k8s, and this is the first time I've seen it in Kueue too, so maybe it is some very rare one-off.

aojea commented 3 months ago

refused to die.

😮‍💨

https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kueue-test-multikueue-e2e-main/1818706058688860160/build-log.txt

ERROR: failed to delete cluster "kind-manager": failed to delete nodes: command "docker rm -f -v kind-manager-control-plane" failed with error: exit status 1 Command Output: Error response from daemon: cannot remove container "/kind-manager-control-plane": could not kill: tried to kill container, but did not receive an exit event

what are these e2e tests doing with the network @mimowo?

/home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:463
  STEP: wait for check active @ 07/31/24 18:01:33.731
  STEP: Disconnecting worker1 container from the kind network @ 07/31/24 18:01:34.06
  STEP: Waiting for the cluster to become inactive @ 07/31/24 18:01:34.54
  STEP: Reconnecting worker1 container to the kind network @ 07/31/24 18:02:19.212
  STEP: Waiting for the cluster do become active @ 07/31/24 18:02:49.147
• [77.390 seconds]
mimowo commented 3 months ago

refused to die.

😮‍💨

This is something that we observe on every build - also successful ones - but I see it also in JobSet e2e tests and core k8s e2e tests, example link.

what are these e2e doing with the network @mimowo ?

These are tests for MultiKueue. We run 3 Kind clusters (one manager and 2 workers). We disconnect the network between the manager and a worker using that command: docker network disconnect kind kind-worker1-control-plane. Later we re-connect the clusters.

This is done to simulate transient connectivity issues between the clusters.
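The disconnect/reconnect flow described above can be sketched as follows. The docker network disconnect command and the container/network names are taken from this thread; the wrapper functions and the DOCKER override (which lets the sketch run without a docker daemon) are hypothetical additions:

```shell
# Sketch of the MultiKueue connectivity-blip simulation. "kind" is the
# docker network kind creates; kind-worker1-control-plane is the worker's
# control-plane container, as quoted in the comment above.
: "${DOCKER:=docker}"   # overridable so the sketch can be dry-run
NET="kind"
NODE="kind-worker1-control-plane"

# Cut the worker off from the manager by detaching its node container.
partition_worker() {
  "$DOCKER" network disconnect "$NET" "$NODE"
}

# Restore connectivity by reattaching the container to the network.
reconnect_worker() {
  "$DOCKER" network connect "$NET" "$NODE"
}
```

Forcibly detaching and reattaching a running container is the kind of network-state churn that could plausibly interact badly with the later docker rm -f, which is presumably why the question was asked.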

aojea commented 3 months ago

seems the other bug reported https://github.com/kubernetes/kubernetes/issues/123313 with the same symptom was fixed by https://github.com/kubernetes/test-infra/pull/32245

BenTheElder commented 3 months ago

This is something that we observe on every build - also successful ones - but I see it also in JobSet e2e tests and core k8s e2e tests, example link.

Yeah, that's different: the process cleanup of the docker daemon is less concerning when we're successfully deleting the node containers (which we did in that link, prior to the issue when turning down the docker daemon). I don't think that's related, but it should also be tracked (https://github.com/kubernetes/test-infra/issues/33227).

In that link we can see kind delete cluster successfully deleting the nodes without timeout issues.

so maybe this is some very rare one-off.

I think they may also run on different clusters (k8s-infra-prow-build vs the k8s infra EKS cluster) with different OS, machine type, etc. There may be some quirk of the environment that differs between them.

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale