cahillsf opened this issue 2 hours ago
This issue is currently awaiting triage.
If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Which jobs are flaking?
periodic-cluster-api-e2e-main
Which tests are flaking?
When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers
specifically looking at this pattern:
36 failures:
Timed out waiting for all Machines to exist
Since when has it been flaking?
for quite some time
Testgrid link
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main
Reason for failure (if possible)
TL;DR: there seems to be an issue in the Docker controller when creating the worker machines, with many logs like:
dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state"
The failures follow different patterns:

- I think increasing the clusterctl-upgrade/wait-worker-nodes e2e interval is a good first step, as some of the examples below show the retries in creating the DockerMachines appearing to resolve, but the tests time out first.
- The errors don't provide much info, but since the issue seems to stem from the docker container runtime RunContainer call (https://github.com/kubernetes-sigs/cluster-api/blob/879617dcc25735ef734d33adad9618707d43a95b/test/infrastructure/docker/internal/docker/manager.go#L182), we could explore passing in an output Writer as the third parameter to get output directly from the container.

Example failure 1 (expected 2, found 1)
(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers): prow link

The capd controller logs show:
I0831 03:46:04.834929 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-q53g9g" "docker-cluster"="clusterctl-upgrade-workload-q53g9g" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-q53g9g-md-0-lfnm6"} "machine"="clusterctl-upgrade-workload-q53g9g-md-0-5fb876f4bc-cq8k6"
in this case the CP machine is fully initialized, but the MD worker machine status shows:
Example failure 2 (expected 2, found 1)
(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers): prow link

The capd controller logs show:
I0831 04:16:20.583537 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-md-0-sr89b"} "machine"="clusterctl-upgrade-workload-merjuc-md-0-856f7cbb7c-fjfdl"
in this failed test, we also see several failed creations of the control plane machine:
I0831 04:12:46.050065 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"
Most have the same error; the other CP machine creation failure looks like this:
I0831 04:13:25.895170 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: command \"docker run --detach --tty --privileged --security-opt seccomp=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro --hostname clusterctl-upgrade-workload-merjuc-control-plane-kw497 --network kind --name clusterctl-upgrade-workload-merjuc-control-plane-kw497 --label io.x-k8s.kind.cluster=clusterctl-upgrade-workload-merjuc --label io.x-k8s.kind.role=control-plane --expose 44139 --volume=/var/run/docker.sock:/var/run/docker.sock --publish=127.0.0.1:44139:6443/TCP kindest/node:v1.22.17@sha256:9af784f45a584f6b28bce2af84c494d947a05bd709151466489008f80a9ce9d5\" failed with error: exit status 125, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"
these appear to resolve and the provider ID gets set on the CP node:
I0831 04:15:55.173110 1 machine.go:337] controllers/DockerMachine/DockerMachine-controller "msg"="Setting Kubernetes node providerID" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"
in this case, the MD DockerMachine is still Waiting for BootstrapData
the CP DockerMachine transitioned to Ready only 12 seconds before the test timed out:

Example failure 3 (expected 2, found 0)
[It] When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers: prow link

in this case, it seems that the MD is waiting for the control plane (https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/resources/clusterctl-upgrade/DockerMachine/clusterctl-upgrade-workload-66mpos-md-0-ltqf6.yaml):
looks like all the Failed to create worker DockerMachine errors in the CAPD controller logs in this case relate to the control plane: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/logs/capd-system/capd-controller-manager/capd-controller-manager-fb4b578f9-rzpq9/manager.log:10:06:52.444

Anything else we need to know?
No response
Label(s) to be applied
/kind flake

One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.