clusterctl upgrade `Timed out waiting for all Machines to exist`

Which jobs are flaking?

periodic-cluster-api-e2e-main

Which tests are flaking?

When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers

specifically looking at this pattern:

36 failures: Timed out waiting for all Machines to exist
- Component: unknown
- Branches:
- main
- release-1.8
- release-1.7

Since when has it been flaking?

for quite some time

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

TLDR: seems like there is an issue in the Docker controller in creating the worker machines with many logs like: dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" The failures follow different patterns

I think increasing clusterctl-upgrade/wait-worker-nodes e2e interval is a good first step as some of the examples below show the retry feedback in creating the DockerMachines appearing to resolve but the tests time out

the errors don't provide much info, but since the issue seems to stem from the docker container runtime RunContainer(https://github.com/kubernetes-sigs/cluster-api/blob/879617dcc25735ef734d33adad9618707d43a95b/test/infrastructure/docker/internal/docker/manager.go#L182) call, we could explore passing in an output Writer here as the third parameter to get output directly from the container?

Example failure 1 (expected 2, found 1)

(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers) prow link

capd controller logs show:

I0831 03:46:04.834929 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-q53g9g" "docker-cluster"="clusterctl-upgrade-workload-q53g9g" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-q53g9g-md-0-lfnm6"} "machine"="clusterctl-upgrade-workload-q53g9g-md-0-5fb876f4bc-cq8k6"

in this case the CP machine is fully initialized, but the MD worker machine status shows:

status:
  conditions:
  - lastTransitionTime: "2024-08-31T03:46:07Z"
    message: 0 of 2 completed
    reason: ContainerProvisioningFailed
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-08-31T03:46:07Z"
    message: Re-provisioning
    reason: ContainerProvisioningFailed
    severity: Warning
    status: "False"
    type: ContainerProvisioned

Example failure 2 (expected 2, found 1)

(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers) prow link

capd controller logs show:

I0831 04:16:20.583537 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-md-0-sr89b"} "machine"="clusterctl-upgrade-workload-merjuc-md-0-856f7cbb7c-fjfdl"

in this failed test, we also see several failed creations of the control plane machine: I0831 04:12:46.050065 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"

most have the same error, the other CP machine creation failure looks like this: I0831 04:13:25.895170 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: command \"docker run --detach --tty --privileged --security-opt seccomp=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro --hostname clusterctl-upgrade-workload-merjuc-control-plane-kw497 --network kind --name clusterctl-upgrade-workload-merjuc-control-plane-kw497 --label io.x-k8s.kind.cluster=clusterctl-upgrade-workload-merjuc --label io.x-k8s.kind.role=control-plane --expose 44139 --volume=/var/run/docker.sock:/var/run/docker.sock --publish=127.0.0.1:44139:6443/TCP kindest/node:v1.22.17@sha256:9af784f45a584f6b28bce2af84c494d947a05bd709151466489008f80a9ce9d5\" failed with error: exit status 125, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"

these appear to resolve and the provider ID gets set on the CP node: I0831 04:15:55.173110 1 machine.go:337] controllers/DockerMachine/DockerMachine-controller "msg"="Setting Kubernetes node providerID" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"

in this case, the MD DockerMachine is still Waiting for BootstrapData

the CP DockerMachine only transitioned to Ready only 12 seconds before the test timed out:

  conditions:
  - lastTransitionTime: "2024-08-31T04:15:56Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-31T04:15:56Z"
    status: "True"
    type: BootstrapExecSucceeded

Example failure 3 (expected 2, found 0)

[It] When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers: prow link

in this case, it seems that MD is waiting for the control plane (https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/resources/clusterctl-upgrade/DockerMachine/clusterctl-upgrade-workload-66mpos-md-0-ltqf6.yaml):

status:
  conditions:
  - lastTransitionTime: "2024-08-30T10:01:44Z"
    message: 0 of 2 completed
    reason: WaitingForControlPlaneAvailable
    severity: Info
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-08-30T10:01:44Z"
    reason: WaitingForControlPlaneAvailable
    severity: Info
    status: "False"
    type: ContainerProvisioned

looks like all the Failed to create worker DockerMachine errors in the CAPD controller logs in this case relate to the control plane: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/logs/capd-system/capd-controller-manager/capd-controller-manager-fb4b578f9-rzpq9/manager.log:

  I0830 10:04:39.240063       1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-66mpos" "docker-cluster"="clusterctl-upgrade-workload-66mpos" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-66mpos-control-plane-jbbxg"} "machine"="clusterctl-upgrade-workload-66mpos-control-plane-8kvwp"

it does seem that eventually the CP machine comes up, but it's still bootstrapping when the test times out at 10:06:52.444:
```
status:
conditions:
```
lastTransitionTime: "2024-08-30T10:06:52Z" message: 1 of 2 completed reason: Bootstrapping severity: Info status: "False" type: Ready
lastTransitionTime: "2024-08-30T10:06:52Z" reason: Bootstrapping severity: Info status: "False" type: BootstrapExecSucceeded
lastTransitionTime: "2024-08-30T10:06:52Z" status: "True" type: ContainerProvisioned

Anything else we need to know?

No response

Label(s) to be applied

/kind flake One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

kubernetes-sigs / cluster-api