kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

CAPD does not create HAProxy Loadbalancer container at the first cluster creation #7740

Closed criscola closed 1 year ago

criscola commented 1 year ago

What steps did you take and what happened: I create a CAPD cluster locally following the quickstart. Complete list of steps:

export CLUSTER_TOPOLOGY=true

cat > kind-cluster-with-extramounts.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
EOF

kind delete cluster
kind create cluster --config kind-cluster-with-extramounts.yaml
clusterctl init --infrastructure docker

clusterctl generate cluster capi-quickstart --flavor development \
  --kubernetes-version v1.25.3 \
  --control-plane-machine-count=3 \
  --worker-machine-count=3 \
  > capi-quickstart.yaml

kubectl apply -f capi-quickstart.yaml

Then I look at the CAPD controller's logs, e.g.

kubectl logs -n capd-system capd-controller-manager-55c6f7887d-pkb7w 

where I see the following error:

E1213 13:40:20.163324       1 controller.go:326] "Reconciler error" err="failed to get ip for the load balancer: load balancer IP cannot be empty: container capi-quickstart-lb does not have an associated IP address" controller="dockercluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerCluster" DockerCluster="default/capi-quickstart-lmhsm" namespace="default" name="capi-quickstart-lmhsm" reconcileID=fb7ba986-d9c7-4597-8f8a-b7d989a01601

On further inspection with docker ps, the Docker container capi-quickstart-lb is, consistent with the error, not present.
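
A quick way to double-check this (container name taken from the error above) is to list containers including stopped ones:

# -a also shows exited containers, so an empty result means the lb was never created at all
docker ps -a --filter "name=capi-quickstart-lb"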

At this point I delete the cluster with kubectl delete cluster capi-quickstart and recreate it. Immediately after, the load balancer Docker container gets created and the control plane comes up initialized. Weird, right?

What did you expect to happen: I expect my cluster to not require recreation to become initialized.

Environment:

Server:
 Engine:
  Version:       20.10.21
  API version:   1.41 (minimum version 1.12)
  Go version:    go1.19.2
  Git commit:    3056208812
  Built:         Thu Oct 27 21:29:34 2022
  OS/Arch:       linux/amd64
  Experimental:  false
 containerd:
  Version:       v1.6.9
  GitCommit:     1c90a442489720eec95342e1789ee8a5e1b9536f.m
 runc:
  Version:       1.1.4
  GitCommit:
 docker-init:
  Version:       0.19.0
  GitCommit:     de40ad0


/kind bug
/area provider/docker
k8s-ci-robot commented 1 year ago

@criscola: This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
fabriziopandini commented 1 year ago

/triage needs-information

@criscola Are you using Docker installed with snap? We got reports in the past about strange behaviour when Docker is installed from snap.

Otherwise, we need some more info to investigate what's going on, because it happens neither in CI nor in our dev envs. Can we get the docker inspect of the container not reporting the IP? What we do when reading IPs is roughly equivalent to:

docker inspect your-container | jq '.[0].NetworkSettings.Networks["kind"] | { ipv4: .IPAddress, ipv6: .GlobalIPv6Address }'

Note: kind is the only network in our containers.

criscola commented 1 year ago

Hi Fabrizio, docker was installed using pacman.

To be precise, the lb container doesn't get brought up for an already-created cluster after booting my workstation, so I can't get any IP. Afterwards, if I recreate the cluster, the container is also not created until the second application of the cluster resources. After that it finally gets created and the CAPD cluster works:

{
  "ipv4": "172.18.0.2",
  "ipv6": "fc00:f853:ccd:e793::2"
}

In the meantime I tried getting some logs immediately after booting up my machine with journalctl -fu docker.service and I get the following:

Dec 20 13:10:38 lxray dockerd[1165]: time="2022-12-20T13:10:38.230504583+01:00" level=error msg="Error setting up exec command in container mycluster-dev-lb: Container 9427845ce3fded0d4e8115a678b2eeebb3feee3c86785bdeb107f3d2e6518788 is not running"

So it seems the container is, for some reason, unable to run. This is different from the containers for the CAPD nodes, which are brought up without issues.
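
For what it's worth, the container's state can be queried directly (name taken from the dockerd log above):

# prints the container status and exit code, e.g. "exited 1"
docker inspect -f '{{.State.Status}} {{.State.ExitCode}}' mycluster-dev-lb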

killianmuldoon commented 1 year ago

@criscola can you try running a haproxy container outside of CAPD? This is how I managed to debug similar issues in the past.

The command I was using was:

docker run -v $(pwd)/images/haproxy:/usr/local/etc/conf:ro  haproxy:2.6 -f /usr/local/etc/conf/haproxy.cfg  -d -V

Where $(pwd)/images/haproxy points to a directory containing a valid haproxy cfg file called haproxy.cfg. In my case there was a bug with Fedora where haproxy would start, use up all available resources, and then crash.

criscola commented 1 year ago

Do you have a working haproxy.cfg file around?

killianmuldoon commented 1 year ago

Apologies - you can use the cfg in the kind repo at https://github.com/kubernetes-sigs/kind/tree/main/images/haproxy.

criscola commented 1 year ago

Thanks, I was struggling to find one. Here's what I get (I ran it right after rebooting, i.e. at the point where the lb container should have been brought up):

[NOTICE]   (1) : haproxy version is 2.6.7-c55bfdb
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[WARNING]  (1) : config : missing timeouts for frontend 'controlPlane'.
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
Note: setting global.maxconn to 524241.
Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result FAILED
Total: 3 (2 usable), will use epoll.

Available filters :
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace
Using epoll() as the polling mechanism.
[NOTICE]   (1) : New worker (8) forked
[NOTICE]   (1) : Loading success.
Using epoll() as the polling mechanism.
Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace
killianmuldoon commented 1 year ago

That looks pretty okay / as expected to me - it's similar to what I see on my machine. Can you retrieve the logs or the failure reason for the haproxy container started by CAPD? docker inspect might give you some information on how and why the lb container failed.
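
Something along these lines should show both (replace <lb-container-name> with the name of your lb container):

# haproxy writes to stdout/stderr, so the container logs have its messages
docker logs <lb-container-name>
# exit code, OOM flag and start/finish timestamps from the container state
docker inspect -f '{{json .State}}' <lb-container-name> | jq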

criscola commented 1 year ago

Here it is:

[WARNING] 353/131805 (1) : config : missing timeouts for frontend 'controlPlane'.                                                                                                                                                                     
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[NOTICE] 353/131805 (1) : New worker #1 (8) forked
[WARNING] 353/131809 (1) : Reexecuting Master process
[NOTICE] 353/131809 (1) : haproxy version is 2.2.9-2~bpo10+1
[NOTICE] 353/131809 (1) : path to executable is /usr/sbin/haproxy
[ALERT] 353/131809 (1) : sendmsg()/writev() failed in logger #1: No such file or directory (errno=2)
[WARNING] 353/131809 (8) : Stopping frontend controlPlane in 0 ms.
[WARNING] 353/131809 (8) : Stopping backend kube-apiservers in 0 ms.
[WARNING] 353/131809 (8) : Stopping frontend GLOBAL in 0 ms.
[WARNING] 353/131809 (8) : Proxy controlPlane stopped (cumulated conns: FE: 6, BE: 0).
[WARNING] 353/131809 (8) : Proxy kube-apiservers stopped (cumulated conns: FE: 0, BE: 6).
[WARNING] 353/131809 (8) : Proxy GLOBAL stopped (cumulated conns: FE: 0, BE: 0).
[NOTICE] 353/131809 (1) : New worker #1 (45) forked
[WARNING] 353/131809 (1) : Former worker #1 (8) exited with code 0 (Exit)
[WARNING] 353/131809 (45) : Server kube-apiservers/ecoqube-dev-hmhsb-xnw4b is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 353/131809 (45) : backend 'kube-apiservers' has no server available!
[WARNING] 353/131828 (45) : Server kube-apiservers/ecoqube-dev-hmhsb-xnw4b is UP, reason: Layer7 check passed, code: 200, check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 353/132448 (1) : Exiting Master process...
[WARNING] 353/132448 (45) : Stopping frontend control-plane in 0 ms.
[WARNING] 353/132448 (45) : Stopping backend kube-apiservers in 0 ms.
[WARNING] 353/132448 (45) : Stopping frontend GLOBAL in 0 ms.
[WARNING] 353/132448 (45) : Proxy control-plane stopped (cumulated conns: FE: 207, BE: 0).
[WARNING] 353/132448 (45) : Proxy kube-apiservers stopped (cumulated conns: FE: 0, BE: 207).
[WARNING] 353/132448 (45) : Proxy GLOBAL stopped (cumulated conns: FE: 0, BE: 0).
[WARNING] 353/132449 (1) : Current worker #1 (45) exited with code 0 (Exit)
[WARNING] 353/132449 (1) : All workers exited. Exiting... (0)
criscola commented 1 year ago

Couldn't it retry instead of just exiting? Why does haproxy behave differently at boot time vs. afterwards? Here's the log for when it stays up:

[WARNING] 353/134529 (1) : config : missing timeouts for frontend 'controlPlane'.                                                                                                                                                                     
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[NOTICE] 353/134529 (1) : New worker #1 (8) forked
[WARNING] 353/134531 (1) : Reexecuting Master process
[NOTICE] 353/134531 (1) : haproxy version is 2.2.9-2~bpo10+1
[NOTICE] 353/134531 (1) : path to executable is /usr/sbin/haproxy
[ALERT] 353/134531 (1) : sendmsg()/writev() failed in logger #1: No such file or directory (errno=2)
[WARNING] 353/134531 (8) : Stopping frontend controlPlane in 0 ms.
[WARNING] 353/134531 (8) : Stopping backend kube-apiservers in 0 ms.
[WARNING] 353/134531 (8) : Stopping frontend GLOBAL in 0 ms.
[WARNING] 353/134531 (8) : Proxy controlPlane stopped (cumulated conns: FE: 6, BE: 0).
[WARNING] 353/134531 (8) : Proxy kube-apiservers stopped (cumulated conns: FE: 0, BE: 6).
[WARNING] 353/134531 (8) : Proxy GLOBAL stopped (cumulated conns: FE: 0, BE: 0).
[NOTICE] 353/134531 (1) : New worker #1 (45) forked
[WARNING] 353/134531 (1) : Former worker #1 (8) exited with code 0 (Exit)
[WARNING] 353/134532 (45) : Server kube-apiservers/ecoqube-dev-bbqhg-42n7f is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 353/134532 (45) : backend 'kube-apiservers' has no server available!
[WARNING] 353/134551 (45) : Server kube-apiservers/ecoqube-dev-bbqhg-42n7f is UP, reason: Layer7 check passed, code: 200, check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

This time haproxy doesn't exit and the cluster can be initialized; it seems weird to me that it behaves differently.

killianmuldoon commented 1 year ago

Nothing in those logs (other than "Exiting Master process...") looks unusual. Can you get information on what actually killed the haproxy container from Docker inspect?

criscola commented 1 year ago

Not seeing much here. Was there even an attempt to restart it?

$ docker inspect 68c11dc186c1
[
    {
        "Id": "68c11dc186c167288e6b08e857af8887359ab4a81fa943dc8bbc1ebd59709946",
        "Created": "2022-12-20T13:45:28.762773705Z",
        "Path": "haproxy",
        "Args": [
            "-sf",
            "7",
            "-W",
            "-db",
            "-f",
            "/usr/local/etc/haproxy/haproxy.cfg"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2022-12-20T13:45:29.042208224Z",
            "FinishedAt": "2022-12-20T13:52:04.616036701Z"
        },
        "Image": "sha256:083ad526a17e665e04888ac20ffb085a3a83b058b831ea5cf28164c9fc36c306",
        "ResolvConfPath": "/var/lib/docker/containers/68c11dc186c167288e6b08e857af8887359ab4a81fa943dc8bbc1ebd59709946/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/68c11dc186c167288e6b08e857af8887359ab4a81fa943dc8bbc1ebd59709946/hostname",
        "HostsPath": "/var/lib/docker/containers/68c11dc186c167288e6b08e857af8887359ab4a81fa943dc8bbc1ebd59709946/hosts",
        "LogPath": "/var/lib/docker/containers/68c11dc186c167288e6b08e857af8887359ab4a81fa943dc8bbc1ebd59709946/68c11dc186c167288e6b08e857af8887359ab4a81fa943dc8bbc1ebd59709946-json.log",
        "Name": "/ecoqube-dev-lb",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "unconfined",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/lib/modules:/lib/modules:ro"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "kind",
            "PortBindings": {
                "6443/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "37973"
                    }
                ]
            },
            "RestartPolicy": {
                "Name": "unless-stopped",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "CgroupnsMode": "private",
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": true,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [
                "seccomp=unconfined",
                "label=disable"
            ],
            "Tmpfs": {
                "/run": "",
                "/tmp": ""
            },
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": null,
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": null,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": null,
            "ReadonlyPaths": null
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/df2e414347b2c3b8a77ec5fb11fc5136305a5e84314c84da268c0e3fda126082-init/diff:/var/lib/docker/overlay2/58b22042a6cda25a15ca6e1c3a39500b2c3b24bc0a5fa34c7b11058346396628/diff:/var/lib/docker/overlay2/1ea3071304fa913301959e1da221f569e18505520d99078de07aea0a2d84f51a/diff:/var/lib/docker/overlay2/f5b35a9600d3b7350e48b753302a7c0d5a3bae7b48bd637785046546acb57d23/diff:/var/lib/docker/overlay2/cd9ece03c61a95227cd7df10670ad59882e0c0ecbee17b5a3a9de5ae2281f015/diff:/var/lib/docker/overlay2/9d3af073ed4440b6e4c94d17ca0479f25817fb19edae3e2c55e70b7b5a275596/diff",
                "MergedDir": "/var/lib/docker/overlay2/df2e414347b2c3b8a77ec5fb11fc5136305a5e84314c84da268c0e3fda126082/merged",
                "UpperDir": "/var/lib/docker/overlay2/df2e414347b2c3b8a77ec5fb11fc5136305a5e84314c84da268c0e3fda126082/diff",
                "WorkDir": "/var/lib/docker/overlay2/df2e414347b2c3b8a77ec5fb11fc5136305a5e84314c84da268c0e3fda126082/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/lib/modules",
                "Destination": "/lib/modules",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "volume",
                "Name": "bdb23fbb33fdde3878015b72fe4456da150804f1a44e320a9cd23e36bcfae1eb",
                "Source": "/var/lib/docker/volumes/bdb23fbb33fdde3878015b72fe4456da150804f1a44e320a9cd23e36bcfae1eb/_data",
                "Destination": "/var",
                "Driver": "local",
                "Mode": "",
                "RW": true,
                "Propagation": ""
            }
        ],
        "Config": {
            "Hostname": "ecoqube-dev-lb",
            "Domainname": "",
            "User": "0",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "37973/tcp": {},
                "6443/tcp": {}
            },
            "Tty": true,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt"
            ],
            "Cmd": null,
            "Image": "kindest/haproxy:v20210715-a6da3463",
            "Volumes": {
                "/var": {}
            },
            "WorkingDir": "/",
            "Entrypoint": [
                "haproxy",
                "-sf",
                "7",
                "-W",
                "-db",
                "-f",
                "/usr/local/etc/haproxy/haproxy.cfg"
            ],
            "OnBuild": null,
            "Labels": {
                "io.x-k8s.kind.cluster": "ecoqube-dev",
                "io.x-k8s.kind.role": "external-load-balancer"
            },
            "StopSignal": "SIGUSR1"
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "4f91480bf67346919d74edd49617018fd92167f1876387945ea67630f99bab86",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "/var/run/docker/netns/4f91480bf673",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "kind": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": [
                        "68c11dc186c1",
                        "ecoqube-dev-lb"
                    ],
                    "NetworkID": "29a148be6436fea9f8a0d3e5eb550756fcf513d75c01f7d15b58135e3f6ab785",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "",
                    "DriverOpts": null
                }
            }
        }
    }
]
killianmuldoon commented 1 year ago

Hmm - I'm really not sure what's going on here - maybe there's some additional info in journalctl for docker? I still don't really understand what causes the haproxy container to exit.
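
For example, something like this (assuming the standard docker.service unit; grep for the container ID prefix from docker ps -a):

# dockerd-side events for the lb container around the time it exited
journalctl -u docker.service --since "1 hour ago" | grep -i <container-id-prefix>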

sbueringer commented 1 year ago

Just for confirmation. The current error is still:

E1213 13:40:20.163324 1 controller.go:326] "Reconciler error" err="failed to get ip for the load balancer: load balancer IP cannot be empty: container capi-quickstart-lb does not have an associated IP address" controller="dockercluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerCluster" DockerCluster="default/capi-quickstart-lmhsm" namespace="default" name="capi-quickstart-lmhsm" reconcileID=fb7ba986-d9c7-4597-8f8a-b7d989a01601

The last time I had this error my Docker network was not configured correctly (the docker network didn't exist) and thus the lb container didn't get an IP. Usually kind create cluster creates the network in the Docker engine and then CAPD re-uses it. A correct network should look something like this:

$ docker network list -f name=kind
NETWORK ID     NAME      DRIVER    SCOPE
56aa1f164407   kind      bridge    local
$ docker network inspect kind
[
    {
        "Name": "kind",
        "Id": "56aa1f164407508286bd7e7b56f48dc40c2b2f3be4566abed9f50e2c60d8ec03",
        "Created": "2022-10-20T20:04:57.24810756+02:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                },
                {
                    "Subnet": "fc00:f853:ccd:e793::/64",
                    "Gateway": "fc00:f853:ccd:e793::1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "a919a4b491f5ca226117ddb39667374c8c53bd7e5394d75c30ead489222973d0": {
                "Name": "capi-test-control-plane",
                "EndpointID": "791a6236f4dbe9e1bb68e7a7354bc8315caf15d5f5100c1f763b0d1b8622cfca",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": "fc00:f853:ccd:e793::2/64"
            },
            "e46becf30116d3f0f2755fbb02c510e5f1ba7f540892108c214e34fe7840968a": {
                "Name": "kind-registry",
                "EndpointID": "89b3cffdbc97b929dea46e96f1e5987975a2f983729a424e5618638e1bd131b3",
                "MacAddress": "02:42:ac:12:00:06",
                "IPv4Address": "172.18.0.6/16",
                "IPv6Address": "fc00:f853:ccd:e793::6/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

(IPv6 stuff can vary a bit based on config)

criscola commented 1 year ago

Hi sbueringer, I hope you had great holidays. This is what I get by running the same commands:

$ docker network list -f name=kind
NETWORK ID     NAME      DRIVER    SCOPE
29a148be6436   kind      bridge    local
$ docker network inspect kind
[
    {
        "Name": "kind",
        "Id": "29a148be6436fea9f8a0d3e5eb550756fcf513d75c01f7d15b58135e3f6ab785",
        "Created": "2022-05-06T10:19:41.805997131+02:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                },
                {
                    "Subnet": "fc00:f853:ccd:e793::/64",
                    "Gateway": "fc00:f853:ccd:e793::1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "14ee3d9ceecd3b62014a516b40c03fa681d8d59e7b5967088537d16e4b4932c6": {
                "Name": "ecoqube-dev-cjltn-9fnfn",
                "EndpointID": "aa3a717ee19355379050f6f67282fcf0827310f28eb4b6b060f6725ff0337529",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": "fc00:f853:ccd:e793::2/64"
            },
            "34fb562bdbb4e61076c202e5ede80a01e39f93d3451a522421c27885d140385f": {
                "Name": "ecoqube-dev-default-worker-topo-dt48g-b99d74ddc-nd6dg",
                "EndpointID": "5b176229c224cb0f50d49188eb9cb302cf97b2f330f34ac71398414e4aced489",
                "MacAddress": "02:42:ac:12:00:05",
                "IPv4Address": "172.18.0.5/16",
                "IPv6Address": "fc00:f853:ccd:e793::5/64"
            },
            "46d9d4c9fd27ea21d43ff7e547ecf36b8c1daaeab70828a62fb1692ed7452e80": {
                "Name": "kind-control-plane",
                "EndpointID": "923463d891f6270906fac0696707cc3de9a85c6cda8099469beea22322c51a66",
                "MacAddress": "02:42:ac:12:00:04",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": "fc00:f853:ccd:e793::4/64"
            },
            "eab855ab9c9a715a2a07ee6c4ab55b1a1b0df4bdeb0e742b17ccf308c63fc7eb": {
                "Name": "ecoqube-dev-default-worker-topo-dt48g-b99d74ddc-vhjkd",
                "EndpointID": "23ec82c914495d062d9ea717d47393fb1508bc7438ff24159f56f5f617291bac",
                "MacAddress": "02:42:ac:12:00:06",
                "IPv4Address": "172.18.0.6/16",
                "IPv6Address": "fc00:f853:ccd:e793::6/64"
            },
            "fea8e436fb47f9488c685d2d7d487f91e77e544e64c8ce800f9a163436683b97": {
                "Name": "ecoqube-dev-default-worker-topo-dt48g-b99d74ddc-mxj5x",
                "EndpointID": "32f99c00929cf40d6ffec41cef7b45d2c40e0d6c9e86858ced4c25a4104f335d",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": "fc00:f853:ccd:e793::3/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

The lb container is not present on the network. I wonder if I should seek to solve this somewhere more Docker-related than in the CAPI repo.
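
(For reference, the containers attached to the network can be listed with a jq filter over the same output:)

docker network inspect kind | jq '.[0].Containers[].Name'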

fabriziopandini commented 1 year ago

/triage not-reproducible

I wonder if I should seek to solve this somewhere more Docker related than the CAPI repo.

I kind of agree; it seems there is something going on at a lower level than the one CAPI works at.

sbueringer commented 1 year ago

Hm yup. Network looks okay.

aauren commented 1 year ago

Sorry to necro this issue, but I think that I've found a reasonable way to reproduce this problem.

This has been happening to me when I have an old haproxy container that has been killed in Docker but not removed. This typically happens when I lose the kind cluster that I run CAPI inside of and can no longer delete the cluster.

In this case, running docker ps -a on the machine will show the old containers:

$ docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS                     PORTS                       NAMES
2e95e78b5c7c   kindest/node:v1.26.3                 "/usr/local/bin/entr…"   17 minutes ago   Up 17 minutes              127.0.0.1:46021->6443/tcp   kubemark-control-plane
7da7c7cb20ef   kindest/node:v1.26.3                 "/usr/local/bin/entr…"   17 minutes ago   Up 17 minutes                                          kubemark-worker
581d77fac175   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   3 weeks ago      Exited (130) 2 weeks ago                               kube-node-mgmt-control-plane-gxc9k
8a1aca619a83   kindest/haproxy:v20230227-d46f45b6   "haproxy -sf 7 -W -d…"   3 weeks ago      Exited (0) 2 weeks ago                                 kube-node-mgmt-lb

Using docker rm <container_id> at this point will make it so that CAPI comes up correctly.
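
A sketch of the cleanup, assuming the leftover containers still carry the kind/CAPD labels shown in the inspect output earlier in this thread (adjust the cluster name to yours):

# remove only *exited* containers that belong to the old workload cluster
docker ps -aq --filter "status=exited" --filter "label=io.x-k8s.kind.cluster=kube-node-mgmt" | xargs -r docker rm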

Steps for full reproduction:

  1. Create kind cluster (kind create cluster ...)
  2. Initialize CAPI (clusterctl init ...)
  3. Wait for pods to stabilize
  4. Generate the cluster (clusterctl generate cluster ... | kubectl apply -f -)
  5. See cluster come up correctly
  6. Delete kind cluster out from under it: (kind delete clusters ...)
  7. Kill the remaining CAPI containers (docker kill <container_id>) - This is just for simulation; in everyday use I find that they usually kill themselves after a certain amount of time, or die and are never restarted
  8. Run steps 1 - 4 again
  9. See the cluster not come up successfully, and see the error in the description:
    2023-05-10T01:39:50.582661780Z E0510 01:39:50.582602       1 controller.go:329] "Reconciler error" err="failed to get ip for the load balancer: load balancer IP cannot be empty: container kube-node-mgmt-lb does not have an associated IP address" controller="dockercluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerCluster" DockerCluster="default/kube-node-mgmt" namespace="default" name="kube-node-mgmt" reconcileID=77d50330-7185-4279-9891-7bbc5696de96
  10. Delete the old docker containers (haproxy and control-plane): docker rm <container_id>
  11. Delete the CAPI cluster: kubectl delete cluster ...
  12. Run step 4 again
  13. See the cluster come up correctly this time

It seems that if there is a remnant of the haproxy container hanging around, CAPD will wait forever for it to come back up rather than either starting it again or removing the container and re-creating it.

aauren commented 1 year ago

I don't know a ton about this codebase, but from what I can see, the loadbalancer container gets looked up here: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/internal/docker/loadbalancer.go#L62

When the container exists but is not running, that container is still set on the LoadBalancer.

Later on, when we attempt to create the load balancer, we check whether the container already exists (https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/internal/docker/loadbalancer.go#L119) and, since it does, we skip creating it.

However, at this point it is shut down and doesn't have an IP address, so when we go to fetch the IP later in the logic (https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/internal/docker/loadbalancer.go#L201) we fall into the error condition.
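
The stuck state is easy to observe from outside CAPD; something along these lines (container name from my repro above) prints "exited" together with an empty IP, which is exactly the condition the reconciler then errors on:

# status plus the IP on the kind network; an exited container has no IP
docker inspect -f '{{.State.Status}} {{.NetworkSettings.Networks.kind.IPAddress}}' kube-node-mgmt-lb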

@criscola would you be willing to consider re-opening this issue and looking into the code points above to see if there is some way this condition could be handled better?