siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

talosctl cluster create timeout #8908

Open Geethree opened 5 months ago

Geethree commented 5 months ago

When following the Sidero getting started guide to bring up a cluster with:

export HOST_IP="192.168.1.175"

talosctl cluster create \
  --name sidero-demo \
  -p 67:67/udp,69:69/udp,8081:8081/tcp,51821:51821/udp \
  --workers 0 \
  --config-patch '[{"op": "add", "path": "/cluster/allowSchedulingOnControlPlanes", "value": true}]' \
  --endpoint $HOST_IP

The guide states that HOST_IP should be the IP address of my workstation, which I have set above. When I do this, the tooling hangs here:

geethree@masterblaster:/tmp$ talosctl cluster create   --name sidero-demo   -p 67:67/udp,69:69/udp,8081:8081/tcp,51821:51821/udp   --workers 0   --config-patch '[{"op": "add", "path": "/cluster/allowSchedulingOnControlPlanes", "value": true}]'   --endpoint $HOST_IP
validating CIDR and reserving IPs
generating PKI and tokens
creating network sidero-demo
creating controlplane nodes
creating worker nodes
waiting for API

It seems like etcd isn't coming up:

[talos] 2024/06/01 19:57:29 initialize sequence: 4 phase(s)
[talos] 2024/06/01 19:57:29 phase systemRequirements (1/4): 1 tasks(s)
[talos] 2024/06/01 19:57:29 task setupSystemDirectory (1/1): starting
[talos] 2024/06/01 19:57:29 task setupSystemDirectory (1/1): done, 76.279µs
[talos] 2024/06/01 19:57:29 phase systemRequirements (1/4): done, 118.976µs
[talos] 2024/06/01 19:57:29 phase etc (2/4): 3 tasks(s)
[talos] 2024/06/01 19:57:29 task setUserEnvVars (3/3): starting
[talos] 2024/06/01 19:57:29 task setUserEnvVars (3/3): done, 11.564µs
[talos] 2024/06/01 19:57:29 task createOSReleaseFile (2/3): starting
[talos] 2024/06/01 19:57:29 task CreateSystemCgroups (1/3): starting
[talos] 2024/06/01 19:57:29 task createOSReleaseFile (2/3): done, 173.486µs
[talos] 2024/06/01 19:57:29 pre-created iptables-nft table 'mangle'/'KUBE-IPTABLES-HINT' {"component": "controller-runtime", "controller": "network.NfTablesChainController"}
[talos] 2024/06/01 19:57:29 node identity established {"component": "controller-runtime", "controller": "cluster.NodeIdentityController", "node_id": "QjSC6dMbE2z99WPz1nu26EY8ce2YOQrUyuZGCJw1TGF"}
[talos] 2024/06/01 19:57:29 nftables chains updated {"component": "controller-runtime", "controller": "network.NfTablesChainController", "chains": []}
[talos] 2024/06/01 19:57:29 setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["127.0.0.11"]}
[talos] 2024/06/01 19:57:29 setting time servers {"component": "controller-runtime", "controller": "network.TimeServerSpecController", "addresses": ["time.cloudflare.com"]}
[talos] 2024/06/01 19:57:29 setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["127.0.0.11"]}
[talos] 2024/06/01 19:57:29 setting time servers {"component": "controller-runtime", "controller": "network.TimeServerSpecController", "addresses": ["time.cloudflare.com"]}
[talos] 2024/06/01 19:57:29 task CreateSystemCgroups (1/3): done, 6.854672ms
[talos] 2024/06/01 19:57:29 phase etc (2/4): done, 6.907579ms
[talos] 2024/06/01 19:57:29 phase machined (3/4): 2 tasks(s)
[talos] 2024/06/01 19:57:29 task startContainerd (2/2): starting
[talos] 2024/06/01 19:57:29 task startMachined (1/2): starting
[talos] 2024/06/01 19:57:29 TPM device is not available, skipping PCR extension
[talos] 2024/06/01 19:57:29 service[containerd](Starting): Starting service
[talos] 2024/06/01 19:57:29 service[containerd](Preparing): Running pre state
[talos] 2024/06/01 19:57:29 service[containerd](Preparing): Creating service runner
[talos] 2024/06/01 19:57:29 service[machined](Starting): Starting service
[talos] 2024/06/01 19:57:29 service[machined](Preparing): Running pre state
[talos] 2024/06/01 19:57:29 service[machined](Preparing): Creating service runner
[talos] 2024/06/01 19:57:29 service[machined](Running): Service started as goroutine
[talos] 2024/06/01 19:57:29 service[containerd](Running): Process Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"]) started with PID 25
[talos] 2024/06/01 19:57:30 service[machined](Running): Health check successful
[talos] 2024/06/01 19:57:30 task startMachined (1/2): done, 1.001118118s
[talos] 2024/06/01 19:57:30 service[containerd](Running): Health check successful
[talos] 2024/06/01 19:57:30 task startContainerd (2/2): done, 1.002830367s
[talos] 2024/06/01 19:57:30 phase machined (3/4): done, 1.002878378s
[talos] 2024/06/01 19:57:30 phase config (4/4): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task loadConfig (1/1): starting
[talos] 2024/06/01 19:57:30 downloading config {"component": "controller-runtime", "controller": "config.AcquireController", "platform": "container"}
[talos] 2024/06/01 19:57:30 fetching machine config from: USERDATA environment variable
[talos] 2024/06/01 19:57:30 machine config loaded successfully {"component": "controller-runtime", "controller": "config.AcquireController", "sources": ["container"]}
[talos] 2024/06/01 19:57:30 task loadConfig (1/1): done, 3.659106ms
[talos] 2024/06/01 19:57:30 phase config (4/4): done, 3.705849ms
[talos] 2024/06/01 19:57:30 initialize sequence: done: 1.0136549s
[talos] 2024/06/01 19:57:30 install sequence: 0 phase(s)
[talos] 2024/06/01 19:57:30 install sequence: done: 7.032µs
[talos] 2024/06/01 19:57:30 service[apid](Starting): Starting service
[talos] 2024/06/01 19:57:30 service[apid](Waiting): Waiting for service "containerd" to be "up", api certificates
[talos] 2024/06/01 19:57:30 boot sequence: 10 phase(s)
[talos] 2024/06/01 19:57:30 phase saveConfig (1/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task saveConfig (1/1): starting
[talos] 2024/06/01 19:57:30 task saveConfig (1/1): done, 145.985µs
[talos] 2024/06/01 19:57:30 phase saveConfig (1/10): done, 177.718µs
[talos] 2024/06/01 19:57:30 phase memorySizeCheck (2/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task memorySizeCheck (1/1): starting
[talos] 2024/06/01 19:57:30 skipping memory size check in the container
[talos] 2024/06/01 19:57:30 task memorySizeCheck (1/1): done, 28.208µs
[talos] 2024/06/01 19:57:30 phase memorySizeCheck (2/10): done, 47.712µs
[talos] 2024/06/01 19:57:30 phase diskSizeCheck (3/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task diskSizeCheck (1/1): starting
[talos] 2024/06/01 19:57:30 skipping disk size check in the container
[talos] 2024/06/01 19:57:30 task diskSizeCheck (1/1): done, 14.101µs
[talos] 2024/06/01 19:57:30 phase diskSizeCheck (3/10): done, 31.096µs
[talos] 2024/06/01 19:57:30 phase env (4/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task setUserEnvVars (1/1): starting
[talos] 2024/06/01 19:57:30 task setUserEnvVars (1/1): done, 9.191µs
[talos] 2024/06/01 19:57:30 phase env (4/10): done, 24.882µs
[talos] 2024/06/01 19:57:30 phase dbus (5/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task startDBus (1/1): starting
[talos] 2024/06/01 19:57:30 kubeprism KubePrism is enabled {"component": "controller-runtime", "controller": "k8s.KubePrismController", "endpoint": "127.0.0.1:7445"}
[talos] 2024/06/01 19:57:30 task startDBus (1/1): done, 650.353µs
[talos] 2024/06/01 19:57:30 phase dbus (5/10): done, 682.486µs
[talos] 2024/06/01 19:57:30 phase sharedFilesystems (6/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task setupSharedFilesystems (1/1): starting
[talos] 2024/06/01 19:57:30 task setupSharedFilesystems (1/1): done, 44.09µs
[talos] 2024/06/01 19:57:30 phase sharedFilesystems (6/10): done, 65.623µs
[talos] 2024/06/01 19:57:30 phase var (7/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 assigned address {"component": "controller-runtime", "controller": "network.AddressSpecController", "address": "10.96.0.9/32", "link": "lo"}
[talos] 2024/06/01 19:57:30 task setupVarDirectory (1/1): starting
[talos] 2024/06/01 19:57:30 task setupVarDirectory (1/1): done, 584.631µs
[talos] 2024/06/01 19:57:30 phase var (7/10): done, 617.583µs
[talos] 2024/06/01 19:57:30 phase userSetup (8/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task writeUserFiles (1/1): starting
[talos] 2024/06/01 19:57:30 task writeUserFiles (1/1): done, 10.368µs
[talos] 2024/06/01 19:57:30 phase userSetup (8/10): done, 35.04µs
[talos] 2024/06/01 19:57:30 phase extendPCRStartAll (9/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task extendPCRStartAll (1/1): starting
[talos] 2024/06/01 19:57:30 TPM device is not available, skipping PCR extension
[talos] 2024/06/01 19:57:30 task extendPCRStartAll (1/1): done, 54.229µs
[talos] 2024/06/01 19:57:30 phase extendPCRStartAll (9/10): done, 87.985µs
[talos] 2024/06/01 19:57:30 phase startEverything (10/10): 1 tasks(s)
[talos] 2024/06/01 19:57:30 task startAllServices (1/1): starting
[talos] 2024/06/01 19:57:30 service[cri](Starting): Starting service
[talos] 2024/06/01 19:57:30 service[cri](Waiting): Waiting for network
[talos] 2024/06/01 19:57:30 service[cri](Preparing): Running pre state
[talos] 2024/06/01 19:57:30 service[cri](Preparing): Creating service runner
[talos] 2024/06/01 19:57:30 service[trustd](Starting): Starting service
[talos] 2024/06/01 19:57:30 service[trustd](Waiting): Waiting for service "containerd" to be "up", time sync, network
[talos] 2024/06/01 19:57:30 service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 59
[talos] 2024/06/01 19:57:30 service[etcd](Starting): Starting service
[talos] 2024/06/01 19:57:30 service[etcd](Waiting): Waiting for service "cri" to be "up", time sync, network, etcd spec
[talos] task startAllServices (1/1): 2024/06/01 19:57:30 waiting for 7 services
[talos] task startAllServices (1/1): 2024/06/01 19:57:30 service "apid" to be "up", service "containerd" to be "up", service "cri" to be "up", service "etcd" to be "up", service "kubelet" to be "up", service "machined" to be "up", service "trustd" to be "up"
[talos] 2024/06/01 19:57:30 service[trustd](Preparing): Running pre state
[talos] 2024/06/01 19:57:30 created dns upstream {"component": "controller-runtime", "controller": "network.DNSUpstreamController", "addr": "127.0.0.11"}
[talos] 2024/06/01 19:57:30 updated dns server nameservers {"component": "dns-resolve-cache", "addrs": ["127.0.0.11:53"]}
[talos] 2024/06/01 19:57:30 service[trustd](Preparing): Creating service runner
[talos] 2024/06/01 19:57:30 service[apid](Preparing): Running pre state
[talos] 2024/06/01 19:57:30 service[apid](Preparing): Creating service runner
[talos] 2024/06/01 19:57:30 service[kubelet](Starting): Starting service
[talos] 2024/06/01 19:57:30 service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
[talos] 2024/06/01 19:57:30 kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://10.5.0.2:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 10.5.0.2:6443: connect: connection refused"}
[talos] 2024/06/01 19:57:30 service[trustd](Running): Started task trustd (PID 126) for container trustd
[talos] 2024/06/01 19:57:30 service[apid](Running): Started task apid (PID 127) for container apid
[talos] 2024/06/01 19:57:31 service[etcd](Waiting): Waiting for service "cri" to be "up"
[talos] 2024/06/01 19:57:31 service[cri](Running): Health check successful
[talos] 2024/06/01 19:57:31 service[etcd](Preparing): Running pre state
[talos] 2024/06/01 19:57:31 service[kubelet](Preparing): Running pre state
[talos] 2024/06/01 19:57:31 service[apid](Running): Health check successful
[talos] 2024/06/01 19:57:31 service[trustd](Running): Health check successful
[talos] 2024/06/01 19:57:31 kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://10.5.0.2:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 10.5.0.2:6443: connect: connection refused"}
[talos] 2024/06/01 19:57:34 kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://10.5.0.2:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 10.5.0.2:6443: connect: connection refused"}
[talos] 2024/06/01 19:57:37 kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://10.5.0.2:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 10.5.0.2:6443: connect: connection refused"}
[talos] 2024/06/01 19:57:40 controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\ttimeout"}
[talos] task startAllServices (1/1): 2024/06/01 19:57:45 service "etcd" to be "up", service "kubelet" to be "up"
[talos] 2024/06/01 19:57:45 etcd is waiting to join the cluster, if this node is the first node in the cluster, please run `talosctl bootstrap` against one of the following IPs:
[talos] 2024/06/01 19:57:45 [10.5.0.2]

I've narrowed this down to the --endpoint flag. If I don't set it, the cluster comes up.
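For reference, the same command with only --endpoint removed (everything else unchanged) is what comes up successfully for me:

talosctl cluster create \
  --name sidero-demo \
  -p 67:67/udp,69:69/udp,8081:8081/tcp,51821:51821/udp \
  --workers 0 \
  --config-patch '[{"op": "add", "path": "/cluster/allowSchedulingOnControlPlanes", "value": true}]'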

Any thoughts on what is going on here?

talosctl v1.7.4, docker 25.0.2

smira commented 5 months ago

This might be a valid issue now with talosctl 1.7.x, as it changed the way ports are exposed. I guess this command comes from the Sidero Metal docs. Please check whether port 8081 is exposed correctly on the host; if it is, we can simply update the Sidero Metal docs to drop the endpoint.
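For example, something along these lines should show whether 8081 is actually published on the host (a rough sketch; the container name sidero-demo-controlplane-1 is an assumption based on the default naming used by talosctl cluster create):

# list all containers with their published ports
docker ps --format '{{.Names}}\t{{.Ports}}'
# or query the specific container (name assumed)
docker port sidero-demo-controlplane-1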

glitchcrab commented 2 months ago

@smira when you say 'Please check whether port 8081 is exposed correctly on the host', what exactly do you mean? I've just upgraded talosctl from a 1.6.x release to the latest 1.7 and I'm now seeing this issue too; my cluster create command already exposes 8081 though:

        ; talosctl cluster create --name sidero-bootstrap \
            -p 69:69/udp,8081:8081/tcp,51821:51821/udp --workers 0 --config-patch \
            '[{\"op\": \"add\", \"path\": \"/cluster/allowSchedulingOnMasters\", \"value\": true}]' \
            --endpoint 172.25.100.2 --nameservers 10.101.0.2,10.101.0.3 \
            --docker-host-ip 172.25.100.2 --memory 6144

Is this not the correct way of exposing the port?

smira commented 2 months ago

I mean: try dropping the --endpoint argument, and use the docker CLI to inspect how the ports are exposed. If there's a bug there, please report an issue.
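For example, something like this (the container name is an assumption based on the default naming for your sidero-bootstrap cluster) shows the actual port bindings Docker has set up:

docker inspect -f '{{json .NetworkSettings.Ports}}' sidero-bootstrap-controlplane-1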