siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

[1.8.0] coredns does not start with talosctl cluster create #9419

Open alongwill opened 1 week ago

alongwill commented 1 week ago

Bug Report

Description

When creating a cluster with talosctl and Docker, coredns does not start.

talosctl cluster create
View logs
talosctl cluster create
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating controlplane nodes
creating worker nodes
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: OK
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: OK
waiting for no diagnostics: OK
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: OK
waiting for all control plane static pods to be running: OK
β—± waiting for all control plane components to be ready: expected number of pods for kube-scheduler to be 1, got 0
waiting for all control plane components to be ready: OK
waiting for all k8s nodes to report ready: OK
waiting for kube-proxy to report ready: OK
β—³ waiting for coredns to report ready: no ready pods found for namespace "kube-system" and label selector "k8s-app=kube-dns"
context deadline exceeded
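
For reference, the same readiness checks can be re-run later against the already-created cluster; a minimal sketch, assuming the talosconfig context written by talosctl cluster create is still the current one:

# Re-run the built-in cluster health checks (the same "waiting for ..." sequence as above)
talosctl health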

Logs

coredns Pods are not ready.

View pod statuses
kubectl get pods -A
NAMESPACE     NAME                                                   READY   STATUS    RESTARTS      AGE
kube-system   coredns-68d75fd545-bwtqm                               0/1     Running   0             12m
kube-system   coredns-68d75fd545-m27zf                               0/1     Running   0             12m
kube-system   kube-apiserver-talos-default-controlplane-1            1/1     Running   0             12m
kube-system   kube-controller-manager-talos-default-controlplane-1   1/1     Running   2 (13m ago)   11m
kube-system   kube-flannel-hlghk                                     1/1     Running   0             12m
kube-system   kube-flannel-zrgsw                                     1/1     Running   0             12m
kube-system   kube-proxy-6kfvc                                       1/1     Running   0             12m
kube-system   kube-proxy-fsxf9                                       1/1     Running   0             12m
kube-system   kube-scheduler-talos-default-controlplane-1            1/1     Running   2 (13m ago)   11m
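
The coredns health check waits on the label selector k8s-app=kube-dns, so the same selector narrows this output down to just the affected pods:

# Show only the pods the health check is waiting on (selector taken from the create log above)
kubectl get pods -n kube-system -l k8s-app=kube-dns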

coredns Pod logs

View coredns pod logs
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[2008292454]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (02-Oct-2024 09:42:02.185) (total time: 30003ms):
Trace[2008292454]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30002ms (09:42:32.187)
Trace[2008292454]: [30.003012847s] [30.003012847s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[219620330]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (02-Oct-2024 09:42:03.343) (total time: 30003ms):
Trace[219620330]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30003ms (09:42:33.346)
Trace[219620330]: [30.003374722s] [30.003374722s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1005043231]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (02-Oct-2024 09:42:18.890) (total time: 30004ms):
Trace[1005043231]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30004ms (09:42:48.895)
Trace[1005043231]: [30.00420643s] [30.00420643s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
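
All of the timeouts above are against 10.96.0.1:443, the in-cluster kubernetes Service VIP that kube-proxy programs on each node; a quick sanity check that the Service and its EndpointSlice exist (reachability from inside the pod is the part that is failing here):

# The API Service and its EndpointSlice should both exist; the pod simply cannot reach the VIP
kubectl get svc kubernetes -n default
kubectl get endpointslices -n default -l kubernetes.io/service-name=kubernetes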
View coredns pod events
kubectl describe -n kube-system po coredns-68d75fd545-bwtqm
...
Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Warning  FailedScheduling        16m                  default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling        16m                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled               15m                  default-scheduler  Successfully assigned kube-system/coredns-68d75fd545-bwtqm to talos-default-worker-1
  Warning  FailedCreatePodSandBox  15m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "73940f0a8b5724ae1fe13c750a886d2a0e9c39ac254ee82c6cc08462dee1cb3f": plugin type="flannel" failed (add): loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory
  Normal   Pulling                 15m                  kubelet            Pulling image "registry.k8s.io/coredns/coredns:v1.11.3"
  Normal   Pulled                  15m                  kubelet            Successfully pulled image "registry.k8s.io/coredns/coredns:v1.11.3" in 5.508s (5.508s including waiting). Image size: 16948420 bytes.
  Normal   Created                 15m                  kubelet            Created container coredns
  Normal   Started                 15m                  kubelet            Started container coredns
  Warning  Unhealthy               41s (x103 over 15m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
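
The 503s come from coredns' ready plugin, which stays unready until the kubernetes plugin above finishes its initial sync; if useful, the endpoint can be queried directly. A sketch assuming the default ready port 8181:

# In one terminal: forward the coredns ready endpoint (default port 8181)
kubectl -n kube-system port-forward pod/coredns-68d75fd545-bwtqm 8181:8181
# In another terminal: the probe stays 503 until the kubernetes plugin has synced
curl -i http://localhost:8181/ready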

Environment

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun 2 11:54:48 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

smira commented 1 week ago

@alongwill can you please check the kube-proxy logs to see if it complains about nftables? I wonder if the Docker Desktop Linux kernel is outdated.

20.10.7 is a version released 3 years ago!
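
For reference, something along these lines should pull the relevant logs, assuming the usual k8s-app=kube-proxy label on the kube-proxy pods:

# Dump logs from all kube-proxy pods, prefixed with the pod name, and look for nftables errors
kubectl logs -n kube-system -l k8s-app=kube-proxy --prefix | grep -i nftables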

alongwill commented 1 week ago

@smira kube-proxy logs show the following nftables errors:

I1002 10:10:35.532785       1 proxier.go:1180] "Sync failed" ipFamily="IPv6" retryingTime="30s"
E1002 10:10:58.514596       1 proxier.go:1806] "nftables sync failed" err=<
    /dev/stdin:78:70-88: Error: Could not process rule: No such file or directory
    add rule ip kube-proxy service-2QRHZV4L-default/kubernetes/tcp/https numgen random mod 1 vmap { 0 : goto endpoint-MGIKEWY5-default/kubernetes/tcp/https__10.5.0.2/6443 }
                                                                         ^^^^^^^^^^^^^^^^^^^
 > ipFamily="IPv4"
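
That failure is consistent with the kernel inside the Docker Desktop VM missing what kube-proxy's nftables mode needs (upstream Kubernetes documentation mentions kernel 5.13+ for the nftables backend); a quick way to see which kernel the containers actually run on:

# Containers share the Docker Desktop VM kernel, so this shows the kernel kube-proxy sees
docker run --rm alpine uname -r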

Apologies 🀦 I pasted the docker version from my Linux machine, where I was testing this at the same time and where it works. The version on my Mac is as follows:

docker version
Client:
 Version:           27.2.0
 API version:       1.47
 Go version:        go1.21.13
 Git commit:        3ab4256
 Built:             Tue Aug 27 14:14:45 2024
 OS/Arch:           darwin/arm64
 Context:           desktop-linux

Server: Docker Desktop 4.34.2 (167172)
 Engine:
  Version:          27.2.0
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.21.13
  Git commit:       3ab5c7d
  Built:            Tue Aug 27 14:15:41 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.7.20
  GitCommit:        8fc6bcff51318944179630522a095cc9dbf9f353
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

N.B. Creating a cluster on my Mac with the same Docker version and talosctl 1.7.6 works.

Summary:

Arch        talosctl   Working?
🍎 arm64    1.7.6      βœ…
🍎 arm64    1.8.0      ❌
🐧 x86_64   1.8.0      βœ…

smira commented 1 week ago

Yeah, I guess there's not much we can do if we want to stay consistent: with Kubernetes 1.31, Talos defaults kube-proxy to the nftables backend.

So you can bring up 1.8.0 with Kubernetes 1.30, for example, and it would probably work, but the underlying issue (the Docker Desktop kernel) is beyond our control.
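
For anyone hitting this on Docker Desktop, a minimal sketch of that workaround (the 1.30 patch release below is only an example):

# Pin the cluster to Kubernetes 1.30 so Talos does not switch kube-proxy to the nftables backend
talosctl cluster create --kubernetes-version 1.30.5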