siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev

Unable to use `talosctl cluster create` reliably on NixOS #9569

Open bbigras opened 2 hours ago

bbigras commented 2 hours ago

Bug Report

Description

I can't reliably create a cluster with `talosctl cluster create`; it doesn't succeed every time. Sometimes it works after a reboot, but that may just be random.

It seems to fail well before the default 20m0s timeout set by `--wait-timeout`.

Logs

❯ talosctl cluster create
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating controlplane nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-3"
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: OK
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: OK
waiting for no diagnostics: OK
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: OK
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: OK
waiting for all k8s nodes to report ready: OK
waiting for kube-proxy to report ready: OK
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: OK

merging kubeconfig into "/home/bbigras/.kube/config"
renamed cluster "talos-default" -> "talos-default-2"
renamed auth info "admin@talos-default" -> "admin@talos-default-2"
renamed context "admin@talos-default" -> "admin@talos-default-2"
PROVISIONER           docker
NAME                  talos-default
NETWORK NAME          talos-default
NETWORK CIDR          10.5.0.0/24
NETWORK GATEWAY       10.5.0.1
NETWORK MTU           1500
KUBERNETES ENDPOINT   https://127.0.0.1:32999

NODES:

NAME                            TYPE           IP         CPU    RAM      DISK
/talos-default-controlplane-1   controlplane   10.5.0.2   2.00   2.1 GB   -
/talos-default-worker-1         worker         10.5.0.3   2.00   2.1 GB   -
❯ talosctl cluster destroy
destroying node talos-default-controlplane-1
destroying node talos-default-worker-1
destroying network talos-default
❯ talosctl cluster create
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating controlplane nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-4"
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: OK
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: OK
waiting for no diagnostics: OK
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: OK
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: OK
waiting for all k8s nodes to report ready: OK
waiting for kube-proxy to report ready: OK
◱ waiting for coredns to report ready: no ready pods found for namespace "kube-system" and label selector "k8s-app=kube-dns"
context deadline exceeded

Environment

smira commented 2 hours ago

You just need to dig into that further to understand what's wrong; it might be unrelated to NixOS.

Fetch the kubeconfig with `talosctl kubeconfig`, then run `kubectl get pods` and figure out why CoreDNS is not ready. Nothing Talos-specific here; it's just regular Kubernetes troubleshooting.
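For example, a minimal sketch of that troubleshooting flow (the label selector comes from the error message above; the rest is standard kubectl usage):

```sh
# merge the cluster's kubeconfig into ~/.kube/config
talosctl kubeconfig

# see which kube-system pods (CoreDNS included) are not ready
kubectl get pods -n kube-system

# the Events section usually explains why the pod cannot be
# scheduled or become ready
kubectl describe pod -n kube-system -l k8s-app=kube-dns
```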

You can access the Kubernetes API as soon as the `waiting for all k8s nodes to report ready: OK` check completes (you can ^C the `cluster create`; at that point it just runs the health checks and doesn't do anything else).

bbigras commented 2 hours ago
Warning  FailedScheduling  2m37s               default-scheduler  0/2 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
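For completeness, a hedged sketch of how the taint and the underlying DiskPressure condition can be confirmed on the nodes (standard kubectl commands; output formatting may differ):

```sh
# list each node together with its taints
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# node conditions show DiskPressure=True when the kubelet thinks
# the filesystem is too full
kubectl describe nodes | grep -A 10 'Conditions:'
```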
smira commented 2 hours ago

This is partly how Kubernetes works, mixed with how Docker works.

In the end, your Kubernetes running in Docker sees the free space of the host partition where the Docker data directory lives. So if that partition is low on disk space (overall), the kubelet stops scheduling pods. (Since the host partition is checked by percentage free, it may require far more free space than the cluster actually needs.)

Make sure you have enough space, and you should be good!
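A quick way to check, sketched under the assumption that Docker's data directory sits on the host root partition (the path reported by `docker info` is what actually matters):

```sh
# find Docker's data directory and see how full its partition is;
# the kubelet taints on percentage free, so a nearly-full host
# partition triggers disk pressure even if the cluster needs little
df -h "$(docker info --format '{{.DockerRootDir}}')"
```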