siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Local docker cluster doesn't survive restart #4462

Closed: TimJones closed this issue 3 years ago

TimJones commented 3 years ago

Bug Report

Description

When creating a local Talos cluster with Docker, restarting the host or the containers leaves the containers unable to start.
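
In short, the reproduction is (the same commands are shown in full in the logs below):

❯ talosctl cluster create --name talos-docker-cluster-restart-test --workers 0
❯ docker stop talos-docker-cluster-restart-test-master-1
❯ docker start talos-docker-cluster-restart-test-master-1
❯ docker logs talos-docker-cluster-restart-test-master-1 --tail 5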

Logs

Create cluster:

❯ talosctl cluster create --name talos-docker-cluster-restart-test --workers 0
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-docker-cluster-restart-test
creating master nodes
creating worker nodes
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
waiting for apid to be ready: OK
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: OK
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: OK
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: OK

merging kubeconfig into "/home/tim/.kube/config"
PROVISIONER       docker
NAME              talos-docker-cluster-restart-test
NETWORK NAME      talos-docker-cluster-restart-test
NETWORK CIDR      10.5.0.0/24
NETWORK GATEWAY   10.5.0.1
NETWORK MTU       1500

NODES:

NAME                                          TYPE           IP         CPU    RAM      DISK
/talos-docker-cluster-restart-test-master-1   controlplane   10.5.0.2   2.00   2.1 GB   -

Everything is working correctly:

❯ docker ps                                                                   
CONTAINER ID   IMAGE                                 COMMAND        CREATED         STATUS         PORTS                                              NAMES
3d8a90120a83   ghcr.io/talos-systems/talos:v0.13.0   "/sbin/init"   7 minutes ago   Up 7 minutes   0.0.0.0:6443->6443/tcp, 0.0.0.0:50000->50000/tcp   talos-docker-cluster-restart-test-master-1

❯ docker logs talos-docker-cluster-restart-test-master-1 --tail 5 
[talos] 2021/10/28 12:56:44 created /v1/ConfigMap/kubeconfig-in-cluster {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
[talos] 2021/10/28 12:56:46 controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
[talos] 2021/10/28 12:57:00 task labelNodeAsMaster (1/1): done, 39.714940432s
[talos] 2021/10/28 12:57:00 phase labelMaster (9/9): done, 39.714996934s
[talos] 2021/10/28 12:57:00 boot sequence: done: 1m0.159302729s

❯ kubectl get pods -A
NAMESPACE     NAME                                                                 READY   STATUS    RESTARTS        AGE
kube-system   coredns-6ff77786fb-5jdzz                                             1/1     Running   0               6m29s
kube-system   coredns-6ff77786fb-mzlmc                                             1/1     Running   0               6m29s
kube-system   kube-apiserver-talos-docker-cluster-restart-test-master-1            1/1     Running   0               5m22s
kube-system   kube-controller-manager-talos-docker-cluster-restart-test-master-1   1/1     Running   1 (6m45s ago)   5m26s
kube-system   kube-flannel-496c9                                                   1/1     Running   0               6m15s
kube-system   kube-proxy-vhcjs                                                     1/1     Running   0               6m15s
kube-system   kube-scheduler-talos-docker-cluster-restart-test-master-1            1/1     Running   1 (6m45s ago)   5m38s

Restart the container

❯ docker stop talos-docker-cluster-restart-test-master-1         
talos-docker-cluster-restart-test-master-1

❯ docker start talos-docker-cluster-restart-test-master-1 
talos-docker-cluster-restart-test-master-1

Note that after the stop/start we never reach the "boot sequence: done" status:

❯ docker logs talos-docker-cluster-restart-test-master-1 --tail 5
[talos] 2021/10/28 13:04:18 controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Secret/bootstrap-token-b39i7a: Get \"https://localhost:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"}
[talos] 2021/10/28 13:04:18 controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Secret/bootstrap-token-b39i7a: Get \"https://localhost:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"}
[talos] 2021/10/28 13:04:20 service[kubelet](Running): Health check successful
[talos] 2021/10/28 13:04:22 controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object rbac.authorization.k8s.io/v1/ClusterRoleBinding/system-bootstrap-approve-node-client-csr: no matches for kind \"ClusterRoleBinding\" in version \"rbac.authorization.k8s.io/v1\""}
[talos] 2021/10/28 13:04:22 Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+ {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}

After a timeout (approx. 15 minutes) the cluster is no longer viable:

❯ kubectl get pods -A
Unable to connect to the server: dial tcp 10.5.0.2:6443: connect: no route to host

❯ docker ps                                                      
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Full logs of the container after the restart:

2021/10/28 13:04:11 initialize sequence: 4 phase(s)
2021/10/28 13:04:11 phase logger (1/4): 1 tasks(s)
2021/10/28 13:04:11 task setupLogger (1/1): starting
[talos] 2021/10/28 13:04:11 task setupLogger (1/1): done, 57.759µs
[talos] 2021/10/28 13:04:11 phase logger (1/4): done, 102.179µs
[talos] 2021/10/28 13:04:11 phase systemRequirements (2/4): 1 tasks(s)
[talos] 2021/10/28 13:04:11 task setupSystemDirectory (1/1): starting
[talos] 2021/10/28 13:04:11 task setupSystemDirectory (1/1): done, 28.077µs
[talos] 2021/10/28 13:04:11 phase systemRequirements (2/4): done, 53.778µs
[talos] 2021/10/28 13:04:11 phase etc (3/4): 2 tasks(s)
[talos] 2021/10/28 13:04:11 task createOSReleaseFile (2/2): starting
[talos] 2021/10/28 13:04:11 task CreateSystemCgroups (1/2): starting
[talos] 2021/10/28 13:04:11 task createOSReleaseFile (2/2): done, 244.655µs
[talos] 2021/10/28 13:04:11 node identity established {"component": "controller-runtime", "controller": "cluster.NodeIdentityController", "node_id": "bHodtDZpAnxsaVEl0efGlM0Zped81iZzEHdYnmt8bHgB"}
[talos] 2021/10/28 13:04:11 setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["1.1.1.1", "8.8.8.8"]}
[talos] 2021/10/28 13:04:11 setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["1.1.1.1", "8.8.8.8"]}
[talos] 2021/10/28 13:04:11 task CreateSystemCgroups (1/2): done, 27.2252ms
[talos] 2021/10/28 13:04:11 phase etc (3/4): done, 27.262705ms
[talos] 2021/10/28 13:04:11 phase config (4/4): 1 tasks(s)
[talos] 2021/10/28 13:04:11 task loadConfig (1/1): starting
[talos] task loadConfig (1/1): 2021/10/28 13:04:11 found existing config, but persistence is disabled, downloading config
[talos] 2021/10/28 13:04:11 fetching machine config from: USERDATA environment variable
[talos] task loadConfig (1/1): 2021/10/28 13:04:11 storing config in memory
[talos] 2021/10/28 13:04:11 task loadConfig (1/1): done, 16.449493ms
[talos] 2021/10/28 13:04:11 phase config (4/4): done, 16.491817ms
[talos] 2021/10/28 13:04:11 initialize sequence: done: 43.954758ms
[talos] 2021/10/28 13:04:11 install sequence: 0 phase(s)
[talos] 2021/10/28 13:04:11 install sequence: done: 11.105µs
[talos] 2021/10/28 13:04:11 boot sequence: 9 phase(s)
[talos] 2021/10/28 13:04:11 phase validateConfig (1/9): 1 tasks(s)
[talos] 2021/10/28 13:04:11 task validateConfig (1/1): starting
[talos] 2021/10/28 13:04:11 service[machined](Preparing): Running pre state
[talos] 2021/10/28 13:04:11 service[machined](Preparing): Creating service runner
[talos] 2021/10/28 13:04:11 service[machined](Running): Service started as goroutine
[talos] 2021/10/28 13:04:11 task validateConfig (1/1): done, 144.572µs
[talos] 2021/10/28 13:04:11 phase validateConfig (1/9): done, 199.537µs
[talos] 2021/10/28 13:04:11 phase saveConfig (2/9): 1 tasks(s)
[talos] 2021/10/28 13:04:11 task saveConfig (1/1): starting
[talos] 2021/10/28 13:04:11 setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["8.8.8.8", "1.1.1.1", "2001:4860:4860::8888", "2606:4700:4700::1111"]}
[talos] 2021/10/28 13:04:11 task saveConfig (1/1): done, 6.187819ms
[talos] 2021/10/28 13:04:11 phase saveConfig (2/9): done, 6.230771ms
[talos] 2021/10/28 13:04:11 phase env (3/9): 1 tasks(s)
[talos] 2021/10/28 13:04:11 task setUserEnvVars (1/1): starting
[talos] 2021/10/28 13:04:11 task setUserEnvVars (1/1): done, 17.739µs
[talos] 2021/10/28 13:04:11 phase env (3/9): done, 40.997µs
[talos] 2021/10/28 13:04:11 phase containerd (4/9): 1 tasks(s)
[talos] 2021/10/28 13:04:11 task startContainerd (1/1): starting
[talos] 2021/10/28 13:04:11 service[containerd](Preparing): Running pre state
[talos] 2021/10/28 13:04:11 service[containerd](Preparing): Creating service runner
[talos] 2021/10/28 13:04:11 service[containerd](Running): Process Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"]) started with PID 37
[talos] 2021/10/28 13:04:12 service[containerd](Running): Health check successful
[talos] 2021/10/28 13:04:12 task startContainerd (1/1): done, 1.003111908s
[talos] 2021/10/28 13:04:12 phase containerd (4/9): done, 1.003150181s
[talos] 2021/10/28 13:04:12 phase sharedFilesystems (5/9): 1 tasks(s)
[talos] 2021/10/28 13:04:12 task setupSharedFilesystems (1/1): starting
[talos] 2021/10/28 13:04:12 task setupSharedFilesystems (1/1): done, 55.803µs
[talos] 2021/10/28 13:04:12 phase sharedFilesystems (5/9): done, 88.14µs
[talos] 2021/10/28 13:04:12 phase var (6/9): 1 tasks(s)
[talos] 2021/10/28 13:04:12 task setupVarDirectory (1/1): starting
[talos] 2021/10/28 13:04:12 task setupVarDirectory (1/1): done, 70.4µs
[talos] 2021/10/28 13:04:12 phase var (6/9): done, 100.851µs
[talos] 2021/10/28 13:04:12 phase userSetup (7/9): 1 tasks(s)
[talos] 2021/10/28 13:04:12 task writeUserFiles (1/1): starting
[talos] 2021/10/28 13:04:12 task writeUserFiles (1/1): done, 274.896µs
[talos] 2021/10/28 13:04:12 phase userSetup (7/9): done, 305.696µs
[talos] 2021/10/28 13:04:12 phase startEverything (8/9): 1 tasks(s)
[talos] 2021/10/28 13:04:12 task startAllServices (1/1): starting
[talos] task startAllServices (1/1): 2021/10/28 13:04:12 waiting for 7 services
[talos] 2021/10/28 13:04:12 service[etcd](Waiting): Waiting for service "cri" to be "up", time sync, network
[talos] 2021/10/28 13:04:12 service[apid](Waiting): Waiting for service "containerd" to be "up", api certificates
[talos] 2021/10/28 13:04:12 service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network, nodename
[talos] 2021/10/28 13:04:12 service[cri](Waiting): Waiting for network
[talos] 2021/10/28 13:04:12 service[trustd](Waiting): Waiting for service "containerd" to be "up", time sync, network
[talos] 2021/10/28 13:04:12 service[apid](Preparing): Running pre state
[talos] 2021/10/28 13:04:12 service[cri](Preparing): Running pre state
[talos] 2021/10/28 13:04:12 service[cri](Preparing): Creating service runner
[talos] 2021/10/28 13:04:12 service[trustd](Preparing): Running pre state
[talos] 2021/10/28 13:04:12 service[apid](Failed): Failed to run pre stage: listen unix /system/run/apid/runtime.sock: bind: address already in use
[talos] 2021/10/28 13:04:12 service[trustd](Preparing): Creating service runner
[talos] 2021/10/28 13:04:12 service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 90
[talos] 2021/10/28 13:04:12 service[trustd](Running): Started task trustd (PID 148) for container trustd
[talos] 2021/10/28 13:04:13 service[etcd](Waiting): Waiting for service "cri" to be "up"
[talos] 2021/10/28 13:04:13 service[kubelet](Waiting): Waiting for service "cri" to be "up"
[talos] 2021/10/28 13:04:13 service[cri](Running): Health check successful
[talos] 2021/10/28 13:04:13 service[etcd](Preparing): Running pre state
[talos] 2021/10/28 13:04:13 service[kubelet](Preparing): Running pre state
[talos] 2021/10/28 13:04:13 service[etcd](Preparing): Creating service runner
[talos] 2021/10/28 13:04:13 service[kubelet](Preparing): Creating service runner
[talos] 2021/10/28 13:04:13 service[trustd](Running): Health check successful
[talos] 2021/10/28 13:04:13 service[kubelet](Running): Started task kubelet (PID 319) for container kubelet
[talos] 2021/10/28 13:04:13 cleaning up static pod "/etc/kubernetes/manifests/talos-kube-controller-manager.yaml" {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
[talos] 2021/10/28 13:04:13 cleaning up static pod "/etc/kubernetes/manifests/talos-kube-apiserver.yaml" {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
[talos] 2021/10/28 13:04:13 cleaning up static pod "/etc/kubernetes/manifests/talos-kube-scheduler.yaml" {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
[talos] 2021/10/28 13:04:13 service[etcd](Running): Started task etcd (PID 318) for container etcd
[talos] 2021/10/28 13:04:15 service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
[talos] 2021/10/28 13:04:18 service[etcd](Running): Health check successful
[talos] 2021/10/28 13:04:18 writing static pod "/etc/kubernetes/manifests/talos-kube-apiserver.yaml" {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
[talos] 2021/10/28 13:04:18 writing static pod "/etc/kubernetes/manifests/talos-kube-controller-manager.yaml" {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
[talos] 2021/10/28 13:04:18 writing static pod "/etc/kubernetes/manifests/talos-kube-scheduler.yaml" {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
[talos] 2021/10/28 13:04:18 controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Secret/bootstrap-token-b39i7a: Get \"https://localhost:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"}
[talos] 2021/10/28 13:04:18 controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Secret/bootstrap-token-b39i7a: Get \"https://localhost:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"}
[talos] 2021/10/28 13:04:20 service[kubelet](Running): Health check successful
[talos] 2021/10/28 13:04:22 controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object rbac.authorization.k8s.io/v1/ClusterRoleBinding/system-bootstrap-approve-node-client-csr: no matches for kind \"ClusterRoleBinding\" in version \"rbac.authorization.k8s.io/v1\""}
[talos] 2021/10/28 13:04:22 Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+ {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
[talos] 2021/10/28 13:19:12 task startAllServices (1/1): failed: 1 error occurred:
    * context deadline exceeded

[talos] 2021/10/28 13:19:12 phase startEverything (8/9): failed
[talos] 2021/10/28 13:19:12 boot sequence: failed
[talos] 2021/10/28 13:19:12 service[trustd](Stopping): Sending SIGTERM to task trustd (PID 148, container trustd)
[talos] 2021/10/28 13:19:12 service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 319, container kubelet)
[talos] 2021/10/28 13:19:12 service[etcd](Stopping): Sending SIGTERM to task etcd (PID 318, container etcd)
[talos] 2021/10/28 13:19:12 service[machined](Finished): Service finished successfully
[talos] 2021/10/28 13:19:12 service[etcd](Finished): Service finished successfully
[talos] 2021/10/28 13:19:12 service[trustd](Finished): Service finished successfully
[talos] 2021/10/28 13:19:12 service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[talos] 2021/10/28 13:19:12 service[containerd](Finished): Service finished successfully
[talos] 2021/10/28 13:19:12 service[kubelet](Finished): Service finished successfully
[talos] 2021/10/28 13:19:12 service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[talos] 2021/10/28 13:19:12 service[cri](Finished): Service finished successfully
[talos] 2021/10/28 13:19:12 error running phase 8 in boot sequence: task 1/1: failed, 1 error occurred:
    * context deadline exceeded

[talos] 2021/10/28 13:19:12 controller runtime finished
[talos] 2021/10/28 13:19:12 failed to open meta: file does not exist
[talos] 2021/10/28 13:19:12 rebooting in 10 seconds
[talos] 2021/10/28 13:19:13 rebooting in 9 seconds
[talos] 2021/10/28 13:19:14 rebooting in 8 seconds
[talos] 2021/10/28 13:19:15 rebooting in 7 seconds
[talos] 2021/10/28 13:19:16 rebooting in 6 seconds
[talos] 2021/10/28 13:19:17 rebooting in 5 seconds
[talos] 2021/10/28 13:19:18 rebooting in 4 seconds
[talos] 2021/10/28 13:19:19 rebooting in 3 seconds
[talos] 2021/10/28 13:19:20 rebooting in 2 seconds
[talos] 2021/10/28 13:19:21 rebooting in 1 seconds
[talos] 2021/10/28 13:19:22 rebooting in 0 seconds
[talos] 2021/10/28 13:19:23 killed 21 procs with terminated
[talos] 2021/10/28 13:19:24 waiting for 11 processes to terminate
[talos] 2021/10/28 13:19:25 waiting for 11 processes to terminate
[talos] 2021/10/28 13:19:26 waiting for 11 processes to terminate
[talos] 2021/10/28 13:19:27 waiting for 11 processes to terminate
[talos] 2021/10/28 13:19:28 waiting for 11 processes to terminate
[talos] 2021/10/28 13:19:29 waiting for 9 processes to terminate
[talos] 2021/10/28 13:19:30 waiting for 9 processes to terminate
[talos] 2021/10/28 13:19:31 waiting for 9 processes to terminate
[talos] 2021/10/28 13:19:32 waiting for 9 processes to terminate
[talos] 2021/10/28 13:19:33 killed 9 procs with killed
[talos] 2021/10/28 13:19:34 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:34 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:34 unmounted /etc/cni (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 unmounted /etc/resolv.conf (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 unmounted /etc/hostname (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 unmounted /etc/hosts (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 unmounted /var/lib/containerd (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 unmounted /var/lib/etcd (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:34 unmounted /etc/os-release (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 unmounted /etc/resolv.conf (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 unmounted /etc/hosts (/dev/nvme0n1p3)
[talos] 2021/10/28 13:19:34 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:35 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:35 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:35 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:35 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:36 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:36 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:36 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:36 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:37 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:37 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:37 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:37 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:38 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:38 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:38 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:38 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:39 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:39 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:39 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:39 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:40 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:40 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:40 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:40 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:41 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:41 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:41 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:41 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:42 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:42 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:42 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:42 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:43 failed unmounting /run: device or resource busy
[talos] 2021/10/28 13:19:43 failed unmounting /system: device or resource busy
[talos] 2021/10/28 13:19:43 failed unmounting /var/lib/kubelet: device or resource busy
[talos] 2021/10/28 13:19:43 retrying 3 unmount operations...
[talos] 2021/10/28 13:19:44 waiting for sync...
[talos] 2021/10/28 13:19:44 sync done

It looks as if apid is failing to start because the socket wasn't cleaned up on stop.

[talos] 2021/10/28 13:04:12 service[apid](Failed): Failed to run pre stage: listen unix /system/run/apid/runtime.sock: bind: address already in use

Environment

smira commented 3 years ago

I believe the problem is that with Docker the /system volume survives a "reboot" (container restart), so apid can't bind anymore.
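
A quick way to confirm this (assuming the same container name as above; the SYSTEM_DIR variable is only for illustration) is to look at the host path backing the /system mount after the stop/start and check whether the apid sockets are still there:

# SYSTEM_DIR=$(docker inspect talos-docker-cluster-restart-test-master-1 --format '{{ range .Mounts }}{{ if eq .Destination "/system" }}{{ .Source }}{{ end }}{{ end }}')
# ls "$SYSTEM_DIR"/run/apid

If the volume really persisted across the restart, apid.sock and runtime.sock should still be listed, which would explain the "bind: address already in use" failure.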

TimJones commented 3 years ago

For those who might need it before the fix is released, a workaround is:

# rm $(docker inspect talos-docker-cluster-restart-test-master-1 --format '{{ range .Mounts }}{{ if eq .Destination "/system" }}{{ .Source }}{{ end }}{{ end }}')/run/apid/{apid,runtime}.sock

Remember to replace the talos-docker-cluster-restart-test-master-1 container name with the name of the Talos container(s) you want to restart.
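
Putting it together, a full restart with the workaround applied would look roughly like this (same container name as above; the rm is run as root since the Docker volume directory is typically root-owned):

❯ docker stop talos-docker-cluster-restart-test-master-1
# rm "$(docker inspect talos-docker-cluster-restart-test-master-1 --format '{{ range .Mounts }}{{ if eq .Destination "/system" }}{{ .Source }}{{ end }}{{ end }}')"/run/apid/{apid,runtime}.sock
❯ docker start talos-docker-cluster-restart-test-master-1
❯ docker logs talos-docker-cluster-restart-test-master-1 --tail 5

After a successful start, the logs should eventually end with a "boot sequence: done" line instead of the apid bind failure.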