k0sproject / k0s

k0s - The Zero Friction Kubernetes
https://docs.k0sproject.io

worker fails to start after reboot #523

Closed · matti closed this issue 3 years ago

matti commented 3 years ago

Version

v0.8.1

Platform

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:    20.04
Codename:   focal

What happened?

Worker started and shows up as a node on the master. Then I added a crontab entry:

@reboot screen -dmS k0s k0s worker --token-file /root/join-token

and then rebooted and it's not coming up anymore, see https://gist.githubusercontent.com/matti/f24e0f0080298e79d7c2c9e4500b5a89/raw/ae42398487051ccb7c3bd48c1ca5f153c6545cea/k0s-worker.txt

jnummelin commented 3 years ago

the "root cause" seems to be

grpc: addrConn.createTransport failed to connect to {/var/lib/k0s/run/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/k0s/run/containerd.sock: connect: no such file or directory". Reconnecting...  component=kubelet

If you look at the processes, I'd guess that containerd is not really running?

Before kubelet gets into a crash loop, do you see anything in the logs saying why containerd cannot start?

matti commented 3 years ago

nope, tried to paste as much log as possible

kstych commented 3 years ago

Hi. Same here on Fedora 33: node not coming up after reboot (/var/lib/k0s/run/containerd.sock: no such file; log attached from start to stop) k0s.txt

paveq commented 3 years ago

Having the same issue on a Raspberry Pi 4. When creating a new cluster it works, at least for a while. After scheduling some pods, it started crashing like this.

Feels like it could also be a race condition: the RPi 4 (without much cooling, and running the master along with the worker) can get a bit slow.

jnummelin commented 3 years ago

In the logs from @kstych I see the following:

time="2020-12-19 15:53:08" level=info msg="W1219 15:53:08.863977    2860 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/k0s/run/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /var/lib/k0s/run/containerd.sock: connect: no such file or directory\". Reconnecting..." component=kubelet
time="2020-12-19 15:53:09" level=info msg="Shutting down pid 2860" component=kubelet
time="2020-12-19 15:53:09" level=info msg="Shutting down pid 2718" component=containerd
time="2020-12-19 15:53:10" level=info msg="E1219 15:53:10.979982    2662 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request" component=kube-controller-manager
time="2020-12-19 15:53:12" level=info msg="W1219 15:53:12.213435    2662 garbagecollector.go:642] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]" component=kube-controller-manager
time="2020-12-19 15:53:14" level=info msg="Shutting down pid 2718" component=containerd

And naturally after containerd is down and kubelet thus busted, it's a quick slide downhill.

I do not see containerd logging anything useful as to the reason why it's shutting down.

trawler commented 3 years ago

> Hi. Same here on Fedora 33: node not coming up after reboot (/var/lib/k0s/run/containerd.sock: no such file; log attached from start to stop) k0s.txt

Hi @kstych. Can you verify if /var/lib/k0s/run/containerd.sock exists? Alternatively, can you see if the following flags provide more output that we can use for debugging?

k0s worker --token-file <file> --debug --logging containerd=debug
kstych commented 3 years ago

Hi @trawler, sure, please find it attached here. I'm running a single-node cluster now, same result. The install works fine the first time; everything comes up in about 2 min.

After stopping k0s (CTRL+C), rebooting, and running the same command, the node becomes NotReady (unreachable).

Command: k0s server -c ${HOME}/.k0s/k0s.yaml --enable-worker --debug --logging containerd=debug. Also, the file does not exist, as in the error: /var/lib/k0s/run/containerd.sock

k0s.txt

jnummelin commented 3 years ago

@kstych Are the logs only from the time after the reboot?

Based on the logs, it seems like containerd itself _might_ be up-and-running, just with the socket missing. I cannot find anything in the containerd log entries that hints at it being in any way broken. 🤔

So what would be interesting to see is:
- does `/var/lib/k0s/run/containerd.sock` really exist or not?
- is containerd really listening on it? (see the quick check sketch below)
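
For example, a check along these lines (a sketch, assuming the default /var/lib/k0s data dir and that ss is available) should show whether the socket file exists and whether containerd actually holds a listener on it:

# does the socket file exist?
ls -l /var/lib/k0s/run/containerd.sock
# is any process (ideally containerd) listening on it?
ss -xlpn | grep containerd.sock
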
kstych commented 3 years ago

Hi @jnummelin, yes, I can see the containerd process running but there is no sock file.

this is the only matching file across the filesystem

[root@k8s /]# find . | grep containerd.sock
./var/lib/k0s/run/containerd.sock.ttrpc

after cleaning up everything

cd /var/lib ; rm -rf calico cni k0s kubelet 
cd ~ ; rm -rf .k0s .kube

and re-running the single-node command, now there is a containerd.sock

[root@k8s ~]# cd /
[root@k8s /]# find . | grep containerd.sock
./var/lib/k0s/run/containerd.sock.ttrpc
./var/lib/k0s/run/containerd.sock

wait for the pods

[root@k8s /]# kubectl get node,pods -A
NAME                  STATUS   ROLES    AGE     VERSION
node/k8s.kstych.com   Ready    <none>   2m17s   v1.19.4

NAMESPACE     NAME                                           READY   STATUS    RESTARTS   AGE
kube-system   pod/calico-kube-controllers-5f6546844f-ttnfz   1/1     Running   0          2m40s
kube-system   pod/calico-node-4w2cm                          1/1     Running   0          69s
kube-system   pod/coredns-5c98d7d4d8-5tp4d                   1/1     Running   0          2m46s
kube-system   pod/konnectivity-agent-dwbc5                   1/1     Running   0          2m12s
kube-system   pod/kube-proxy-ptvtj                           1/1     Running   0          2m17s

then reboot, run the same command

after a while the node is NotReady and there is no containerd.sock

[root@k8s /]# kubectl get node,pods -A
NAME                  STATUS     ROLES    AGE     VERSION
node/k8s.kstych.com   NotReady   <none>   7m32s   v1.19.4

NAMESPACE     NAME                                           READY   STATUS    RESTARTS   AGE
kube-system   pod/calico-kube-controllers-5f6546844f-ttnfz   1/1     Running   0          7m55s
kube-system   pod/calico-node-4w2cm                          1/1     Running   0          6m24s
kube-system   pod/coredns-5c98d7d4d8-5tp4d                   1/1     Running   0          8m1s
kube-system   pod/konnectivity-agent-dwbc5                   1/1     Running   0          7m27s
kube-system   pod/kube-proxy-ptvtj                           1/1     Running   0          7m32s
[root@k8s /]# find . | grep containerd.sock
./var/lib/k0s/run/containerd.sock.ttrpc
jnummelin commented 3 years ago

@kstych do you have k0s starting as a systemd unit or something? Could you also check a couple more things?

This is really puzzling: why is it able to create and listen on the ttrpc sock but not the normal one? 🤔 And why does this manifest only after reboot? Is your /var/lib path somehow mounted differently on/during boot?

I wonder if it could be something like SELinux, AppArmor or the like that's preventing containerd from creating the unix socket?
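
A quick way to rule that out on the host (a sketch; getenforce ships with the SELinux tools and aa-status with AppArmor, so one or both may not be installed on a given distro):

# SELinux mode: Enforcing / Permissive / Disabled
getenforce
# AppArmor: lists loaded profiles if the service is active
aa-status
# look for denial messages around the time containerd tries to create the socket
dmesg | grep -iE 'denied|apparmor|avc'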

jnummelin commented 3 years ago

(pushed wrong button) 🤦

kstych commented 3 years ago

Hi @jnummelin, SELinux is off, the firewall is off, and there is a single ext4 / partition. I am running the command in a screen session each time, as root.

Post-reboot netstat is attached (no sock file after reboot); the commands are the same.

Also, after reboot CTRL+C does not stop the command (the first time it does, but it leaves other processes running); after reboot, pressing CTRL+C just keeps going like this:

INFO[2020-12-22 18:46:06] Shutting down pid 1655                        component=containerd
INFO[2020-12-22 18:46:11] Shutting down pid 1655                        component=containerd
INFO[2020-12-22 18:46:16] Shutting down pid 1655                        component=containerd
INFO[2020-12-22 18:46:21] Shutting down pid 1655                        component=containerd
INFO[2020-12-22 18:46:26] Shutting down pid 1655                        component=containerd
^CINFO[2020-12-22 18:46:31] Shutting down pid 1655                        component=containerd
[root@k8s ~]# cat /proc/<first-run-containerd-pid>/cmdline 
/var/lib/k0s/bin/containerd--root=/var/lib/k0s/containerd--state=/var/lib/k0s/run/containerd--address=/var/lib/k0s/run/containerd.sock--log-level=info--config=/etc/k0s/containerd.toml

[root@k8s /]# cat /proc/<reboot-containerd-pid>/cmdline 
/var/lib/k0s/bin/containerd--root=/var/lib/k0s/containerd--state=/var/lib/k0s/run/containerd--address=/var/lib/k0s/run/containerd.sock--log-level=info--config=/etc/k0s/containerd.toml

mount

[root@k8s /]# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,noexec,size=8138164k,nr_inodes=2034541,mode=755,inode64)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=3263456k,nr_inodes=819200,mode=755,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
none on /sys/kernel/tracing type tracefs (rw,relatime)
/dev/sda3 on / type ext4 (rw,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=15686)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=8158636k,nr_inodes=409600,inode64)
/dev/sda2 on /boot type ext4 (rw,relatime)
/dev/sda1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=1631724k,nr_inodes=407931,mode=700,uid=1000,gid=1000,inode64)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)

netstat.txt

usrbinkat commented 3 years ago

Confirming both of my test cases are exhibiting the same behavior on 0.9.1 rc1.

Manually killing/replacing the containerd process with the log level set to debug:

root@ubuntu:~# /var/lib/k0s/bin/containerd --root=/var/lib/k0s/containerd --state=/var/lib/k0s/run/containerd --address=/var/lib/k0s/run/containerd.sock --log-level=debug --config=/etc/k0s/containerd.toml                                                                              
INFO[2020-12-23T06:17:31.282915843Z] starting containerd                           revision=269548fa27e0089a8b8278fc4fc781d7f65a939b version=v1.4.3                                                                                                                                       
INFO[2020-12-23T06:17:31.316922600Z] loading plugin "io.containerd.content.v1.content"...  type=io.containerd.content.v1
INFO[2020-12-23T06:17:31.316991706Z] loading plugin "io.containerd.snapshotter.v1.aufs"...  type=io.containerd.snapshotter.v1
INFO[2020-12-23T06:17:31.321878156Z] loading plugin "io.containerd.snapshotter.v1.btrfs"...  type=io.containerd.snapshotter.v1
INFO[2020-12-23T06:17:31.322265753Z] skip loading plugin "io.containerd.snapshotter.v1.btrfs"...  error="path /var/lib/k0s/containerd/io.containerd.snapshotter.v1.btrfs (xfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2020-12-23T06:17:31.322304350Z] loading plugin "io.containerd.snapshotter.v1.devmapper"...  type=io.containerd.snapshotter.v1
WARN[2020-12-23T06:17:31.322328449Z] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="devmapper not configured"
INFO[2020-12-23T06:17:31.322342183Z] loading plugin "io.containerd.snapshotter.v1.native"...  type=io.containerd.snapshotter.v1
INFO[2020-12-23T06:17:31.322366042Z] loading plugin "io.containerd.snapshotter.v1.overlayfs"...  type=io.containerd.snapshotter.v1
INFO[2020-12-23T06:17:31.322445232Z] loading plugin "io.containerd.snapshotter.v1.zfs"...  type=io.containerd.snapshotter.v1
INFO[2020-12-23T06:17:31.322651908Z] skip loading plugin "io.containerd.snapshotter.v1.zfs"...  error="path /var/lib/k0s/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1         
INFO[2020-12-23T06:17:31.322670913Z] loading plugin "io.containerd.metadata.v1.bolt"...  type=io.containerd.metadata.v1
WARN[2020-12-23T06:17:31.322689279Z] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
INFO[2020-12-23T06:17:31.322697789Z] metadata content store policy set             policy=shared
usrbinkat commented 3 years ago

Okay, maybe konnectivity-server is throwing us for a loop?

level=info msg="Error: failed to run the master server: failed to get uds listener: failed to listen(unix) name /var/lib/k0s/run/konnectivity-server/konnectivity-server.sock: listen unix /var/lib/k0s/run/konnectivity-server/konnectivity-server.sock: bind: address already in use" component=konnectivity

So this is interesting because konnectivity is attempting to listen on these ports:

Server port set to 0." component=konnectivity
Agent port set to 8132." component=konnectivity
Admin port set to 8133." component=konnectivity
Health port set to 8092." component=konnectivity 

but the host reports:

root@ubuntu:~# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN      2753/etcd
tcp        0      0 192.168.1.124:2380      0.0.0.0:*               LISTEN      2753/etcd
tcp        0      0 127.0.0.1:10257         0.0.0.0:*               LISTEN      2775/kube-controlle
tcp        0      0 127.0.0.1:10259         0.0.0.0:*               LISTEN      2774/kube-scheduler
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      890/sshd: /usr/sbin
tcp6       0      0 :::10251                :::*                    LISTEN      2774/kube-scheduler
tcp6       0      0 :::6443                 :::*                    LISTEN      2772/kube-apiserver
tcp6       0      0 :::10252                :::*                    LISTEN      2775/kube-controlle
tcp6       0      0 :::22                   :::*                    LISTEN      890/sshd: /usr/sbin
tcp6       0      0 :::9443                 :::*                    LISTEN      2777/k0s
udp        0      0 192.168.1.124:68        0.0.0.0:*                           711/systemd-network

…and digging further, it appears the missing containerd.sock is our higher-level issue:

Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.689717   10500 container_manager_linux.go:279] Creating Container Manager object based on Node Config: {RuntimeCgroupsName:/system.slice/containerd.service SystemCgroupsName: KubeletCgroupsName:/system.slice/containerd.service ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/k0s/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName:system.slice SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.15} GracePeriod:0s MinReclaim:<nil>} {Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.690172   10500 topology_manager.go:120] [topologymanager] Creating topology manager with none policy per container scope" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.690193   10500 container_manager_linux.go:310] [topologymanager] Initializing Topology Manager with none policy and container-level scope" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.690200   10500 container_manager_linux.go:315] Creating device plugin manager: true" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.690340   10500 remote_runtime.go:62] parsed scheme: \"\"" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.690351   10500 remote_runtime.go:62] scheme \"\" not registered, fallback to default scheme" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.691045   10500 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/k0s/run/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.691062   10500 clientconn.go:948] ClientConn switching balancer to \"pick_first\"" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.691141   10500 remote_image.go:50] parsed scheme: \"\"" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.691155   10500 remote_image.go:50] scheme \"\" not registered, fallback to default scheme" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.691166   10500 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/k0s/run/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.691170   10500 clientconn.go:948] ClientConn switching balancer to \"pick_first\"" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="I1223 06:54:49.691196   10500 kubelet.go:273] Watching apiserver" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="W1223 06:54:49.691676   10500 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/k0s/run/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /var/lib/k0s/run/containerd.sock: connect: no such file or directory\". Reconnecting..." component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="E1223 06:54:49.692689   10500 remote_runtime.go:86] Version from runtime service failed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/lib/k0s/run/containerd.sock: connect: no such file or directory\"" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="W1223 06:54:49.692751   10500 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/k0s/run/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /var/lib/k0s/run/containerd.sock: connect: no such file or directory\". Reconnecting..." component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="E1223 06:54:49.692880   10500 kuberuntime_manager.go:202] Get runtime version failed: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/lib/k0s/run/containerd.sock: connect: no such file or directory\"" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="F1223 06:54:49.693011   10500 server.go:269] failed to run Kubelet: failed to create kubelet: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/lib/k0s/run/containerd.sock: connect: no such file or directory\"" component=kubelet
Dec 23 06:54:49 ubuntu k0s[2741]: time="2020-12-23 06:54:49" level=info msg="goroutine 1 [running]:" component=kubelet
jnummelin commented 3 years ago

The missing containerd.sock here definitely seems to be the top-level culprit. I wonder if a reboot makes k0s/containerd go down "too hard" and thus something (maybe the socket file itself) is left lingering. That's what the bind: address already in use error from konnectivity kinda hints at.

One possible workaround to try is to remove everything under /var/lib/k0s/run after reboot and before k0s is started.
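
For instance, something along these lines before starting the worker again (a sketch; it assumes k0s, containerd and kubelet are fully stopped first and reuses the token-file path from the original report):

# only when no k0s/containerd/kubelet processes are running anymore
rm -rf /var/lib/k0s/run
k0s worker --token-file /root/join-token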

kstych commented 3 years ago

Hi @jnummelin, in fact I just tried that and was going to report that it works (after reboot, delete the /var/lib/k0s/run folder and then restart).

I just wanted to know: is it safe? No useful files to keep here?

jnummelin commented 3 years ago

It is safe, if you do it when k0s and related processes are not running. There are only socket files, pid files, and containerd state, which is "ephemeral" and can be deleted on reboot.

jnummelin commented 3 years ago

Of course we need to come up with a proper solution for this.

Also, I'm seriously thinking this is also a "bug" on the containerd side. It's kind of unexpected that it gets up-and-running but fails to listen on the configured socket, and nothing in the logs says it's not operational.

usrbinkat commented 3 years ago

Added the following to my systemd unit for when I test again later on:

ExecStartPre=-/usr/bin/rm -rf /var/lib/k0s/run

Definitely not a pretty solution, as stopping/starting/restarting should not have that effect in non-host-reboot scenarios.
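
For context, a minimal unit carrying that workaround might look roughly like this (a sketch; the unit name, binary path and token location are assumptions, not something k0s generates):

[Unit]
Description=k0s worker
Wants=network-online.target
After=network-online.target

[Service]
# clear ephemeral runtime state (sockets, pid files, containerd state) left over from the previous boot
ExecStartPre=-/usr/bin/rm -rf /var/lib/k0s/run
ExecStart=/usr/local/bin/k0s worker --token-file /root/join-token
Restart=always

[Install]
WantedBy=multi-user.target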

usrbinkat commented 3 years ago

Do we have a quick read on the technical details of swapping in BYO CRI-O instead of containerd, so I can compare their behaviors?

I have crio.sock at /run/crio/crio.sock

I guess I found the k0s worker --cri-socket flag, but I'm currently testing the all-in-one-node k0s server --enable-worker method.

trawler commented 3 years ago

not cri-o, but docker: https://docs.k0sproject.io/v0.9.0/custom-cri-runtime/

usrbinkat commented 3 years ago

Yep, gotcha.

I just tracked down the supported flags.

Here we can set the CRI socket on the CLI via k0s worker --cri-socket remote:unix:///run/crio/crio.sock, but the k0s server subcommand does not support that flag.

Sad day, I'm not situated to test a supported topology right now.

trawler commented 3 years ago

> But sub command k0s server does not support that flag
>
> Sad day, I'm not situated to test a supported topology right now.

That's a valid point. I opened #579 to track this feature request.

jasmingacic commented 3 years ago

#579 has been closed and PR #592 has been merged.

--cri-socket is supported now.

jnummelin commented 3 years ago

#591 fixes containerd to use /run as the state dir, which fixes this (since /run is a tmpfs, as seen in the mount output above, it is cleared on every boot, so no stale socket or pid files survive a reboot).