Closed: mzac closed this issue 5 years ago.
I'm also seeing the master crash-looping on Raspbian with `containerd: exit status 2` after wiping and installing 0.9.0. Seems to be a regression specific to 0.9.0. Have you tried 0.8.1? After wiping (again) and installing 0.8.1 I had no problems.
I think the failed containerd process is:
root 1387 74.0 1.5 887012 63428 ? Sl 15:45 0:00 containerd -c /mnt/system/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /mnt/system/k3s/agent/containerd
I'm only seeing crash loops on the master node. The agent nodes meanwhile run fine, except for logging the disconnect when the master crashes.
I'm also seeing similar errors as @mzac and @nickbp. Definitely seems to be a regression in 0.9.0.
Same error on a fresh install of 0.9.0 on Raspbian
I upgraded in-place from 0.8.1 last night and had the same issue, downgraded back to 0.8.1 and everything is groovy again. No need to wipe if you upgraded and have issues, just use the install options to go back to 0.8.1 for now.
This is probably related to #60 and #750
Ran the server standalone to reproduce the error. One of those tickets mentioned a separate containerd log so I wanted to take a look.
Steps were:
1. Pick a `DATA_DIR` location, in this case I used `/mnt/system/k3s9`
2. Download the v0.9.0 binary and place it at `$DATA_DIR/bin/k3s`: `curl -L -o k3s https://github.com/rancher/k3s/releases/download/v0.9.0/k3s-armhf`
3. Start the server (via a `run.sh` wrapper): `K3S_NODE_NAME=k3s9 $DATA_DIR/bin/k3s server --data-dir $DATA_DIR`, and observe that the server process exits with `containerd: exit status 2`
4. Open `$DATA_DIR/agent/containerd/containerd.log` and find the following:
time="2019-09-22T08:15:58.843010068+12:00" level=info msg="ImageCreate event &ImageCreate{Name:docker.io/coredns/coredns:1.6.3,Labels:map[string]string{io.cri-containerd.image: managed,},}"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1 pc=0x24af7e0]
goroutine 530 [running]:
github.com/rancher/k3s/vendor/github.com/containerd/cgroups.(*memoryController).Stat(0x64caa20, 0x6484f80, 0x7c, 0x6550660, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/containerd/cgroups/memory.go:162 +0x3a0
github.com/rancher/k3s/vendor/github.com/containerd/cgroups.(*cgroup).Stat.func1(0x64feb14, 0xa6c286d8, 0x64caa20, 0x6484f80, 0x7c, 0x6550660, 0x6967a40, 0x5e69580)
	/go/src/github.com/rancher/k3s/vendor/github.com/containerd/cgroups/cgroup.go:277 +0x60
created by github.com/rancher/k3s/vendor/github.com/containerd/cgroups.(*cgroup).Stat
	/go/src/github.com/rancher/k3s/vendor/github.com/containerd/cgroups/cgroup.go:275 +0x2ec
Attached logs:
- k3s server process with containerd error: [k3s-0.9.0.log](https://github.com/rancher/k3s/files/3639054/k3s-0.9.0.log)
- containerd process with segfault on cgroup Stat call: [containerd-0.9.0.log](https://github.com/rancher/k3s/files/3639057/containerd-0.9.0.log)
System info:
root@pi-04:~# uname -a
Linux pi-04 4.19.73-v7l+ #1267 SMP Fri Sep 20 18:12:09 BST 2019 armv7l GNU/Linux
root@pi-04:~# cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
Edit to add:
root@pi-04:~# lsmod
Module Size Used by
xt_multiport 16384 1
veth 24576 0
nf_conntrack_netlink 40960 0
xt_nat 16384 10
ipt_REJECT 16384 0
nf_reject_ipv4 16384 1 ipt_REJECT
xt_tcpudp 16384 38
xt_addrtype 16384 3
nft_chain_nat_ipv6 16384 4
nf_nat_ipv6 20480 1 nft_chain_nat_ipv6
xt_conntrack 16384 6
nf_tables 122880 1 nft_chain_nat_ipv6
nfnetlink 16384 2 nf_conntrack_netlink,nf_tables
ipt_MASQUERADE 16384 6
xt_comment 16384 65
iptable_filter 16384 1
xt_mark 16384 7
iptable_nat 16384 2
nf_nat_ipv4 16384 2 ipt_MASQUERADE,iptable_nat
nf_nat 36864 3 xt_nat,nf_nat_ipv6,nf_nat_ipv4
vxlan 49152 0
ip6_udp_tunnel 16384 1 vxlan
udp_tunnel 16384 1 vxlan
overlay 106496 7
ip_vs_wrr 16384 0
ip_vs_sh 16384 0
ip_vs_rr 16384 0
ip_vs 143360 6 ip_vs_wrr,ip_vs_rr,ip_vs_sh
nf_conntrack 135168 8 ip_vs,xt_nat,ipt_MASQUERADE,nf_conntrack_netlink,nf_nat_ipv6,xt_conntrack,nf_nat_ipv4,nf_nat
nf_defrag_ipv6 20480 1 nf_conntrack
nf_defrag_ipv4 16384 1 nf_conntrack
sha256_generic 20480 0
cfg80211 614400 0
rfkill 28672 2 cfg80211
8021q 32768 0
garp 16384 1 8021q
br_netfilter 24576 0
bridge 135168 1 br_netfilter
stp 16384 2 garp,bridge
llc 16384 3 garp,bridge,stp
btrfs 1294336 2
xor 16384 1 btrfs
xor_neon 16384 1 xor
zstd_decompress 73728 1 btrfs
zstd_compress 188416 1 btrfs
xxhash 20480 2 zstd_compress,zstd_decompress
lzo_compress 16384 1 btrfs
raid6_pq 110592 1 btrfs
zlib_deflate 28672 1 btrfs
sg 28672 0
bcm2835_codec 36864 0
bcm2835_v4l2 45056 0
v4l2_mem2mem 24576 1 bcm2835_codec
bcm2835_mmal_vchiq 32768 2 bcm2835_codec,bcm2835_v4l2
v4l2_common 16384 1 bcm2835_v4l2
videobuf2_dma_contig 20480 1 bcm2835_codec
videobuf2_vmalloc 16384 1 bcm2835_v4l2
videobuf2_memops 16384 2 videobuf2_dma_contig,videobuf2_vmalloc
videobuf2_v4l2 24576 3 bcm2835_codec,bcm2835_v4l2,v4l2_mem2mem
raspberrypi_hwmon 16384 0
videobuf2_common 45056 4 bcm2835_codec,bcm2835_v4l2,v4l2_mem2mem,videobuf2_v4l2
hwmon 16384 1 raspberrypi_hwmon
videodev 200704 6 bcm2835_codec,v4l2_common,videobuf2_common,bcm2835_v4l2,v4l2_mem2mem,videobuf2_v4l2
media 36864 3 bcm2835_codec,videodev,v4l2_mem2mem
vc_sm_cma 36864 1 bcm2835_mmal_vchiq
rpivid_mem 16384 0
uio_pdrv_genirq 16384 0
uio 20480 1 uio_pdrv_genirq
fixed 16384 0
ip_tables 24576 2 iptable_filter,iptable_nat
x_tables 32768 11 xt_comment,xt_multiport,ipt_REJECT,xt_nat,ip_tables,iptable_filter,xt_mark,xt_tcpudp,ipt_MASQUERADE,xt_addrtype,xt_conntrack
ipv6 450560 76 nf_nat_ipv6,bridge
K3s build info:
- Downloaded from `https://github.com/rancher/k3s/releases/download/v0.9.0/k3s-armhf`
- `k3s --version` = `k3s-0.9.0 version v0.9.0 (65d87648)`
- `sha1sum` = `3f2b03c8a82bf72cb8884e796616c8cdd5fb72ef`
For comparison here is the containerd.log from a successful k3s server startup on 0.8.1: containerd-0.8.1.log
The error doesn't seem to be specific to launching coredns, as might be implied by the above containerd-0.9.0.log. As an experiment I tried deleting `server/manifests/coredns.yaml` to see what would happen. The same failure occurred again, just at a different time:
time="2019-09-22T10:24:54.752729883+12:00" level=info msg=serving... address=/run/k3s/containerd/containerd.sock
time="2019-09-22T10:24:54.752889509+12:00" level=info msg="containerd successfully booted in 0.093370s"
time="2019-09-22T10:24:54.794766405+12:00" level=warning msg="The image docker.io/coredns/coredns:1.6.3 is not unpacked."
time="2019-09-22T10:24:54.797485818+12:00" level=warning msg="The image docker.io/rancher/klipper-helm:v0.1.5 is not unpacked."
time="2019-09-22T10:24:54.811645964+12:00" level=info msg="Start event monitor"
time="2019-09-22T10:24:54.811795201+12:00" level=info msg="Start snapshots syncer"
time="2019-09-22T10:24:54.811857292+12:00" level=info msg="Start streaming server"
time="2019-09-22T10:24:55.523254023+12:00" level=info msg="No cni config template is specified, wait for other system components to drop the config."
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1 pc=0x24af7e0]
goroutine 81 [running]:
github.com/rancher/k3s/vendor/github.com/containerd/cgroups.(*memoryController).Stat(0x7ee0e48, 0x7495300, 0x7d, 0x7848140, 0x0, 0x0)
/go/src/github.com/rancher/k3s/vendor/github.com/containerd/cgroups/memory.go:162 +0x3a0
github.com/rancher/k3s/vendor/github.com/containerd/cgroups.(*cgroup).Stat.func1(0x7cb4470, 0xa6ac7f48, 0x7ee0e48, 0x7495300, 0x7d, 0x7848140, 0x78453d0, 0x7894040)
/go/src/github.com/rancher/k3s/vendor/github.com/containerd/cgroups/cgroup.go:277 +0x60
created by github.com/rancher/k3s/vendor/github.com/containerd/cgroups.(*cgroup).Stat
/go/src/github.com/rancher/k3s/vendor/github.com/containerd/cgroups/cgroup.go:275 +0x2ec
Again, it's worth pointing out that k3s 0.9.0 agents do not seem to have any problem; the issue appears to be specific to the k3s 0.9.0 server.
I dropped in the 0.9.0-rc2 release of the k3s-armhf binary, then repeated the above steps to run a k3s server. The segfault went away. So it looks like the regression was introduced sometime between rc2 and GA: https://github.com/rancher/k3s/compare/v0.9.0-rc2...v0.9.0
k3s-0.9.0-rc2 version v0.9.0-rc2 (4a5360ea)
k3s-0.9.0 version v0.9.0 (65d87648)
@erikwilson @ibuildthecloud PTAL
Be advised that on Raspbian, when (still) using kernel version 4.19.42-v7, there seems to be no issue. I have tried with 4.19.66 and 4.19.73 (current latest) and k3s will not run.
It seems to be a kernel thing with arm32v7: on my armbian 4.19.62-sunxi k3s also fails. My arm64v8 nodes have no issues with latest kernels.
Edit: hm, not completely sure of this, it seems to work on and off. Have reverted to 0.8.1 to keep working env.
Users of https://k3sup.dev now have 0.8.0 as the pinned, default version. Try it out today whilst awaiting this fix.
Thanks for the bug report, we'll fix this straight away and do a v0.9.1.
@ibuildthecloud Awesome thanks! I was trying to build my first k3s cluster when I ran into this bug... looking forward to trying it out :)
I had the same problem and thought it was something I was doing wrong. Then I came across this thread. If you don't want to wait for v0.9.1 and don't want to use k3sup, you can use the following command to install v0.8.1:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v0.8.1 sh -
You can also use https://k3sup.dev which automates the process
Uh, there was me thinking I had messed up when upgrading 🤦♂️
I am not sure exactly why, but it seems that downgrading to grpc 1.13.0 solves the issue. Please give k3s v0.9.1 a try @mzac and let me know if there are still problems.
also adding some panic logs in case they are helpful in the future: arm-panic.log
works fine for me
I've updated to 0.9.1 on the system described earlier and the segfault no longer occurs. I've been able to deploy everything successfully on 0.9.1.
Edit: Forgot to mention: Thanks for finding a workaround!
This is resolved for me as well now, thanks!
We'll be closing this issue now; the v0.9.1 release should resolve the problem. I'm happy this release fixes the issue :)
@erikwilson Thanks this seems to fix the issue!
I am still having this issue; tried with 0.8.1 and 0.9.1, no luck.
[INFO] Using v0.9.1 as release
[INFO] Downloading hash https://github.com/rancher/k3s/releases/download/v0.9.1/sha256sum-arm.txt
[INFO] Downloading binary https://github.com/rancher/k3s/releases/download/v0.9.1/k3s-armhf
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
Job for k3s.service failed because the control process exited with error code.
See "systemctl status k3s.service" and "journalctl -xe" for details.
pi@RPI4-1:~/k3s $ systemctl status k3s.service
● k3s.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: activating (start) since Sat 2019-10-05 15:16:47 PDT; 2s ago
Docs: https://k3s.io
Process: 13507 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 13509 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 13511 (k3s-server)
Tasks: 15
Memory: 70.1M
CGroup: /system.slice/k3s.service
└─13511 /usr/local/bin/k3s server KillMode=process
Oct 05 15:16:47 RPI4-1 k3s[13511]: E1005 15:16:47.895207 13511 prometheus.go:203] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
Oct 05 15:16:47 RPI4-1 k3s[13511]: E1005 15:16:47.895403 13511 prometheus.go:216] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
Oct 05 15:16:47 RPI4-1 k3s[13511]: I1005 15:16:47.895595 13511 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesBy
Oct 05 15:16:47 RPI4-1 k3s[13511]: I1005 15:16:47.895661 13511 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,Validatin
Oct 05 15:16:47 RPI4-1 k3s[13511]: I1005 15:16:47.941334 13511 master.go:233] Using reconciler: lease
Oct 05 15:16:48 RPI4-1 k3s[13511]: W1005 15:16:48.705381 13511 genericapiserver.go:351] Skipping API batch/v2alpha1 because it has no resources.
Oct 05 15:16:48 RPI4-1 k3s[13511]: W1005 15:16:48.732925 13511 genericapiserver.go:351] Skipping API node.k8s.io/v1alpha1 because it has no resources.
Oct 05 15:16:48 RPI4-1 k3s[13511]: W1005 15:16:48.748523 13511 genericapiserver.go:351] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
Oct 05 15:16:48 RPI4-1 k3s[13511]: W1005 15:16:48.752262 13511 genericapiserver.go:351] Skipping API scheduling.k8s.io/v1alpha1 because it has no resources.
Oct 05 15:16:48 RPI4-1 k3s[13511]: W1005 15:16:48.761300 13511 genericapiserver.go:351] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
Reference https://github.com/rancher/k3s/issues/869 (new issue created) for the last comment ^
Describe the bug: I am trying to set up a k3s cluster with 3 Raspberry Pis and no matter what I try I keep getting an error with containerd during the k3s setup. I have followed all the instructions:
...but nothing seems to work!
To Reproduce: Install k3s from scratch; it keeps failing.