k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

K3S agent starting on Google Coral crashes the host kernel due to nf_conntrack_netlink kernel module #9967

Closed JOUNAIDSoufiane closed 7 months ago

JOUNAIDSoufiane commented 7 months ago

Environmental Info:

K3s Version:

k3s version v1.23.17+k3s1 (abb8d7d4)
go version go1.19.6

Node(s) CPU architecture, OS, and Version:

Linux 8fb60a5 4.14.98-imx #1 SMP PREEMPT Wed Feb 15 18:40:35 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration:

1 server, 2 enrolled agents, and the google coral dev board agent which is encountering a variety of issues using Calico CNI 3.24.1

Describe the bug:

Starting the K3s agent on the Google Coral Dev Board crashes the host kernel when the nf_conntrack_netlink kernel module is loaded. The kernel does not crash when the agent is started without that module, but Calico is then unable to initialize on the agent.

Expected behavior:

The issue could be related to Calico but I have observed the same issue with flannel when the nf_conntrack_netlink kernel module is loaded. I am hoping for either a resolution of the issue or a justification for why the kernel is not compatible in this case. I included information about how we load the kernel modules as well which could very much be another point of potential mistake.

Steps To Reproduce and Actual Behavior

Context Environment

We are working on enrolling the Google Coral Dev Board onto our existing balena fleet, which runs a collection of Raspberry Pis and NVIDIA Jetson Nanos in the following configuration:

After the above steps, the devices are able to start the k3s agent in one container and WireGuard in another, and join our k3s cluster, which runs Calico as its CNI.

Our process for enrolling the Google Coral Dev Board

Here are the logs for what happens when starting the k3s agent with ALL the kernel modules loaded

Dmesg logs on the host kernel

[   40.838652] random: crng init done
[   40.851547] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[   40.874565] EXT4-fs (mmcblk0p6): mounted filesystem with ordered data mode. Opts: (null)
[   41.028904] systemd[1]: System time before build time, advancing clock.
[   41.097585] systemd[1]: File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[   41.097598] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[   41.209113] systemd[1]: /lib/systemd/system/chronyd.service:25: Unknown key name 'ProcSubset' in section 'Service', ignoring.
[   41.209146] systemd[1]: /lib/systemd/system/chronyd.service:28: Unknown key name 'ProtectHostname' in section 'Service', ignoring.
[   41.209163] systemd[1]: /lib/systemd/system/chronyd.service:29: Unknown key name 'ProtectKernelLogs' in section 'Service', ignoring.
[   41.209194] systemd[1]: /lib/systemd/system/chronyd.service:32: Unknown key name 'ProtectProc' in section 'Service', ignoring.
[   42.247023] imx-sdma 30bd0000.sdma: no iram assigned, using external mem
[   42.256266] imx-sdma 30bd0000.sdma: loaded firmware 4.2
[   42.259899] imx-sdma 302c0000.sdma: no iram assigned, using external mem
[   42.268096] imx-sdma 302c0000.sdma: loaded firmware 4.2
[   42.348589] ina2xx 1-0040: error configuring the device: -6
[   42.361241] ina2xx 1-0041: error configuring the device: -6
[   42.750411] zram: Can't change algorithm for initialized device
[   43.627353] Adding 503584k swap on /dev/zram0.  Priority:-2 extents:1 across:503584k SS
[   43.910583] wlan: loading out-of-tree module taints kernel.
[   43.975040] wlan: loading driver v4.5.23.1
[   43.975387] hif_pci_probe:, con_mode= 0x0
[   43.975397] PCI device id is 003e :003e
[   43.975417] hif_pci 0000:01:00.0: BAR 0: assigned [mem 0x18000000-0x181fffff 64bit]
[   43.975548] hif_pci 0000:01:00.0: enabling device (0000 -> 0002)
[   43.976718]
                hif_pci_configure : num_desired MSI set to 1
[   44.054114] hif_pci_probe: ramdump base 0xffff800024e00000 size 2095136
[   44.126366] NUM_DEV=1 FWMODE=0x2 FWSUBMODE=0x0 FWBR_BUF 0
[   44.779370] +HWT
[   44.796852] -HWT
[   44.820250] HTT: full reorder offload enabled
[   44.860930] Pkt log is disabled
[   44.865835] Host SW:4.5.23.1, FW:2.0.1.1048, HW:QCA6174_REV3_2
[   44.866430] ol_pktlog_init: pktlogmod_init successfull
[   44.866722] wlan: driver loaded in 892000
[   44.870061] target uses HTT version 3.50; host uses 3.28
[   47.488191] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   47.495084] Generic PHY 30be0000.ethernet-1:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=30be0000.ethernet-1:00, irq=POLL)
[   47.495751] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   47.534226] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   47.534572] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   47.668851] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   51.593483] fec 30be0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   51.593510] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   65.517525] Bridge firewalling registered
[   65.630300] Initializing XFRM netlink socket
[   65.641279] Netfilter messages via NETLINK v0.30.
[   65.900887] IPv6: ADDRCONF(NETDEV_UP): supervisor0: link is not ready
[   65.995140] IPv6: ADDRCONF(NETDEV_UP): balena0: link is not ready
[   68.796835] ipip: IPv4 and MPLS over IPv4 tunneling driver
[   73.869504] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   79.811546] EXT4-fs (mmcblk0p3): mounted filesystem with ordered data mode. Opts: (null)
[  235.903492] ctnetlink v0.93: registering with nfnetlink.
[  255.787535] ip_set: protocol 6
[  256.065155] IPVS: [rr] scheduler registered.
[  256.688073] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[  256.704251] Mem abort info:
[  256.709945]   Exception class = DABT (current EL), IL = 32 bits
[  256.721880]   SET = 0, FnV = 0
[  256.728142]   EA = 0, S1PTW = 0
[  256.734520] Data abort info:
[  256.740377]   ISV = 0, ISS = 0x00000006
[  256.748146]   CM = 0, WnR = 0
[  256.754179] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80001f9f3000
[  256.767329] [0000000000000040] *pgd=000000005f9f8003, *pud=000000005fa76003, *pmd=0000000000000000
[  256.785345] Internal error: Oops: 96000006 [#1] PREEMPT SMP
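
For readers who want to interpret the code in the Oops line, here is a minimal Python sketch that decodes it, assuming it is an AArch64 ESR (Exception Syndrome Register) value and that it is a data abort. The function name decode_esr and the lookup tables are illustrative and only cover the cases relevant to this log.

```python
# Minimal sketch: decode an AArch64 ESR value such as the 96000006 in the Oops line.
# Only data-abort fields are interpreted; other exception classes are reported as unknown.

EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

# Upper four bits of the 6-bit data fault status code (DFSC).
FAULT_TYPES = {
    0b0000: "Address size fault",
    0b0001: "Translation fault",
    0b0010: "Access flag fault",
    0b0011: "Permission fault",
}


def decode_esr(esr):
    ec = (esr >> 26) & 0x3F      # exception class, bits [31:26]
    il = (esr >> 25) & 0x1       # instruction length, bit 25 (1 = 32-bit)
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits [24:0]
    dfsc = iss & 0x3F            # data fault status code, bits [5:0]
    wnr = (iss >> 6) & 0x1       # write-not-read, bit 6

    fault_type = FAULT_TYPES.get(dfsc >> 2)
    if fault_type is None:
        fault = "unknown (DFSC=0x%02x)" % dfsc
    else:
        fault = "%s, level %d" % (fault_type, dfsc & 0x3)

    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "unknown"),
        "il_32bit": bool(il),
        "iss": iss,
        "write": bool(wnr),
        "fault": fault,
    }


if __name__ == "__main__":
    for key, value in decode_esr(0x96000006).items():
        print(key, "=", value)
```

For 96000006 this reports a read access (WnR = 0) that hit a level 2 translation fault, which is consistent with the "DABT (current EL)" and "*pmd=0000000000000000" lines in the dump. It does not identify which code dereferenced the NULL pointer; that would need the call trace that normally follows the Oops line.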

K3S agent logs (1.23.17 but also crashes on the latest stable)

INFO[0001] Starting k3s agent v1.23.17+k3s1 (abb8d7d4)
INFO[0001] Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [129.114.34.140:6443 dev.edge.chameleoncloud.org:6443]
WARN[0001] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0003] Module overlay was already loaded
INFO[0003] Module nf_conntrack was already loaded
INFO[0003] Module br_netfilter was already loaded
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
INFO[0003] Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log
INFO[0003] Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
INFO[0004] Containerd is now running
INFO[0004] Getting list of apiserver endpoints from server
INFO[0005] Tunnel authorizer set Kubelet Port 10250
INFO[0005] Updating load balancer k3s-agent-load-balancer default server address -> 129.114.34.140:6443
INFO[0005] Connecting to proxy                           url="wss://129.114.34.140:6443/v1-k3s/connect"
WARN[0005] Disabling CPU quotas due to missing cpu controller or cpu.cfs_period_us
INFO[0005] Running kubelet --address=0.0.0.0 --allowed-unsafe-sysctls=net.ipv4.ip_forward,net.ipv6.conf.all.forwarding --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime=remote --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --cpu-cfs-quota=false --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --kubelet-cgroups=/k3s --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key --volume-plugin-dir=/opt/libexec/kubernetes/kubelet-plugins/volume/exec
Flag --cloud-provider has been deprecated, will be removed in 1.24 or later, in favor of removing cloud provider code from Kubelet.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
I0416 18:57:06.107611    1338 server.go:442] "Kubelet version" kubeletVersion="v1.23.17+k3s1"
I0416 18:57:06.111875    1338 dynamic_cafile_content.go:156] "Starting controller" name="client-ca-bundle::/var/lib/rancher/k3s/agent/client-ca.crt"
INFO[0005] Annotations and labels have already set on node: 8fb60a5
INFO[0006] Running kube-proxy --cluster-cidr=192.168.64.0/18 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubeproxy.kubeconfig --proxy-mode=iptables
I0416 18:57:06.604015    1338 server.go:224] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
INFO[0006] Starting the netpol controller version v1.5.2-0.20221026101626-e01045262706, built on 2023-03-10T21:33:49Z, go1.19.6
I0416 18:57:06.623003    1338 network_policy_controller.go:163] Starting network policy controller
I0416 18:57:06.626245    1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_wrr"
I0416 18:57:06.631820    1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_sh"
I0416 18:57:06.679697    1338 network_policy_controller.go:175] Starting network policy controller full sync goroutine
I0416 18:57:06.811294    1338 node.go:163] Successfully retrieved node IP: 192.168.1.201
I0416 18:57:06.811465    1338 server_others.go:138] "Detected node IP" address="192.168.1.201"
I0416 18:57:06.895718    1338 server_others.go:206] "Using iptables Proxier"
I0416 18:57:06.896101    1338 server_others.go:213] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0416 18:57:06.896288    1338 server_others.go:214] "Creating dualStackProxier for iptables"
I0416 18:57:06.896494    1338 server_others.go:502] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0416 18:57:06.898929    1338 server.go:656] "Version info" version="v1.23.17+k3s1"
I0416 18:57:06.911637    1338 config.go:444] "Starting node config controller"
I0416 18:57:06.912495    1338 shared_informer.go:240] Waiting for caches to sync for node config
I0416 18:57:06.911638    1338 config.go:226] "Starting endpoint slice config controller"
I0416 18:57:06.912773    1338 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0416 18:57:06.911692    1338 config.go:317] "Starting service config controller"
I0416 18:57:06.912992    1338 shared_informer.go:240] Waiting for caches to sync for service config
I0416 18:57:07.013755    1338 shared_informer.go:247] Caches are synced for node config
I0416 18:57:07.113248    1338 shared_informer.go:247] Caches are synced for endpoint slice config
I0416 18:57:07.113317    1338 shared_informer.go:247] Caches are synced for service config

Debugging the crash

After manually loading the kernel modules one by one, we managed to identify the kernel module that causes the crash: nf_conntrack_netlink. The K3s agent starts fine with all the other kernel modules loaded, but crashes the kernel as soon as it is started with the offending kmod loaded.
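
When loading modules one at a time like this, it helps to confirm which ones are actually present before each agent start. The sketch below is a rough Python helper for that, assuming the usual /proc/modules layout where the module name is the first field on each line. The WANTED list is only an illustration built from module names that appear in the logs above, and SAMPLE is made-up input, not output from the Coral board.

```python
# Minimal sketch: report which modules from a list are absent from /proc/modules.
# Note: modules compiled into the kernel do not appear in /proc/modules,
# so "missing" here only means "not loaded as a loadable module".

# Illustrative list, taken from module names mentioned in this issue.
WANTED = ["overlay", "nf_conntrack", "br_netfilter", "nf_conntrack_netlink"]

# Made-up sample in /proc/modules format (name size refcount deps state address).
SAMPLE = """\
overlay 49152 0 - Live 0x0000000000000000
nf_conntrack 131072 0 - Live 0x0000000000000000
br_netfilter 24576 0 - Live 0x0000000000000000
"""


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(wanted, proc_modules_text):
    """Return the wanted modules that are not loaded, in the order given."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in wanted if name not in loaded]


if __name__ == "__main__":
    # On a real device, replace SAMPLE with the contents of /proc/modules.
    print("not loaded:", missing_modules(WANTED, SAMPLE))
```

Running it between each module load and each agent start gives a record of exactly which set of modules was present when the kernel went down.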

brandond commented 7 months ago

1 server, 2 enrolled agents, and the google coral dev board agent which is encountering a variety of issues using Calico CNI 3.24.1

K3s doesn't come with Calico by default. Do you run into this same problem when using Flannel?

If the issue only affects your nodes when you use Calico instead of Flannel, and your nodes are missing a kernel module that Calico requires, that does not sound like something we can fix in K3s.

JOUNAIDSoufiane commented 7 months ago

The issue could be related to Calico but I have observed the same issue with flannel when the nf_conntrack_netlink kernel module is loaded. I am hoping for either a resolution of the issue or a justification for why the kernel is not compatible in this case. I included information about how we load the kernel modules as well which could very much be another point of potential mistake.

The issue pops up with flannel as well, so it is likely an interaction with the kernel that goes wrong.

cwayne18 commented 7 months ago

This is not a k3s issue, and k3s does not support Calico.

brandond commented 7 months ago

I will also note that v1.23.17+k3s1 has been end of life since February 2023. Please try again with a non-end-of-life version of K3s. If in doubt, run k3s check-config and ensure that you have all the listed kernel modules available.

JOUNAIDSoufiane commented 7 months ago

I have tried with 1.27.11. Unfortunately the same issue arises, and check-config confirms I have all the necessary kernel modules for k3s itself. I understand that this is probably not a k3s issue in itself, but I was hoping for some pointers as to why this could be happening, since it also happens with flannel.

brandond commented 7 months ago

It appears that it's the kernel that's crashing, not K3s? If you are really using Linux 8fb60a5 4.14.98-imx, note that the 4.14 kernel series dates from late 2017. Is there nothing newer available for this device?

[  256.688073] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[  256.704251] Mem abort info:
[  256.709945]   Exception class = DABT (current EL), IL = 32 bits
[  256.721880]   SET = 0, FnV = 0
[  256.728142]   EA = 0, S1PTW = 0
[  256.734520] Data abort info:
[  256.740377]   ISV = 0, ISS = 0x00000006
[  256.748146]   CM = 0, WnR = 0
[  256.754179] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80001f9f3000
[  256.767329] [0000000000000040] *pgd=000000005f9f8003, *pud=000000005fa76003, *pmd=0000000000000000
[  256.785345] Internal error: Oops: 96000006 [#1] PREEMPT SMP

JOUNAIDSoufiane commented 7 months ago

Yes, I did mention in the title that loading the kmod crashes the entire host OS with a kernel oops (a NULL pointer dereference). We use the 4.14.98 kernel because it ships with the latest Balena OS version provided for this device, which is required for enrolling it in our environment. I'll switch to one of Google's provided kernels and see whether this still happens.

brandond commented 7 months ago

Yeah sorry, I think I missed that bit initially. It sounds like you are stuck in a challenging environment, with a lot of out-of-date components. I don't think it's something we can help with.