Closed JOUNAIDSoufiane closed 7 months ago
1 server, 2 enrolled agents, and the google coral dev board agent which is encountering a variety of issues using Calico CNI 3.24.1
K3s doesn't come with Calico by default. Do you run into this same problem when using Flannel?
If the issue only affects your nodes when you use Calico instead of Flannel, and your nodes are missing a kernel module that Calico requires, that does not sound like something we can fix in K3s.
The issue could be related to Calico but I have observed the same issue with flannel when the nf_conntrack_netlink kernel module is loaded. I am hoping for either a resolution of the issue or a justification for why the kernel is not compatible in this case. I included information about how we load the kernel modules as well which could very much be another point of potential mistake.
The issue pops up with flannel as well, so it is likely an interaction with the kernel that goes wrong.
This is not a k3s issue, and k3s does not support calico
I will also note that v1.23.17+k3s1 has been end of life since February 2023. Please try again with a non-end-of-life version of K3s. If in doubt, run k3s check-config and ensure that you have all the listed kernel modules available.
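As a complement to check-config, here is a minimal Python sketch for seeing which modules from a given list are currently loaded, by reading /proc/modules. The module names in the example are only illustrative assumptions; the list that matters is whatever check-config reports. Modules compiled directly into the kernel do not appear in /proc/modules, so a "missing" result here is a hint rather than proof.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return the module names found in the text of /proc/modules."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first column of each line is the module name.
            names.add(fields[0])
    return names


def missing_modules(wanted, proc_modules_text: str) -> list[str]:
    """Return the entries of `wanted` that are not listed as loaded."""
    loaded = loaded_modules(proc_modules_text)
    # /proc/modules reports names with underscores, so normalise dashes.
    return [m for m in wanted if m.replace("-", "_") not in loaded]


if __name__ == "__main__":
    # Illustrative list only; substitute the modules check-config names.
    wanted = ["nf_conntrack", "nf_conntrack_netlink", "br_netfilter"]
    path = Path("/proc/modules")
    if path.exists():
        print(missing_modules(wanted, path.read_text()))
```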
I have tried with 1.27.11; unfortunately the same issue arises, and check-config confirms I have all the necessary kernel modules for k3s itself. I understand that this is probably not a k3s issue in itself, but I was hoping for some pointers as to why this could be happening, since it also happens with flannel.
It appears that it's the kernel that's crashing, not K3s? If you are really using Linux 8fb60a5 4.14.98-imx, that kernel is from late 2017; is there nothing newer available for this device?
[ 256.688073] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[ 256.704251] Mem abort info:
[ 256.709945] Exception class = DABT (current EL), IL = 32 bits
[ 256.721880] SET = 0, FnV = 0
[ 256.728142] EA = 0, S1PTW = 0
[ 256.734520] Data abort info:
[ 256.740377] ISV = 0, ISS = 0x00000006
[ 256.748146] CM = 0, WnR = 0
[ 256.754179] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80001f9f3000
[ 256.767329] [0000000000000040] *pgd=000000005f9f8003, *pud=000000005fa76003, *pmd=0000000000000000
[ 256.785345] Internal error: Oops: 96000006 [#1] PREEMPT SMP
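For readers unfamiliar with the hex value on the "Internal error: Oops" line, the following Python sketch splits it into the fields that the kernel prints piecemeal above it. It assumes the value is an AArch64 exception syndrome (ESR) and uses the field layout as I understand it from the Arm architecture documentation; only a small subset of codes is named, and the function and table names are made up for this example.

```python
# Small, deliberately incomplete lookup tables (assumed from the Arm ARM).
EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort at the current exception level",
}
DFSC_NAMES = {
    0b000100: "translation fault, level 0",
    0b000101: "translation fault, level 1",
    0b000110: "translation fault, level 2",
    0b000111: "translation fault, level 3",
}


def decode_esr(esr: int) -> dict:
    """Split a 32-bit ESR value into the fields shown in the oops header."""
    iss = esr & 0x1FFFFFF            # instruction-specific syndrome, bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class, bits [31:26]
        "il_32bit": bool((esr >> 25) & 0x1),  # instruction length bit
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,     # 1 = write access, 0 = read access
        "dfsc": iss & 0x3F,          # data fault status code, bits [5:0]
    }


if __name__ == "__main__":
    fields = decode_esr(0x96000006)
    print(EC_NAMES.get(fields["ec"], "other"), "/",
          DFSC_NAMES.get(fields["dfsc"], "other"))
```

Under those assumptions, 0x96000006 decodes to a data abort at the current exception level, caused by a read (WnR = 0) that hit a level 2 translation fault. That matches the log's ISS = 0x00000006 and the empty *pmd entry, and the fault address 00000040 is what a read of a field at a small offset from a NULL pointer would look like.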
Yes, I did mention in the title that the use of the kmod actually crashes the entire host OS by segfaulting the kernel. We use the 4.14.98 kernel because it is what the latest Balena OS release for this device provides, and Balena OS is required for enrolling the device in our environment. I'll switch to one of Google's provided kernels and see if this still happens.
Yeah sorry, I think I missed that bit initially. It sounds like you are stuck in a challenging environment, with a lot of out-of-date components. I don't think it's something we can help with.
Environmental Info:
K3s Version:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
1 server, 2 enrolled agents, and the google coral dev board agent which is encountering a variety of issues using Calico CNI 3.24.1
Describe the bug:
Starting the K3s agent on the Google Coral Dev Board crashes the host kernel when the nf_conntrack_netlink kernel module is loaded. The agent does not crash when started without the module, but then Calico is unable to initialize on the agent.
Expected behavior:
The issue could be related to Calico but I have observed the same issue with flannel when the nf_conntrack_netlink kernel module is loaded. I am hoping for either a resolution of the issue or a justification for why the kernel is not compatible in this case. I included information about how we load the kernel modules as well which could very much be another point of potential mistake.
Steps To Reproduce and Actual Behavior
Context Environment
We are working on enrolling the Google Coral Dev Board onto our existing balena-fleet that runs a collection of Raspberry Pis and NVIDIA Jetson Nanos in the following configuration:
After the above steps, the devices are able to start k3s agent in a container and wireguard in another and join our k3s cluster that is running Calico as its CNI.
Our process for enrolling the Google Coral Dev Board
Here are the logs for what happens when starting the k3s agent with ALL the kernel modules loaded
Dmesg logs on the host kernel
K3S agent logs (1.23.17 but also crashes on the latest stable)
Debugging the crash
After manually loading the kernel modules one by one, we managed to identify the kernel module that causes the crash: nf_conntrack_netlink. The K3S agent starts fine with all the other kernel modules loaded but crashes the kernel as soon as it is started with the offending kmod loaded.
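The one-by-one search described above can be made less tedious with a small journal that survives the crash. The Python sketch below is an illustration, not the procedure actually used here: `run_trials`, `last_unfinished`, and the `trial` callable are hypothetical names. Since the crash only appears once the agent starts, a real `trial` would have to load the module and then start the agent; that part is left to the caller.

```python
import os
from typing import Callable, Iterable, Optional


def run_trials(modules: Iterable[str], journal_path: str,
               trial: Callable[[str], None]) -> None:
    """Run `trial` for each module, journaling before and after each one."""
    with open(journal_path, "a") as journal:
        for name in modules:
            journal.write(f"trying {name}\n")
            journal.flush()
            os.fsync(journal.fileno())   # force the line to disk before the risky step
            trial(name)                  # may take the whole machine down
            journal.write(f"ok {name}\n")
            journal.flush()
            os.fsync(journal.fileno())


def last_unfinished(journal_text: str) -> Optional[str]:
    """Return the module whose trial started but never completed, if any."""
    pending = None
    for line in journal_text.splitlines():
        state, _, name = line.partition(" ")
        if state == "trying":
            pending = name
        elif state == "ok" and name == pending:
            pending = None
    return pending
```

After a reboot, passing the journal contents to `last_unfinished` names the module that was being tried when the host went down. The journal file needs to live on storage that persists across reboots for this to work.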