Closed JOUNAIDSoufiane closed 3 months ago
@JOUNAIDSoufiane do you have a kernel stack trace? Does it fault because of the missing module?
I am currently trying my best to get a call trace out of this, and I'll post one as soon as I have it. In the meantime, is there a way to start Calico successfully without the nf_conntrack_netlink module? That specific module (which is cited as required for Calico) happens to cause the crash when I start the k3s agent. When I start it without that module, the k3s agent runs and joins the cluster, but Calico does not initialize. I have included logs above showing how the calico-node init containers refuse to start in this case.
Destination directory /host/driver not present
Comes from here https://github.com/projectcalico/calico/blob/master/pod2daemon/flexvol/docker-image/flexvol.sh#L55
I think that directory is something created by k8s, and it is simply missing when you run the container in the simplistic way you do with k3s ctr.
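As a rough illustration of the kind of guard that produces this message, here is a minimal Python sketch of a directory-existence check. It is not the flexvol.sh code itself: the function name `check_dest_dir` is hypothetical, and the message wording is taken from the error quoted in this thread.

```python
import os
from typing import Optional


def check_dest_dir(dest_dir: str) -> Optional[str]:
    """Return an error message if dest_dir is missing, else None.

    Hypothetical analogue of a "destination directory must exist" guard;
    the real check lives in the shell script linked above.
    """
    if not os.path.isdir(dest_dir):
        return f"Destination directory {dest_dir} not present!?"
    return None


if __name__ == "__main__":
    # /host/driver is the mount path shown in the pod description.
    message = check_dest_dir("/host/driver")
    print(message if message else "destination directory found")
```

When the container is started by hand rather than by the kubelet, nothing mounts a host path at /host/driver, so a guard like this would be expected to fail.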
the calico-node pod should start and possibly throw other errors related to the missing kernel module.
You can use calicoctl checksystem to check if you have all the prerequisites. Here is a list of the modules that calicoctl currently checks for.
I see, thank you for that clarification. Here is my output of calicoctl checksystem:
Checking kernel version...
4.14.98-imx OK
Checking kernel modules...
ip_tables OK
WARNING: Unable to detect the ipt_ipvs module as Loaded/Builtin module or lsmod
ipt_ipvs FAIL
xt_bpf OK
ipt_rpfilter OK
WARNING: Unable to detect the ipt_set module as Loaded/Builtin module or lsmod
ipt_set FAIL
xt_set OK
xt_u32 OK
ip6_tables OK
WARNING: Unable to detect the xt_rpfilter module as Loaded/Builtin module or lsmod
xt_rpfilter FAIL
WARNING: Unable to detect the nf_conntrack_netlink module as Loaded/Builtin module or lsmod
nf_conntrack_netlink FAIL
xt_icmp OK
xt_multiport OK
WARNING: Unable to detect the vfio-pci module as Loaded/Builtin module or lsmod
vfio-pci FAIL
xt_addrtype OK
xt_conntrack OK
xt_mark OK
ipt_REJECT OK
xt_icmp6 OK
ip_set OK
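To pull the failing modules out of a report like the one above, a small parser can help. This is a minimal sketch that assumes the output keeps the "name OK" / "name FAIL" line format shown here. The function names are hypothetical and not part of calicoctl.

```python
import re
from typing import Dict, List


def parse_checksystem(output: str) -> Dict[str, bool]:
    """Map each checked kernel module to True (OK) or False (FAIL)."""
    results: Dict[str, bool] = {}
    in_modules = False
    for line in output.splitlines():
        if line.strip().startswith("Checking kernel modules"):
            in_modules = True  # skip the kernel version section before this
            continue
        if not in_modules:
            continue
        # WARNING lines have more than two tokens, so they do not match.
        match = re.fullmatch(r"\s*(\S+)\s+(OK|FAIL)\s*", line)
        if match:
            results[match.group(1)] = match.group(2) == "OK"
    return results


def failed_modules(output: str) -> List[str]:
    """Return the sorted names of modules reported as FAIL."""
    parsed = parse_checksystem(output)
    return sorted(name for name, ok in parsed.items() if not ok)
```

Run against the report above, this would list ipt_ipvs, ipt_set, nf_conntrack_netlink, vfio-pci and xt_rpfilter as the failing entries.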
I purposely unloaded nf_conntrack_netlink as it causes a crash when starting k3s agent with calico; as for the other missing modules, this GitHub issue suggests that the command itself is outdated.
Furthermore, regarding why calico-node is not starting: I doubt the issue is related to missing modules, since the flexvol init container itself refuses to even start. At that point, Calico has not really started on the node yet, so it could not be the one complaining.
This is all I could gather from k8s. I tried to look up the message but had hardly any luck finding a concrete reason why this is not starting.
Init Containers:
flexvol-driver:
Container ID:
Image: docker.io/calico/pod2daemon-flexvol:v3.24.1
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: sandbox container "7efd1a9004167f93b06178010d01d3094544cdeb2f9e5495a804c1786563d82c" is not running
Exit Code: 128
Started: Wed, 31 Dec 1969 18:00:00 -0600
Finished: Wed, 17 Apr 2024 16:47:32 -0500
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-79ghs (ro)
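One detail in the description above is the Started value of Wed, 31 Dec 1969 18:00:00 -0600. That looks like the Unix epoch (timestamp 0) rendered at a UTC-6 offset, which suggests that no real start time was ever recorded for the container. The sketch below only demonstrates the arithmetic. The helper name `render_epoch` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def render_epoch(offset_hours: int) -> str:
    """Format the Unix epoch (1970-01-01T00:00:00Z) at a fixed UTC offset."""
    tz = timezone(timedelta(hours=offset_hours))
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch.astimezone(tz).strftime("%a, %d %b %Y %H:%M:%S %z")


if __name__ == "__main__":
    print(render_epoch(-6))  # Wed, 31 Dec 1969 18:00:00 -0600
```

This is consistent with the StartError reason: the container appears to have terminated without ever having run.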
purposely unloaded nf_conntrack_netlink as it causes a crash when starting k3s agent with calico
Sure, but what is the cause? A buggy old kernel, it seems. If you managed to start Calico and k3s without conntrack, would you be able to use policies meaningfully? I don't think so :shrug:
Any chance you can install a newer fixed kernel?
Right, it does seem like a buggy old kernel. I'm using Balena OS, I've put in a request for them to update the kernel version!
In the meantime, I'll try outside of Balena OS with a newer kernel provided by Google and let you know how that fares.
Did you make any progress? I am closing this issue now but feel free to reopen if you have new information.
Hi Tomas, Thank you for inquiring. I inevitably had to scrap the support plans for now and submitted an issue for Balena to update their coral-board-specific version of Balena-os to a more recent Linux that hopefully does not have issues with the kernel modules here.
Let me preface this by saying that this is an unusual setup scenario and that I am not running Calico in its ideal environment. If you do not care about the context as to why we try to start Calico without nf_conntrack_netlink, please skip over to the Expected and Current behavior headings.
Context
Environment
We are working on enrolling the Google Coral Dev Board onto our existing Balena fleet, which runs a collection of Raspberry Pis and NVIDIA Jetson Nanos in the following configuration:
After the above steps, the devices are able to start k3s agent in a container and wireguard in another and join our k3s cluster that is running Calico as its CNI.
Our process for enrolling the Google Coral Dev Board
Here are the logs for what happens when starting the k3s agent with ALL the kernel modules loaded
Dmesg logs on the host kernel
K3S agent logs (1.23.17 but also crashes on the latest stable)
Debugging the crash
After manually loading the kernel modules one by one, we managed to identify the kernel module that causes the crash: nf_conntrack_netlink. The k3s agent starts fine with all the other kernel modules loaded but crashes the kernel as soon as it is started with the offending kmod loaded. This is of course not an issue with Calico, though I would highly appreciate some help with figuring out how the crash could
Expected Behavior
the calico-node pod should start and possibly throw other errors related to the missing kernel module.
Current Behavior
After I start the k3s agent without nf_conntrack_netlink, it manages to join the cluster. However, as expected, Calico refuses to start, but I am unsure of the reasons why. Here is a bullet summary of what I managed to gather:
The calico-node pod fails to start its first init container, flexvol-driver.
While K8s fails to gather the logs from containerd, we observe a cryptic "Destination directory /host/driver not present!?" when starting the container using the k3s ctr utility to directly access containerd.
This is the roadblock in Calico's setup on the agent.
Kubectl describe calico-node output
Output of starting the flexvol container with containerd on the Google Coral Dev Board