Open dracoding opened 6 months ago
@tomastigera @sridhartigera any thoughts?
@dracoding what kernel do you use?
ebpf will call the bpf_skb_adjust_room() to adjust the gso_size?
yes, that should happen within the kernel automatically, outside of calico's code (so I assume you are using ebpf)
my kernel version is 5.16.20.
Yes, I'm using Calico with BPF enabled. There is no eBPF in use outside of Calico's code.
what distro is it?
FWIW, a kernel BUG panic means there's a bug in the kernel, not in Calico. We'll do what we can, but please can you report it to your distro vendor. To have a chance of figuring it out we'll need to know exact details of the kernel/distro/hardware that you're using, along with details of the workload that is causing the problem.
Please can you also try a more recent kernel? There have been bugs like this in the past, and it's quite possible this one is already fixed upstream.
CentOS Linux release 7.8.2003 (Core)
It only happens in the cluster that has Calico eBPF mode enabled, so maybe that triggers the kernel bug. It doesn't happen frequently, maybe once every few months. I'm not sure which workload causes the problem.
I will try a more recent kernel, but it may take a long time to test.
Distro: CentOS Linux release 7.8.2003 (Core). Kernel: upstream 5.16.20. NIC: Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02).
I can provide any other hardware information you need.
@dracoding have you hit this issue again on the other kernels?
@tomastigera We have also started getting this error. One of our workers went down and we were forced to reboot the machine.
We are running an almost-latest kernel version from 2024-04, on Debian 12.
I have tested on kernel 6.6.35. It also has the problem, and I have submitted a patch to the community.
https://lore.kernel.org/all/20240626065555.35460-3-dracodingfly@gmail.com/
The network mode of Calico is BGP. When GRO and GSO are enabled, the kernel crashes randomly.
Expected Behavior
No crash when GRO/GSO is enabled.
Current Behavior
The stack trace is as follows.
Possible Solution
Disabled GRO and GSO is active.
Context
The patch mentioned in https://github.com/projectcalico/calico/issues/6865 doesn't work for me.
Analysis of the vmcore shows that it crashed at BUG_ON(skb_headlen(list_skb) > len).
The gso_size is 75, and the frag_list has one element whose head_frag is 1. The skb_shared_info struct is as follows.
In BGP mode, does eBPF call bpf_skb_adjust_room() to adjust the gso_size?
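For anyone following the analysis, the failing check can be modelled in a few lines. This is only a rough Python sketch of the comparison behind BUG_ON(skb_headlen(list_skb) > len), which sits in the kernel's GSO segmentation path (skb_segment()); it is not kernel code. `ToySkb`, `segment_len` and `would_hit_bug_on` are hypothetical names made up for this illustration, the other conditions the kernel evaluates before reaching the check are left out, and only the gso_size of 75 comes from the vmcore. The 1400-byte linear head and the 4200 remaining bytes are invented values.

```python
from dataclasses import dataclass


@dataclass
class ToySkb:
    """Hypothetical stand-in for the one sk_buff property that matters here."""
    headlen: int  # bytes in the linear area, as skb_headlen() would report


def segment_len(remaining: int, gso_size: int) -> int:
    """Length of the next segment: at most gso_size bytes of payload."""
    return min(remaining, gso_size)


def would_hit_bug_on(list_skb: ToySkb, seg_len: int) -> bool:
    """Mirror of the reported check BUG_ON(skb_headlen(list_skb) > len)."""
    return list_skb.headlen > seg_len


# gso_size=75 is taken from the report; headlen=1400 and remaining=4200
# are assumptions chosen only to make the comparison visible.
frag = ToySkb(headlen=1400)
print(would_hit_bug_on(frag, segment_len(4200, 75)))    # True
print(would_hit_bug_on(frag, segment_len(4200, 1400)))  # False
```

The point of the sketch is that a frag_list member whose linear head is larger than the current gso_size cannot fit into a single segment, so the check fires. That is consistent with a gso_size that was changed after the frag_list was built, though the sketch does not show which code path changed it.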
Your Environment
Calico version: v3.24.5