projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

kernel BUG at net/core/skbuff.c:4044! #8771

Open dracoding opened 6 months ago

dracoding commented 6 months ago

Calico is running in BGP mode. With GRO and GSO enabled, the kernel crashes at random intervals.

Expected Behavior

No crash when GRO/GSO are enabled.

Current Behavior

The stack trace is as follows.

[16194369.907056] kernel BUG at net/core/skbuff.c:4044!
[16194369.907097] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[16194369.907116] CPU: 13 PID: 0 Comm: swapper/13 Kdump: loaded Tainted: G E 5.16.16-1.el7.elrepo.x86_64 #1
[16194369.907151] Hardware name: New H3C Technologies Co., Ltd. H3C UniServer R4900 G5/RS35M2C16SE, BIOS 5.66 08/02/2023
[16194369.907181] RIP: 0010:skb_segment+0xbc8/0xe00
[16194369.907203] Code: 01 e9 41 89 8e b8 00 00 00 e9 e7 fe ff ff 44 89 c0 39 54 24 7c 0f 86 21 ff ff ff 31 c9 8b 74 24 7c 29 d6 09 f1 e9 07 ff ff ff <0f> 0b a8 01 75 0d 48 81 38 70 b0 7d b9 0f 84 91 fa ff ff 48 8b 7c
[16194369.907256] RSP: 0018:ffffa3f2cce08728 EFLAGS: 00010293
[16194369.907274] RAX: 000000000000007d RBX: 00000000fffff7b3 RCX: 0000000000000011
[16194369.907296] RDX: 0000000000000000 RSI: ffff895ea32c76c0 RDI: 00000000000008c1
[16194369.907317] RBP: ffffa3f2cce087f8 R08: 000000000000088f R09: 0000000000000011
[16194369.907338] R10: 000000000000090c R11: ffff895e47e68000 R12: ffff895eb2022f00
[16194369.907360] R13: 000000000000004b R14: ffff895ecdaf2000 R15: ffff895eb2023f00
[16194369.907381] FS: 0000000000000000(0000) GS:ffff899cbfb40000(0000) knlGS:0000000000000000
[16194369.907405] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16194369.907423] CR2: 00007f6c6b9d6a38 CR3: 0000000128c34002 CR4: 0000000000770ee0
[16194369.907445] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16194369.907466] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16194369.907488] PKRU: 55555554
[16194369.907497] Call Trace:
[16194369.907507]
[16194369.907521] tcp_gso_segment+0xf0/0x520
[16194369.907538] tcp4_gso_segment+0x53/0xd0
[16194369.907552] inet_gso_segment+0x150/0x3c0
[16194369.907568] skb_mac_gso_segment+0xa1/0x120
[16194369.907585] skb_udp_tunnel_segment+0x259/0x5c0
[16194369.907601] udp4_ufo_fragment+0x131/0x190
[16194369.907616] inet_gso_segment+0x150/0x3c0
[16194369.907632] ? bpf_prog_df66ced5f148853b_calico_tc_skb_accepted_entrypoint+0x1a6c/0x2c6c
[16194369.907658] skb_mac_gso_segment+0xa1/0x120
[16194369.907673] skb_gso_segment+0xce/0x190
[16194369.907687] ? netif_skb_features+0xc6/0x2c0
[16194369.907702] validate_xmit_skb+0x15e/0x2b0
[16194369.907716] dev_queue_xmit+0x234/0xc40
[16194369.907732] ? vlan_dev_hard_start_xmit+0x99/0xf0 [8021q]
[16194369.907751] dev_queue_xmit+0x10/0x20
[16194369.907764] bpf_redirect+0x1a8/0x320
[16194369.907778] skb_do_redirect+0xed/0x100
[16194369.907793] __netif_receive_skb_core+0xe25/0xf70
[16194369.908489] ? dev_queue_xmit+0x10/0x20
[16194369.909143] ? bpf_redirect+0x1a8/0x320
[16194369.909761] netif_receive_skb_list_core+0x12a/0x2b0
[16194369.910362] netif_receive_skb_list_internal+0x1da/0x300
[16194369.910955] ? dev_gro_receive+0x1b3/0x3a0
[16194369.911565] gro_normal_list.part.0+0x1e/0x40
[16194369.912164] gro_normal_one+0x7c/0x90
[16194369.912754] napi_gro_complete+0x7c/0xe0
[16194369.913329] napi_gro_flush+0xb1/0x100
[16194369.913868] napi_complete_done+0xfe/0x190
[16194369.914401] ice_napi_poll+0x146/0x2a0 [ice]
[16194369.914980] napi_poll+0x2e/0x150
[16194369.915477] net_rx_action+0x221/0x2d0
[16194369.915939] __do_softirq+0xdd/0x2c0
[16194369.916372] irq_exit_rcu+0xa4/0xc0
[16194369.916834] common_interrupt+0x8a/0xa0
[16194369.917254]
[16194369.917663]
[16194369.918062] asm_common_interrupt+0x1e/0x40
[16194369.918464] RIP: 0010:cpu_idle_poll+0x36/0x100

Possible Solution

Disabling GRO and GSO works around the crash.

ethtool --offload eth0 gro off
ethtool --offload eth0 gso off
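To confirm that the workaround has taken effect on a node, the offload state can be read back. The following is a minimal sketch, not part of the original report: it assumes `ethtool` is installed, that `eth0` is the relevant interface, and that the output uses the usual `generic-receive-offload` / `generic-segmentation-offload` feature names. The helper names `parse_offloads` and `read_offloads` are hypothetical.

```python
import subprocess

# Feature names as printed by `ethtool --show-offload`, mapped to the
# short names used on the `ethtool --offload` command line.
WANTED = {
    "generic-receive-offload": "gro",
    "generic-segmentation-offload": "gso",
}


def parse_offloads(ethtool_output: str) -> dict[str, bool]:
    """Extract the GRO/GSO on/off state from `ethtool --show-offload` text."""
    state = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name in WANTED:
            # value looks like " on", " off" or " off [fixed]"
            state[WANTED[name]] = value.split()[:1] == ["on"]
    return state


def read_offloads(dev: str) -> dict[str, bool]:
    """Run ethtool for one device (hypothetical helper; needs ethtool on PATH)."""
    result = subprocess.run(
        ["ethtool", "--show-offload", dev],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_offloads(result.stdout)
```

Calling `read_offloads("eth0")` after running the two commands above should report both `gro` and `gso` as `False` if the workaround is in place.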

Context

The patch mentioned in https://github.com/projectcalico/calico/issues/6865 does not work for me.

Analysis of the vmcore shows that the crash happened at BUG_ON(skb_headlen(list_skb) > len).
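As a rough illustration of why that check can fire, here is a simplified model of the condition. It is a sketch based on a reading of `skb_segment()`, not the kernel code itself: it assumes `len` is the remaining payload capped at the MSS (`gso_size`), and it ignores the other conditions the kernel evaluates before reaching the `BUG_ON`. The function names and the head length and remaining length used below are hypothetical; only `gso_size = 75` comes from this report.

```python
def segment_len(remaining: int, mss: int) -> int:
    """Length of the next output segment: what is left, capped at the MSS."""
    return min(remaining, mss)


def hits_bug_on(list_headlen: int, remaining: int, mss: int) -> bool:
    """True when the modeled check skb_headlen(list_skb) > len would fire."""
    return list_headlen > segment_len(remaining, mss)


GSO_SIZE = 75        # gso_size taken from the vmcore in this report
LIST_HEADLEN = 1448  # hypothetical linear head length of the frag_list skb
REMAINING = 3000     # hypothetical number of payload bytes still to segment

# A frag_list skb whose linear head is larger than the (mangled) gso_size
# cannot fit into a single output segment, so the modeled check fires.
print(hits_bug_on(LIST_HEADLEN, REMAINING, GSO_SIZE))

# If the head length matched gso_size, the check would pass.
print(hits_bug_on(GSO_SIZE, REMAINING, GSO_SIZE))
```

Under these assumptions, a frag_list element that was built for a larger segment size trips the check as soon as `gso_size` has been changed to a smaller value.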

The gso_size is 75, and the frag_list has one element whose head_frag is 1. The skb_shared_info struct is as follows.

struct skb_shared_info {
  nr_frags = 17 '\021',
  gso_size = 75,
  gso_segs = 0,
  frag_list = 0xffff895eb2022f00,
  gso_type = 1035,
  destructor_arg = 0x2d656c6261747372,
  frags = {{
      bv_page = 0xfffff80e86d4d180,
      bv_len = 125,
      bv_offset = 2306
    }, .... }
}
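The `gso_type = 1035` value is a bit mask, and decoding it helps when reading the dump. The sketch below assumes the `SKB_GSO_*` bit positions as recalled from `include/linux/skbuff.h` in 5.x kernels; they should be checked against the exact kernel tree in use. The helper name `decode_gso_type` is hypothetical.

```python
# SKB_GSO_* flag names indexed by bit position. This table is an assumption
# recalled from include/linux/skbuff.h (5.x kernels); verify it against the
# kernel source that produced the vmcore.
SKB_GSO_FLAGS = [
    "SKB_GSO_TCPV4",            # 1 << 0
    "SKB_GSO_DODGY",            # 1 << 1
    "SKB_GSO_TCP_ECN",          # 1 << 2
    "SKB_GSO_TCP_FIXEDID",      # 1 << 3
    "SKB_GSO_TCPV6",            # 1 << 4
    "SKB_GSO_FCOE",             # 1 << 5
    "SKB_GSO_GRE",              # 1 << 6
    "SKB_GSO_GRE_CSUM",         # 1 << 7
    "SKB_GSO_IPXIP4",           # 1 << 8
    "SKB_GSO_IPXIP6",           # 1 << 9
    "SKB_GSO_UDP_TUNNEL",       # 1 << 10
    "SKB_GSO_UDP_TUNNEL_CSUM",  # 1 << 11
    "SKB_GSO_PARTIAL",          # 1 << 12
    "SKB_GSO_TUNNEL_REMCSUM",   # 1 << 13
    "SKB_GSO_SCTP",             # 1 << 14
    "SKB_GSO_ESP",              # 1 << 15
    "SKB_GSO_UDP",              # 1 << 16
    "SKB_GSO_UDP_L4",           # 1 << 17
    "SKB_GSO_FRAGLIST",         # 1 << 18
]


def decode_gso_type(value: int) -> list[str]:
    """Return the names of all flag bits set in a gso_type value."""
    return [name for bit, name in enumerate(SKB_GSO_FLAGS) if value & (1 << bit)]


# 1035 == 0x40b == (1 << 10) | (1 << 3) | (1 << 1) | (1 << 0)
print(decode_gso_type(1035))
```

With the table above, 1035 decodes to `SKB_GSO_TCPV4`, `SKB_GSO_DODGY`, `SKB_GSO_TCP_FIXEDID` and `SKB_GSO_UDP_TUNNEL`. The UDP tunnel bit lines up with the `skb_udp_tunnel_segment` frame in the call trace, and `SKB_GSO_DODGY` marks the GSO metadata as coming from an untrusted source.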

In BGP mode, will the eBPF program call bpf_skb_adjust_room() to adjust the gso_size?

Your Environment

Calico version: v3.24.5

matthewdupre commented 6 months ago

@tomastigera @sridhartigera any thoughts?

tomastigera commented 6 months ago

@dracoding what kernel do you use?

ebpf will call the bpf_skb_adjust_room() to adjust the gso_size?

yes, that should happen within the kernel automatically, outside of calico's code (so I assume you are using ebpf)

dracoding commented 6 months ago

@dracoding what kernel do you use?

my kernel version is 5.16.20.

ebpf will call the bpf_skb_adjust_room() to adjust the gso_size?

yes, that should happen within the kernel automatically, outside of calico's code (so I assume you are using ebpf)

Yes, I'm using Calico with eBPF enabled. There is no eBPF outside of Calico's code.

tomastigera commented 6 months ago

my kernel version is 5.16.20

what distro is it?

fasaxc commented 6 months ago

FWIW, a kernel BUG panic means there's a bug in the kernel, not in Calico. We'll do what we can but please can you report it to your distro vendor. To have a chance of figuring it out we'll need to know exact details of the kernel/distro/hardware that you're using along with details of your workload that is causing the problem.

Please can you also try a more recent kernel, there have been bugs like this in the past, quite possible this one is already fixed upstream.

dracoding commented 6 months ago

what distro is it?

CentOS Linux release 7.8.2003 (Core)

dracoding commented 6 months ago

FWIW, a kernel BUG panic means there's a bug in the kernel, not in Calico. We'll do what we can but please can you report it to your distro vendor. To have a chance of figuring it out we'll need to know exact details of the kernel/distro/hardware that you're using along with details of your workload that is causing the problem.

Please can you also try a more recent kernel, there have been bugs like this in the past, quite possible this one is already fixed upstream.

It only happens in the cluster that has Calico eBPF mode enabled, so maybe that triggers the kernel bug. It doesn't happen frequently, maybe once every few months. I'm not sure which workload causes the problem.

I will try a more recent kernel but it may need a long time to test.

Distro: CentOS Linux release 7.8.2003 (Core). Kernel: upstream 5.16.20. NIC: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02).

I can provide any other hardware information you need.

ehsan310 commented 4 months ago

@dracoding have you hit this issue again on the other kernels?

@tomastigera We have also started getting this error. One of our workers went down and we were forced to reboot the machine.

We are running an almost-latest kernel version (from 2024-04) on Debian 12.

dracoding commented 4 months ago

@dracoding have you hit this issue again on the other kernels?

I have tested kernel 6.6.35. It has the same problem, and I have submitted a patch to the community.

https://lore.kernel.org/all/20240626065555.35460-3-dracodingfly@gmail.com/