projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

calico netatop exception cause node restart "exception RIP: kmem_cache_alloc" #7974

Closed ming12713 closed 1 year ago

ming12713 commented 1 year ago

Hello, I have a few bare-metal servers that frequently restart unexpectedly. Analysis with kdump indicates that the crash is triggered by Calico. (Screenshot from 2023-08-30 11-07-29)

dmesg error

[ 2376.974624] IPv6: ADDRCONF(NETDEV_CHANGE): cali9bef92c240d: link becomes ready
[ 2379.510962] general protection fault, probably for non-canonical address 0x8b30ba20faa16f31: 0000 [#1] SMP NOPTI                                                                                                
[ 2379.511009] CPU: 17 PID: 339161 Comm: calico Kdump: loaded Tainted: G           OE     5.15.0-79-generic #86-Ubuntu                                                                                             
[ 2379.511041] Hardware name: Dell Inc. PowerEdge R650/0Y2G81, BIOS 1.8.2 09/14/2022
[ 2379.511062] RIP: 0010:kmem_cache_alloc+0xfd/0x2f0
[ 2379.511083] Code: 8b 50 08 49 8b 00 49 83 78 10 00 48 89 45 c8 0f 84 96 01 00 00 48 85 c0 0f 84 8d 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 4a 01 48 0f ce 48
[ 2379.511136] RSP: 0018:ff5e9a946e83b820 EFLAGS: 00010092                       
[ 2379.511154] RAX: 8b30ba20faa16ef9 RBX: 0000000000000078 RCX: 8b30ba20faa16f31 
[ 2379.511176] RDX: 0000000000000170 RSI: 0000000000000a20 RDI: 007bb5963f23d8e0 
[ 2379.511197] RBP: ff5e9a946e83b860 R08: ff909a943e43d8e0 R09: ff14e4cec6f00000 
[ 2379.511217] R10: 0000000000052cd9 R11: 0000000000000006 R12: ff14e4ee4ddd2d00 
[ 2379.511238] R13: 0000000000000000 R14: 0000000000000a20 R15: 0000000000000a20
[ 2379.511258] FS:  000000c000780490(0000) GS:ff14e4fdff200000(0000) knlGS:0000000000000000
[ 2379.511282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2379.511299] CR2: 000000c001007740 CR3: 00000024457ae003 CR4: 0000000000771ee0
[ 2379.511320] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2379.511341] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2379.511361] PKRU: 55555554
[ 2379.511371] Call Trace:
[ 2379.511381]  <TASK>
[ 2379.511391]  ? get_taskinfo+0xac/0x1b0 [netatop]
[ 2379.511411]  get_taskinfo+0xac/0x1b0 [netatop]                                                                                                                                                                  
[ 2379.511427]  sock2task+0x1ac/0x480 [netatop]
[ 2379.511443]  analyze_tcpv4_packet+0x1bd/0x210 [netatop]
[ 2379.511465]  ipv4_hookout+0x86/0xf0 [netatop]
[ 2379.511482]  nf_hook_slow+0x41/0xc0
[ 2379.511501]  __ip_local_out+0xd6/0x150
[ 2379.511517]  ? ip_output+0x100/0x100
[ 2379.511533]  ip_local_out+0x1d/0x70
[ 2379.511548]  __ip_queue_xmit+0x184/0x440
[ 2379.511565]  ip_queue_xmit+0x15/0x20
[ 2379.511579]  __tcp_transmit_skb+0x910/0x9c0
[ 2379.511599]  tcp_write_xmit+0x3e9/0xb40
[ 2379.511615]  ? __check_object_size.part.0+0x4a/0x150
[ 2379.511638]  __tcp_push_pending_frames+0x37/0x100 
[ 2379.511657]  tcp_push+0xd9/0x110
[ 2379.511670]  tcp_sendmsg_locked+0x89a/0xc90
[ 2379.511687]  tcp_sendmsg+0x2d/0x50
[ 2379.512488]  inet_sendmsg+0x43/0x80
[ 2379.513269]  sock_sendmsg+0x62/0x70
[ 2379.514042]  sock_write_iter+0x93/0xf0
[ 2379.514786]  new_sync_write+0x18d/0x1a0
[ 2379.515505]  vfs_write+0x1d5/0x270
[ 2379.516196]  ksys_write+0xb5/0xf0
[ 2379.516860]  __x64_sys_write+0x19/0x20
[ 2379.517502]  do_syscall_64+0x59/0xc0
[ 2379.518119]  ? syscall_exit_to_user_mode+0x27/0x50
[ 2379.518720]  ? do_syscall_64+0x69/0xc0
[ 2379.519300]  ? do_syscall_64+0x69/0xc0
[ 2379.519852]  ? irqentry_exit+0x1d/0x30
[ 2379.520384]  ? sysvec_reschedule_ipi+0x78/0xe0
[ 2379.520909]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2379.521423] RIP: 0033:0x403ace
[ 2379.521912] Code: 48 89 6c 24 38 48 8d 6c 24 38 e8 0d 00 00 00 48 8b 6c 24 38 48 83 c4 40 c3 cc cc cc 49 89 f2 48 89 fa 48 89 ce 48 89 df 0f 05 <48> 3d 01 f0 ff ff 76 15 48 f7 d8 48 89 c1 48 c7 c0 ff ff ff ff 48
[ 2379.522950] RSP: 002b:000000c000ce9640 EFLAGS: 00000206 ORIG_RAX: 0000000000000001
[ 2379.523484] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 0000000000403ace
[ 2379.524017] RDX: 000000000000008f RSI: 000000c0002f4000 RDI: 0000000000000007
[ 2379.524546] RBP: 000000c000ce9680 R08: 0000000000000000 R09: 0000000000000000
[ 2379.525073] R10: 0000000000000000 R11: 0000000000000206 R12: 000000c000ce97c0
[ 2379.525619] R13: 0000000000000000 R14: 000000c000517ba0 R15: 000000c000780400
[ 2379.526129]  </TASK>
[ 2379.526622] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vxlan xt_set ipt_rpfilter ip_set_hash_ip ip_set_hash_net ip_set xfrm_user xfrm_algo wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel xt_multiport veth nf_conntrack_netlink xt_recent xt_nat xt_statistic xt_addrtype ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nft_counter nf_tables nfnetlink netatop(OE) sunrpc binfmt_misc nls_iso8859_1 xfs intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm rapl dell_wmi ledtrig_audio sparse_keymap intel_cstate video dell_smbios dcdbas dell_wmi_descriptor wmi_bmof isst_if_mbox_pci isst_if_mmio mei_me isst_if_common mei intel_pch_thermal acpi_ipmi
[ 2379.526686]  ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac br_netfilter scsi_dh_emc scsi_dh_alua bridge stp llc overlay msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core mgag200 i2c_algo_bit drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect sysimgblt ghash_clmulni_intel fb_sys_fops mlx5_core cec aesni_intel mlxfw psample crypto_simd i2c_i801 xhci_pci rc_core tls cryptd ahci drm megaraid_sas tg3 pci_hyperv_intf i2c_smbus libahci xhci_pci_renesas intel_pmt wmi
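
For triage, it may help to list which frames in the call trace belong to loadable modules, since the kernel marks those with the module name in square brackets. The following is only a rough sketch: the `TRACE` sample is a handful of frames copied from the log above, and `FRAME_RE` and `module_frames` are hypothetical helper names, not part of any Calico or kernel tooling.

```python
import re

# A few frames copied from the call trace above (not the full log).
TRACE = """\
[ 2379.511411]  get_taskinfo+0xac/0x1b0 [netatop]
[ 2379.511427]  sock2task+0x1ac/0x480 [netatop]
[ 2379.511443]  analyze_tcpv4_packet+0x1bd/0x210 [netatop]
[ 2379.511465]  ipv4_hookout+0x86/0xf0 [netatop]
[ 2379.511482]  nf_hook_slow+0x41/0xc0
[ 2379.511501]  __ip_local_out+0xd6/0x150
"""

# Hypothetical pattern for one trace line:
#   [timestamp]  [? ]symbol+0xOFF/0xLEN [module]
# A leading "?" marks a frame the unwinder is unsure about.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def module_frames(log):
    """Return (symbol, module) pairs for reliable frames that live in a module."""
    frames = []
    for line in log.splitlines():
        m = FRAME_RE.match(line)
        # Skip unparsed lines, core-kernel frames, and "?" (unreliable) frames.
        if m and m.group("mod") and not m.group("guess"):
            frames.append((m.group("sym"), m.group("mod")))
    return frames


frames = module_frames(TRACE)
modules = sorted({mod for _, mod in frames})
print(modules)
```

Run against the full dmesg text instead of the sample, this gives a quick list of modules that appear on the faulting path, which can then be compared with the `Modules linked in:` line, where the `(OE)` suffix marks an out-of-tree, unsigned module.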

Your Environment

ming12713 commented 1 year ago

Update: calico-ipam hangs and produces coredump errors. (Screenshot from 2023-08-30 13-42-45)

lwr20 commented 1 year ago

Isn't this a kernel bug? It should never be possible for a user-space program to cause the kernel to crash.

Can you raise with your kernel/distro provider please?

ming12713 commented 1 year ago

> Isn't this a kernel bug? It should never be possible for a user-space program to cause the kernel to crash.
>
> Can you raise with your kernel/distro provider please?

OK, thanks.