Open maxpain opened 2 weeks ago
I get a kernel panic almost every day on Talos Linux v1.8.2 (Linux 6.6.58). Talos is deployed on bare-metal nodes (Dell R6615) with NVMe SSDs. For the network, I use Broadcom 2x25G (50G in an LACP bond) with MTU 9000 (jumbo frames).
I use an image built on factory.talos.dev:
customization:
    extraKernelArgs:
        - console=ttyS0,115200n8r
        - -lockdown
        - lockdown=integrity
        - cpufreq.default_governor=performance
        - amd_pstate=active
        - mitigations=off
        - iommu=off
    systemExtensions:
        officialExtensions:
            - siderolabs/amd-ucode
            - siderolabs/amdgpu-firmware
            - siderolabs/drbd
For CNI I use Cilium in eBPF mode.
[40145.614353] general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI
[40145.624361] CPU: 18 PID: 234918 Comm: conn48291 Tainted: G O 6.6.58-talos #1
[40145.632800] Hardware name: Dell Inc. PowerEdge R6615/067N9T, BIOS 1.9.5 09/12/2024
[40145.640376] RIP: 0010:is_uprobe_at_func_entry+0x28/0x80
[40145.645609] Code: 90 90 0f 1f 44 00 00 65 48 8b 04 25 80 e3 02 00 48 83 b8 30 0b 00 00 00 74 60 48 8b 80 30 0b 00 00 48 8b 50 30 48 85 d2 74 50 <80> 3a 55 b8 01 00 00 00 74 1b 48 8b 8f 88 00 00 00 48 83 f9 33 74
[40145.664366] RSP: 0018:ffffc900007c8bc8 EFLAGS: 00010082
[40145.669599] RAX: ffff88813eafb120 RBX: ffffc900007c8c20 RCX: 00007f116e206296
[40145.676740] RDX: 9e759c37ee555c76 RSI: 0000000000000001 RDI: ffffc90111fa3f58
[40145.683880] RBP: ffffc90111fa3f58 R08: 000000000002aee0 R09: 0000000000000008
[40145.691021] R10: ffffc90111fa0000 R11: ffffc900007c8ff8 R12: 0000000000000000
[40145.698162] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[40145.705303] FS: 00007f113e959700(0000) GS:ffff88defb500000(0000) knlGS:0000000000000000
[40145.713398] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40145.719155] CR2: 000015b40194c804 CR3: 0000000363b74003 CR4: 0000000000f70ee0
[40145.726294] PKRU: 55555554
[40145.729014] Call Trace:
[40145.731468] <IRQ>
[40145.733502] ? die_addr+0x36/0x90
[40145.736836] ? exc_general_protection+0x217/0x420
[40145.741553] ? asm_exc_general_protection+0x26/0x30
[40145.746450] ? is_uprobe_at_func_entry+0x28/0x80
[40145.751083] perf_callchain_user+0x20a/0x360
[40145.755365] get_perf_callchain+0x147/0x1d0
[40145.759559] bpf_get_stackid+0x60/0x90
[40145.763319] bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b
[40145.769333] ? __smp_call_single_queue+0xad/0x120
[40145.774049] bpf_overflow_handler+0x75/0x110
[40145.778330] __perf_event_overflow+0x114/0x360
[40145.782787] perf_swevent_hrtimer+0x134/0x150
[40145.787155] ? __wake_up_common+0x73/0x180
[40145.791258] ? timerqueue_del+0x2e/0x50
[40145.795107] ? __pfx_perf_swevent_hrtimer+0x10/0x10
[40145.799996] __hrtimer_run_queues+0x118/0x240
[40145.804365] ? ktime_get_update_offsets_now+0x49/0x110
[40145.809511] hrtimer_interrupt+0xf8/0x240
[40145.813531] __sysvec_apic_timer_interrupt+0x4a/0xe0
[40145.818508] sysvec_apic_timer_interrupt+0x6d/0x90
[40145.823310] </IRQ>
[40145.825426] <TASK>
[40145.827537] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[40145.832687] RIP: 0010:__kmem_cache_free+0x1cb/0x350
[40145.837576] Code: 48 85 db 0f 84 00 01 00 00 48 89 c2 48 0f ca 49 33 94 24 b8 00 00 00 48 89 10 49 8b 04 24 65 48 03 05 99 bd 37 61 48 8b 70 08 <4c> 39 68 10 0f 85 0b 01 00 00 48 8b 10 41 8b 44 24 28 48 01 d8 48
[40145.856331] RSP: 0018:ffffc90111fa3b70 EFLAGS: 00000282
[40145.861561] RAX: ffff88defb533910 RBX: ffff88813eafb120 RCX: ffffea0000000000
[40145.868698] RDX: 9e759c37ee555c76 RSI: 0000000000119862 RDI: ffff88810004e200
[40145.875836] RBP: ffffc90111fa3bc0 R08: 0000000000000086 R09: 00007f1153f9f9c0
[40145.882980] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810004e200
[40145.890120] R13: ffffea0004fabec0 R14: 0000000000000000 R15: 0000000000000000
[40145.897266] ? uprobe_free_utask+0x62/0x80
[40145.901378] ? acct_collect+0x4c/0x220
[40145.905141] uprobe_free_utask+0x62/0x80
[40145.909075] mm_release+0x12/0xb0
[40145.912401] do_exit+0x26b/0xaa0
[40145.915643] __x64_sys_exit+0x1b/0x20
[40145.919317] do_syscall_64+0x5a/0x80
[40145.922911] entry_SYSCALL_64_after_hwframe+0x78/0xe2
[40145.927976] RIP: 0033:0x7f116e206296
[40145.931565] Code: 28 06 00 00 0f 84 ec 01 00 00 48 8b 44 24 08 f6 80 08 03 00 00 40 0f 85 7a 01 00 00 ba 3c 00 00 00 0f 1f 00 31 ff 89 d0 0f 05 <eb> f8 48 89 c8 48 c7 00 00 00 00 00 48 8d 48 f8 48 39 d0 75 ed 48
[40145.950321] RSP: 002b:00007f113e958a40 EFLAGS: 00000246 ORIG_RAX: 000000000000003c
[40145.957891] RAX: ffffffffffffffda RBX: 00007f113e859000 RCX: 00007f116e206296
[40145.965033] RDX: 000000000000003c RSI: 00007f1153f9f9c0 RDI: 0000000000000000
[40145.972177] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000056b90006
[40145.979317] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f1149f8925e
[40145.986456] R13: 00007f1149f8925f R14: 00007f113e959700 R15: 00007f113e958b00
[40145.993606] </TASK>
[40145.995808] Modules linked in: drbd_transport_tcp(O) drbd(O) ahci i40e sp5100_tco bnxt_en amd64_edac megaraid_sas libahci nvme k10temp watchdog
[40146.008673] ---[ end trace 0000000000000000 ]---
[40146.013298] RIP: 0010:is_uprobe_at_func_entry+0x28/0x80
[40146.018531] Code: 90 90 0f 1f 44 00 00 65 48 8b 04 25 80 e3 02 00 48 83 b8 30 0b 00 00 00 74 60 48 8b 80 30 0b 00 00 48 8b 50 30 48 85 d2 74 50 <80> 3a 55 b8 01 00 00 00 74 1b 48 8b 8f 88 00 00 00 48 83 f9 33 74
[40146.037290] RSP: 0018:ffffc900007c8bc8 EFLAGS: 00010082
[40146.042521] RAX: ffff88813eafb120 RBX: ffffc900007c8c20 RCX: 00007f116e206296
[40146.049662] RDX: 9e759c37ee555c76 RSI: 0000000000000001 RDI: ffffc90111fa3f58
[40146.056805] RBP: ffffc90111fa3f58 R08: 000000000002aee0 R09: 0000000000000008
[40146.063946] R10: ffffc90111fa0000 R11: ffffc900007c8ff8 R12: 0000000000000000
[40146.071088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[40146.078227] FS: 00007f113e959700(0000) GS:ffff88defb500000(0000) knlGS:0000000000000000
[40146.086321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40146.092077] CR2: 000015b40194c804 CR3: 0000000363b74003 CR4: 0000000000f70ee0
[40146.099222] PKRU: 55555554
[40146.101943] Kernel panic - not syncing: Fatal exception in interrupt
[40146.108739] Kernel Offset: disabled
[40146.112246] Rebooting in 10 seconds..
This seems to involve BPF code, so it is probably the Cilium CNI?
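To narrow down which BPF program is involved, it can help to pull the reliable frames out of the `<IRQ>` part of the trace. The following is a minimal sketch, not an official tool: the helper names `reliable_frames` and `bpf_program_names` are made up for illustration, and it assumes the usual `bpf_prog_<16-hex-digit tag>_<name>` symbol naming for JITed programs. The frames are copied from the trace in this issue.

```python
import re

# Frames copied from the <IRQ> section of the trace in this issue.
# A leading "?" marks an entry the unwinder considers unreliable.
IRQ_TRACE = """\
? die_addr+0x36/0x90
? exc_general_protection+0x217/0x420
? asm_exc_general_protection+0x26/0x30
? is_uprobe_at_func_entry+0x28/0x80
perf_callchain_user+0x20a/0x360
get_perf_callchain+0x147/0x1d0
bpf_get_stackid+0x60/0x90
bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b
? __smp_call_single_queue+0xad/0x120
bpf_overflow_handler+0x75/0x110
__perf_event_overflow+0x114/0x360
perf_swevent_hrtimer+0x134/0x150
"""

# symbol+0xoffset/0xsize, optionally prefixed with "? "
FRAME = re.compile(
    r"^(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+$"
)


def reliable_frames(trace):
    """Return symbols of frames that are not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.match(line.strip())
        if m and not m.group("unreliable"):
            frames.append(m.group("symbol"))
    return frames


def bpf_program_names(frames):
    """Extract the program name from bpf_prog_<tag>_<name> symbols."""
    names = []
    for sym in frames:
        m = re.match(r"bpf_prog_[0-9a-f]{16}_(\w+)$", sym)
        if m:
            names.append(m.group(1))
    return names


frames = reliable_frames(IRQ_TRACE)
print(frames)
print(bpf_program_names(frames))
```

The only BPF program in the chain is named `do_perf_event`, and it runs from a perf event overflow handler and calls `bpf_get_stackid`. That pattern looks more like a sampling profiler than a networking datapath program, although this is an inference from the trace and not something the log states.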
bpf
Anyway, let's wait for the next Linux release.
Yes
Maybe the panic is caused by Coroot: https://github.com/coroot/pyroscope/blob/6e4a1fd70266628af60fc11f5ccb12267dbb9dd6/pkg/agent/ebpfspy/bpf/profile.bpf.c#L54
https://github.com/coroot/coroot/issues/377