projectacrn / acrn-hypervisor

Project ACRN hypervisor
BSD 3-Clause "New" or "Revised" License

SOS CPU stall #8744

Open yichongt opened 1 month ago

yichongt commented 1 month ago

Describe the bug When the SOS shares vCPU with Waag, the SOS may hit a kernel panic: one of its CPUs gets stuck and stalls for a long time, which causes the system to reboot.

Platform RPL-S 13700E

Codebase Both 3.2 release and 3.3 release

Scenario The SOS shares CPUs with all Waag vCPUs, without own_pcpu checked.

To Reproduce

  1. Boot Waag
  2. Run the PassMark benchmark CPU test for several iterations

Expected behavior Waag does not get stuck during benchmarking

Additional context SOS kernel dmesg output in the ACRN console:

[14078.492047] rcu: INFO: rcu_preempt self-detected stall on CPU
[14078.492220] rcu: 5-....: (12571 ticks this GP) idle=7b94/1/0x4000000000000000 softirq=678356/678356 fqs=4205
[14078.492452] (t=21000 jiffies g=1086585 q=128 ncpus=6)
[14078.492455] CPU: PID: 315330 Comm: snap-confine Tainted: G U 6.1.80-acrn-service-vm-375513-g7159ad071be8 #1
[14078.492459] Hardware name: Default string Default string/Default string, BIOS 5.27 06/14/2023
[14078.492460] RIP: 0010:smp_call_function_many_cond+0xfd/0x2e0
[14078.492466] Code: d0 48 89 df e8 b4 d3 5a 00 39 05 4e de fc 01 76 b0 48 63 d0 49 8b 0c 24 48 03 0c d5 c0 e8 c6 94 8b 51 08 83 e2 01 74 0a f3 90 <8b> 51 08 83 e2 01 75 f6 83 c0 01 eb c1 9c 58 fa f6 c4 02 0f 85 8f
[14078.492468] RSP: 0018:ffffb2c680f43ba0 EFLAGS: 00000202
[14078.492471] RAX: 0000000000000002 RBX
[14106.017626] watchdog: BUG: soft lockup - CPU#5 stuck for 49s! [snap-confine:315330]
[14106.017960] Kernel panic - not syncing: softlockup: hung tasks
[14106.018095] CPU: 5 PID: 315330 Comm: snap-confine Tainted: G U L 6.1.80-acrn-service-vm-375513-g7159ad071be8 #1
[14106.018344] Hardware name: Default string Default string/Default string, BIOS 5.27 06/14/2023
[14106.018533] Call Trace:
[14106.018594]
[14106.018646] dump_stack_lvl+0x49/0x62
[14106.018734] dump_stack+0x10/0x16
[14106.018815] panic+0x114/0x29a
[14106.018891] watchdog_timer_fn.cold.14+0xc/0x16
[14106.019000] ? softlockup_fn+0x30/0x30
[14106.019089] hrtimer_run_queues+0xa5/0x2c0
[14106.019191] hrtimer_interrupt+0xf6/0x220
[14106.019286] sysvec_apic_timer_interrupt+0x5f/0x110
[14106.019404] sysvec_apic_timer_interrupt+0x6f/0xa0
[14106.019517]
[14106.019570]
[14106.019624] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[14106.019744] RIP: 0010:smp_call_function_many_cond+0xfd/0x2e0
[14106.019875] Code: d0 48 89 df e8 b4 d3 5a 00 39 05 4e de fc 01 76 b0 48 63 d0 49 8b 0c 24 48 03 0c d5 c0 e8 c6 94 8b 51 08 83 e2 01 74 0a f3 90 <8b> 51 08 83 e2 01 75 f6 83 c0 01 eb c1 9c 58 fa f6 c4 02 0f 85 8f
[14106.020280] RSP: 0018:ffffb2c680f43ba0 EFLAGS: 00000202
[14106.020400] RAX: 0000000000000002 RBX: ffffa2ec4856b488 RCX: ffffa2ec484adb40
[14106.020561] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffa2ec4856b488
[14106.020722] RBP: ffffb2c680f43c10 R08: 0000000000000002 R09: ffffa2ec4856b490
[14106.020882] R10: ffffb2c680f43dc0 R11: 0000000000000000 R12: ffffa2ec4856b480
[14106.021042] R13: 0000000000000001 R14: 000000000002db40 R15: 0000000000000008
[14106.021191] ? flush_tlb_all+0x30/0x30
[14106.021270] on_each_cpu_cond_mask+0x29/0x50
[14106.021354] flush_tlb_kernel_range+0x41/0xc0
[14106.021441] purge_vmap_area_lazy+0xba/0x6e0
[14106.021529] ? purge_fragmented_blocks_allcpus+0x40/0x220
[14106.021632] _vm_unmap_aliases+0x116/0x150
[14106.021713] vm_unmap_aliases+0x19/0x20
[14106.021788] change_page_attr_set_clr+0xa0/0x290
[14106.021880] set_memory_ro+0x29/0x30
[14106.021953] bpf_prog_select_runtime+0x11e/0x130
[14106.022044] bpf_prepare_filter+0x541/0x5c0
[14106.022127] bpf_prog_create_from_user+0xc5/0x110
[14106.022220] ? hardlockup_detector_perf_cleanup+0xa0/0xa0
[14106.022324] do_seccomp+0x2c8/0xad0
[14106.022394] __x64_sys_seccomp+0x1a/0x20
[14106.022472] do_syscall_64+0x37/0x90
[14106.022544] entry_SYSCALL_64_after_hwframe+0x64/0xce
[14106.022642] RIP: 0033:0x7ffb6c51e88d
[14106.022713] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
[14106.023050] RSP: 002b:00007fff4ec7cb68 EFLAGS: 00000246 ORIG_RAX: 000000000000013d
[14106.023192] RAX: ffffffffffffffda RBX: 00007ffb6c7d36b0 RCX: 00007ffb6c51e88d
[14106.023325] RDX: 00007fff4ec7cba0 RSI: 0000000000000002 RDI: 0000000000000001
[14106.023459] RBP: 0000558781704430 R08: 00005587817121b0 R09: 00007fff4ec7cba0
[14106.023592] R10: 0000000000000c00 R11: 0000000000000246 R12: 00007fff4ec7cba0
[14106.023726] R13: 00007fff4ec7cba0 R14: 00005587817032a0 R15: 00007fff4ec7e958
[14106.023860]
[14108.160285] Shutting down cpus with NMI
[14108.189043] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[14127.067252] Rebooting in 10 seconds..

The call trace and the stuck CPU may differ, but the kernel panic type is the same every time.
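
Since the stuck CPU and call trace vary between runs, it may help to pull the same few fields out of each captured console log so reproductions can be compared side by side. The following is only a rough sketch, not part of ACRN or the kernel: the function name `summarize` and the returned field names are made up for illustration, and the two patterns are based solely on the message formats seen in the output above.

```python
import re

# Patterns modeled on the messages in the console output above.
# They are assumptions about the log format, not an official parser.
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)
# Only match kernel-mode RIP lines (code segment 0010) that carry a symbol.
RIP_RE = re.compile(r"RIP: 0010:([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize(log_text):
    """Return details of the first soft lockup in log_text, or None."""
    lockup = LOCKUP_RE.search(log_text)
    if lockup is None:
        return None
    # Look for the first kernel RIP reported after the lockup message.
    rip = RIP_RE.search(log_text, lockup.end())
    return {
        "cpu": int(lockup.group(1)),
        "stuck_seconds": int(lockup.group(2)),
        "comm": lockup.group(3),
        "pid": int(lockup.group(4)),
        "kernel_rip": rip.group(1) if rip else None,
    }


# Two lines copied from the log in this report.
sample = (
    "[14106.017626] watchdog: BUG: soft lockup - CPU#5 stuck for 49s! "
    "[snap-confine:315330]\n"
    "[14106.019744] RIP: 0010:smp_call_function_many_cond+0xfd/0x2e0\n"
)

if __name__ == "__main__":
    print(summarize(sample))
```

Running this over logs from several reproductions would show whether the lockup always lands in `smp_call_function_many_cond` or only in this particular capture.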