Open anpetrovfb opened 4 years ago
We've determined that certain SKUs with 32 CPUs exhibit soft lockup issues. The failure is semi-random, but warm reboot tests reliably start hitting it after 20-30 iterations. It manifests itself in the following kernel messages:
[ 38.084920] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 38.084924] 2-...: (20978 ticks this GP) idle=e23/140000000000001/0 softirq=98/98 fqs=5250
[ 38.084925] (detected by 28, t=21002 jiffies, g=-224, c=-225, q=24)
[ 38.084927] Sending NMI from CPU 28 to CPUs 2:
[ 38.085930] NMI backtrace for cpu 2
[ 38.085931] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.3-41_fbk10_3544_gea63179 #41
[ 38.085931] Hardware name: Open Compute Project Mono Lake/Mono Lake, BIOS 4.10-1401-g6a657c2646-dirty 11/05/2019
[ 38.085931] task: ffff88085b678000 task.stack: ffffc90003130000
[ 38.085932] RIP: 0010:delay_tsc+0x35/0x50
[ 38.085932] RSP: 0000:ffff88085ee83c30 EFLAGS: 00000097
[ 38.085933] RAX: 000000789b9285bb RBX: ffffffff82311f80 RCX: 000000789b927ee9
[ 38.085933] RDX: 00000000000006d2 RSI: 0000000000000002 RDI: 0000000000000706
[ 38.085934] RBP: ffff88085ee83c40 R08: 0000000000000010 R09: 0000000000000000
[ 38.085934] R10: 00000000000002bb R11: 0000000000000000 R12: 00000000000026e9
[ 38.085934] R13: 0000000000000020 R14: ffffffff820d1388 R15: 0000000000000034
[ 38.085935] FS: 0000000000000000(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
[ 38.085935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.085935] CR2: ffffc90003944000 CR3: 000000007de09000 CR4: 00000000003406e0
[ 38.085936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 38.085936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 38.085936] Call Trace:
[ 38.085936] <IRQ>
[ 38.085937] ? __const_udelay+0x32/0x40
[ 38.085937] wait_for_xmitr+0x2c/0xa0
[ 38.085937] serial8250_console_putchar+0x1c/0x30
[ 38.085937] ? wait_for_xmitr+0xa0/0xa0
[ 38.085938] uart_console_write+0x30/0x70
[ 38.085938] serial8250_console_write+0xa1/0x200
[ 38.085938] ? msg_print_text+0xa2/0x110
[ 38.085939] univ8250_console_write+0x22/0x30
[ 38.085939] console_unlock+0x3e5/0x520
[ 38.085939] vprintk_emit+0x225/0x2c0
[ 38.085939] vprintk_default+0x1f/0x30
[ 38.085940] vprintk_func+0x27/0x60
[ 38.085940] printk+0x43/0x4b
[ 38.085940] rcu_check_callbacks+0x465/0x8b0
[ 38.085940] ? account_system_index_time+0x8c/0xa0
[ 38.085941] ? tick_nohz_handler+0xf0/0xf0
[ 38.085941] update_process_times+0x57/0xa0
[ 38.085941] tick_sched_timer+0x57/0xd0
[ 38.085941] __hrtimer_run_queues+0xd8/0x220
[ 38.085942] hrtimer_interrupt+0xab/0x190
[ 38.085942] ? unmap_pmd_range+0x2d0/0x2d0
[ 38.085942] smp_apic_timer_interrupt+0x63/0x90
[ 38.085942] apic_timer_interrupt+0x86/0x90
The solution to this issue is to rebuild the FSP with the maximum number of CPUs raised to 32.
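For anyone reproducing this with warm reboot tests, the stall signature can be picked out of a captured kernel log automatically. The following is only a minimal sketch, not part of the original report: the function name `find_rcu_stalls` is hypothetical, and the patterns assume the message format matches the log quoted in this issue (other kernel versions may word the stall report differently).

```python
import re

# Stall header as it appears in the log quoted in this issue, e.g.
# "[ 38.084920] INFO: rcu_sched detected stalls on CPUs/tasks:"
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+INFO: rcu_sched detected stalls on CPUs/tasks"
)
# NMI backtrace line naming a stalled CPU, e.g. "NMI backtrace for cpu 2"
NMI_RE = re.compile(r"NMI backtrace for cpu (\d+)")


def find_rcu_stalls(log_text):
    """Return one (timestamp_seconds, [stalled_cpu_ids]) tuple per stall report.

    Hypothetical helper: it scans the log line by line, starts a new entry
    at each stall header, and attaches CPU ids from the NMI backtrace lines
    that follow it.
    """
    stalls = []
    for line in log_text.splitlines():
        header = STALL_RE.search(line)
        if header:
            stalls.append((float(header.group(1)), []))
            continue
        nmi = NMI_RE.search(line)
        if nmi and stalls:
            stalls[-1][1].append(int(nmi.group(1)))
    return stalls
```

A reboot loop could feed the captured `dmesg` text from each iteration into `find_rcu_stalls` and stop at the first non-empty result, which records the iteration at which the stall first appears.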