intel / FSP

Intel(R) Firmware Support Package (FSP)

Broadwell-DE CPUs soft hang issues #44

Open anpetrovfb opened 4 years ago

anpetrovfb commented 4 years ago

We've determined that certain SKUs with 32 CPUs exhibit soft lockups. The failure is semi-random, but warm reboot tests reliably start hitting it after 20-30 iterations. It manifests as the following kernel messages:

[   38.084920] INFO: rcu_sched detected stalls on CPUs/tasks:
[   38.084924]  2-...: (20978 ticks this GP) idle=e23/140000000000001/0 softirq=98/98 fqs=5250
[   38.084925]  (detected by 28, t=21002 jiffies, g=-224, c=-225, q=24)
[   38.084927] Sending NMI from CPU 28 to CPUs 2:
[   38.085930] NMI backtrace for cpu 2
[   38.085931] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.3-41_fbk10_3544_gea63179 #41
[   38.085931] Hardware name: Open Compute Project Mono Lake/Mono Lake, BIOS 4.10-1401-g6a657c2646-dirty 11/05/2019
[   38.085931] task: ffff88085b678000 task.stack: ffffc90003130000
[   38.085932] RIP: 0010:delay_tsc+0x35/0x50
[   38.085932] RSP: 0000:ffff88085ee83c30 EFLAGS: 00000097
[   38.085933] RAX: 000000789b9285bb RBX: ffffffff82311f80 RCX: 000000789b927ee9
[   38.085933] RDX: 00000000000006d2 RSI: 0000000000000002 RDI: 0000000000000706
[   38.085934] RBP: ffff88085ee83c40 R08: 0000000000000010 R09: 0000000000000000
[   38.085934] R10: 00000000000002bb R11: 0000000000000000 R12: 00000000000026e9
[   38.085934] R13: 0000000000000020 R14: ffffffff820d1388 R15: 0000000000000034
[   38.085935] FS:  0000000000000000(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
[   38.085935] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   38.085935] CR2: ffffc90003944000 CR3: 000000007de09000 CR4: 00000000003406e0
[   38.085936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   38.085936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   38.085936] Call Trace:
[   38.085936]  <IRQ>
[   38.085937]  ? __const_udelay+0x32/0x40
[   38.085937]  wait_for_xmitr+0x2c/0xa0
[   38.085937]  serial8250_console_putchar+0x1c/0x30
[   38.085937]  ? wait_for_xmitr+0xa0/0xa0
[   38.085938]  uart_console_write+0x30/0x70
[   38.085938]  serial8250_console_write+0xa1/0x200
[   38.085938]  ? msg_print_text+0xa2/0x110
[   38.085939]  univ8250_console_write+0x22/0x30
[   38.085939]  console_unlock+0x3e5/0x520
[   38.085939]  vprintk_emit+0x225/0x2c0
[   38.085939]  vprintk_default+0x1f/0x30
[   38.085940]  vprintk_func+0x27/0x60
[   38.085940]  printk+0x43/0x4b
[   38.085940]  rcu_check_callbacks+0x465/0x8b0
[   38.085940]  ? account_system_index_time+0x8c/0xa0
[   38.085941]  ? tick_nohz_handler+0xf0/0xf0
[   38.085941]  update_process_times+0x57/0xa0
[   38.085941]  tick_sched_timer+0x57/0xd0
[   38.085941]  __hrtimer_run_queues+0xd8/0x220
[   38.085942]  hrtimer_interrupt+0xab/0x190
[   38.085942]  ? unmap_pmd_range+0x2d0/0x2d0
[   38.085942]  smp_apic_timer_interrupt+0x63/0x90
[   38.085942]  apic_timer_interrupt+0x86/0x90

The fix for this issue is to rebuild FSP with the maximum number of CPUs raised to 32.
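
For anyone reproducing this with a warm reboot loop, checking the kernel messages above can be automated. Below is a minimal sketch in Python, assuming the stall report keeps the wording shown in the 4.11 log in this issue (other kernel versions may phrase it differently). The names `find_rcu_stalls`, `STALL_HEADER` and `STALLED_CPU` are hypothetical and not part of FSP or any existing tool.

```python
import re

# Header line printed when the RCU stall detector fires (wording taken from
# the log in this issue; assumed, not guaranteed, for other kernel versions).
STALL_HEADER = re.compile(r"INFO: rcu_sched detected stalls on CPUs/tasks")

# Per-CPU detail line, e.g. "[   38.084924]  2-...: (20978 ticks this GP) ..."
STALLED_CPU = re.compile(r"\]\s+(\d+)-\S*: \((\d+) ticks this GP\)")


def find_rcu_stalls(log_text):
    """Return a list of (cpu, ticks) tuples, one per stalled CPU reported."""
    stalls = []
    in_report = False
    for line in log_text.splitlines():
        if STALL_HEADER.search(line):
            # The per-CPU lines follow directly after the header.
            in_report = True
            continue
        if in_report:
            match = STALLED_CPU.search(line)
            if match:
                stalls.append((int(match.group(1)), int(match.group(2))))
            else:
                # First non-matching line ends the per-CPU list.
                in_report = False
    return stalls


SAMPLE = """\
[   38.084920] INFO: rcu_sched detected stalls on CPUs/tasks:
[   38.084924]  2-...: (20978 ticks this GP) idle=e23/140000000000001/0 softirq=98/98 fqs=5250
[   38.084925]  (detected by 28, t=21002 jiffies, g=-224, c=-225, q=24)
"""

if __name__ == "__main__":
    print(find_rcu_stalls(SAMPLE))
```

A test harness could feed the captured serial console or `dmesg` output of each boot to `find_rcu_stalls` and stop the reboot loop on the first non-empty result.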