open-power / skiboot

OPAL boot and runtime firmware for POWER
Apache License 2.0
99 stars 134 forks source link

fast-reset: NMI watchdog: BUG: soft lockup - CPU#113 stuck for 22s! [kopald:854] #136

Open ghost opened 6 years ago

ghost commented 6 years ago

(bringing over from internal bugzilla)

Pridhiviraj Paidipeddi 2016-12-24 00:15:00 AEDT:

Install a 8284-22A PowerNV system with FW860.20 latest firmware build, and install a Ubuntu 16.10. Then below test steps caused the NMI Watchdog: Bug soft lockup issue in petitboot kernel.

Test-steps:

  1. Boot the system to runtime(i,e into petitboot)

  2. Enable the fast-reboot by using below command / # nvram -p ibm,skiboot --update-config experimental-fast-reset=feeling-lucky / # nvram --print-config=experimental-fast-reset -p ibm,skiboot feeling-lucky

  3. Then do a FSP ResetReload test smgr resetReload

  4. Once FSP comes UP, do reboot in petitboot.

  5. After system reaching to peitiboot, check dmesg for softlockup's

[   32.107804] NMI watchdog: BUG: soft lockup - CPU#113 stuck for 22s! [kopald:854]
[   32.107894] Modules linked in: sr_mod sd_mod cdrom lpfc ipr crc_t10dif crct10dif_generic crct10dif_common
[   32.107903] CPU: 113 PID: 854 Comm: kopald Not tainted 4.4.27-openpower1 #1
[   32.107905] task: c000002fc61a5ef0 ti: c000002fc621c000 task.ti: c000002fc621c000
[   32.107907] NIP: c00000000000d8dc LR: c00000000000d8dc CTR: c000000000015d00
[   32.107908] REGS: c000002fc621f630 TRAP: 0901   Not tainted  (4.4.27-openpower1)
[   32.107909] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42000024  XER: 00000000
[   32.107915] CFAR: c0000000000a5448 SOFTE: 1 
[   32.107915] GPR00: c000000000059534 c000002fc621f8b0 c00000000089a500 0000000000000900 
[   32.107915] GPR04: 0000000000000000 0000000022b2980d 0000000004898d98 0000000000000001 
[   32.107915] GPR08: 0000000000000000 0000000000000002 000000e371050eb6 0000000000000003 
[   32.107915] GPR12: c000000000038d10 c00000000fe95300 
[   32.107925] NIP [c00000000000d8dc] arch_local_irq_restore+0x5c/0x80
[   32.107927] LR [c00000000000d8dc] arch_local_irq_restore+0x5c/0x80
[   32.107928] Call Trace:
[   32.107929] [c000002fc621f8b0] [c00000000007e7e0] vtime_common_account_irq_enter+0x48/0x5c (unreliable)
[   32.107932] [c000002fc621f8d0] [c000000000059534] __do_softirq+0xe0/0x2b4
[   32.107934] [c000002fc621f9c0] [c0000000000599a0] irq_exit+0x74/0xcc
[   32.107935] [c000002fc621f9e0] [c00000000001646c] timer_interrupt+0xb0/0xc4
[   32.107937] [c000002fc621fa10] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[   32.107941] --- interrupt: 901 at opal_handle_events+0xb0/0xb8
[   32.107941]     LR = opal_handle_events+0x5c/0xb8
[   32.107943] [c000002fc621fd50] [c00000000003b6ac] kopald+0x2c/0x64
[   32.107945] [c000002fc621fd80] [c0000000000717a0] kthread+0xfc/0x108
[   32.107947] [c000002fc621fe30] [c0000000000094b4] ret_from_kernel_thread+0x5c/0xa8
[   32.107948] Instruction dump:
[   32.107949] f821ffe1 419e000c e92d0020 7d210164 39200000 992d0252 4bffff11 39400001 
[   32.107952] 994d0252 2fa30000 419e0010 4bff4591 <60000000> 48000010 e92d0020 61298000 
[   34.138212] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel@redhat.com
[   34.187971] tg3 0001:03:00.3 enP1p3s0f3: renamed from eth3
[   34.220791] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
[   34.260346] EXT4-fs (dm-5): INFO: recovery required on readonly filesystem
[   34.260351] EXT4-fs (dm-5): write access will be enabled during recovery
[   34.272708] EXT4-fs (dm-5): recovery complete
[   34.272743] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null)
[   36.479095]  rport-3:0-4: blocked FC remote port time out: removing rport
[   55.007967] tg3 0003:05:00.1 enP3p5s0f1: renamed from eth5
[   78.017790] INFO: rcu_sched self-detected stall on CPU
[   78.017795]  100-...: (1 GPs behind) idle=167/140000000000002/0 softirq=176/176 fqs=2082 
[   78.017796]   (t=2100 jiffies g=-220 c=-221 q=97)
[   78.017817] Task dump for CPU 100:
[   78.017818] kopald          R  running task        0   854      2 0x00000804
[   78.017821] Call Trace:
[   78.017826] [c000002fc621f1e0] [c00000000007c6f0] sched_show_task+0x10c/0x118 (unreliable)
[   78.017830] [c000002fc621f250] [c00000000009eda8] rcu_dump_cpu_stacks+0xa0/0xd0
[   78.017832] [c000002fc621f2a0] [c0000000000a2428] rcu_check_callbacks+0x294/0x734
[   78.017834] [c000002fc621f3c0] [c0000000000a44e0] update_process_times+0x3c/0x74
[   78.017837] [c000002fc621f3f0] [c0000000000b2c74] tick_sched_handle.isra.14+0x60/0x6c
[   78.017840] [c000002fc621f420] [c0000000000b2ccc] tick_sched_timer+0x4c/0x84
[   78.017841] [c000002fc621f460] [c0000000000a4d84] __hrtimer_run_queues+0xd0/0x188
[   78.017843] [c000002fc621f4e0] [c0000000000a53f0] hrtimer_interrupt+0xb0/0x1fc
[   78.017845] [c000002fc621f5a0] [c0000000000162c0] __timer_interrupt+0x7c/0xe0
[   78.017847] [c000002fc621f5e0] [c000000000016468] timer_interrupt+0xac/0xc4
[   78.017849] [c000002fc621f610] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[   78.017852] --- interrupt: 901 at arch_local_irq_restore+0x5c/0x80
[   78.017852]     LR = arch_local_irq_restore+0x5c/0x80
[   78.017854] [c000002fc621f900] [c00000000007e7e0] vtime_common_account_irq_enter+0x48/0x5c (unreliable)
[   78.017857] [c000002fc621f920] [c000000000059534] __do_softirq+0xe0/0x2b4
[   78.017859] [c000002fc621fa10] [c0000000000599a0] irq_exit+0x74/0xcc
[   78.017860] [c000002fc621fa30] [c00000000001646c] timer_interrupt+0xb0/0xc4
[   78.017862] [c000002fc621fa60] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[   78.017866] --- interrupt: 901 at opal_handle_events+0xb4/0xb8
[   78.017866]     LR = opal_handle_events+0x5c/0xb8
[   78.017868] [c000002fc621fd50] [c00000000003b6ac] kopald+0x2c/0x64 (unreliable)
[   78.017871] [c000002fc621fd80] [c0000000000717a0] kthread+0xfc/0x108
[   78.017873] [c000002fc621fe30] [c0000000000094b4] ret_from_kernel_thread+0x5c/0xa8
[   80.047889] tg3 0003:05:00.2 enP3p5s0f2: renamed from eth6
[  103.057786] INFO: rcu_sched self-detected stall on CPU
[  103.057792]  101-...: (1 GPs behind) idle=a0b/140000000000002/0 softirq=288/289 fqs=2099 
[  103.057795]   (t=2100 jiffies g=-218 c=-219 q=72)
[  103.057823] Task dump for CPU 101:
[  103.057825] kopald          R  running task        0   854      2 0x00000804
[  103.057829] Call Trace:
[  103.057834] [c000002fc621f1e0] [c00000000007c6f0] sched_show_task+0x10c/0x118 (unreliable)
[  103.057839] [c000002fc621f250] [c00000000009eda8] rcu_dump_cpu_stacks+0xa0/0xd0
[  103.057843] [c000002fc621f2a0] [c0000000000a2428] rcu_check_callbacks+0x294/0x734
[  103.057847] [c000002fc621f3c0] [c0000000000a44e0] update_process_times+0x3c/0x74
[  103.057851] [c000002fc621f3f0] [c0000000000b2c74] tick_sched_handle.isra.14+0x60/0x6c
[  103.057855] [c000002fc621f420] [c0000000000b2ccc] tick_sched_timer+0x4c/0x84
[  103.057858] [c000002fc621f460] [c0000000000a4d84] __hrtimer_run_queues+0xd0/0x188
[  103.057861] [c000002fc621f4e0] [c0000000000a53f0] hrtimer_interrupt+0xb0/0x1fc
[  103.057865] [c000002fc621f5a0] [c0000000000162c0] __timer_interrupt+0x7c/0xe0
[  103.057868] [c000002fc621f5e0] [c000000000016468] timer_interrupt+0xac/0xc4
[  103.057872] [c000002fc621f610] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  103.057879] --- interrupt: 901 at arch_local_irq_restore+0x5c/0x80
[  103.057879]     LR = arch_local_irq_restore+0x5c/0x80
[  103.057883] [c000002fc621f900] [c00000000007e7e0] vtime_common_account_irq_enter+0x48/0x5c (unreliable)
[  103.057888] [c000002fc621f920] [c000000000059534] __do_softirq+0xe0/0x2b4
[  103.057891] [c000002fc621fa10] [c0000000000599a0] irq_exit+0x74/0xcc
[  103.057895] [c000002fc621fa30] [c00000000001646c] timer_interrupt+0xb0/0xc4
[  103.057898] [c000002fc621fa60] [c00000000000995c] restore_check_irq_replay+0x54/0terrupt+0xac/0xc4
[  203.267824] [c000002fc621f520] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  203.267827] --- interrupt: 901 at arch_local_irq_restore+0x5c/0x80
[  203.267827]     LR = arch_local_irq_restore+0x5c/0x80
[  203.267830] [c000002fc621f810] [c00000000007e7e0] vtime_common_account_irq_enter+0x48/0x5c (unreliable)
[  203.267832] [c000002fc621f830] [c000000000059534] __do_softirq+0xe0/0x2b4
[  203.267834] [c000002fc621f920] [c0000000000599a0] irq_exit+0x74/0xcc
[  203.267835] [c000002fc621f940] [c00000000001646c] timer_interrupt+0xb0/0xc4
[  203.267837] [c000002fc621f970] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  203.267841] --- interrupt: 901 at schedule_timeout+0x84/0x1bc
[  203.267841]     LR = schedule_timeout+0x1c/0x1bc
[  203.267844] [c000002fc621fc60] [c0000000005a58bc] schedule_timeout+0x1a0/0x1bc (unreliable)
[  203.267846] [c000002fc621fd30] [c0000000000a3fdc] msleep_interruptible+0x40/0x5c
[  203.267848] [c000002fc621fd50] [c00000000003b6bc] kopald+0x3c/0x64
[  203.267850] [c000002fc621fd80] [c0000000000717a0] kthread+0xfc/0x108
[  203.267851] [c000002fc621fe30] [c0000000000094b4] ret_from_kernel_thread+0x5c/0xa8
[  226.327769] INFO: rcu_sched self-detected stall on CPU
[  226.327773]  49-...: (2098 ticks this GP) idle=741/140000000000002/0 softirq=3144/3144 fqs=2082 
[  226.327774]   (t=2100 jiffies g=-200 c=-201 q=281)
[  226.327796] Task dump for CPU 49:
[  226.327798] kopald          S 0000000000000000     0   854      2 0x00000804
[  226.327800] Call Trace:
[  226.327806] [c000002fc621f0a0] [c00000000007c6f0] sched_show_task+0x10c/0x118 (unreliable)
[  226.327810] [c000002fc621f110] [c00000000009eda8] rcu_dump_cpu_stacks+0xa0/0xd0
[  226.327812] [c000002fc621f160] [c0000000000a2428] rcu_check_callbacks+0x294/0x734
[  226.327814] [c000002fc621f280] [c0000000000a44e0] update_process_times+0x3c/0x74
[  226.327816] [c000002fc621f2b0] [c0000000000b2c74] tick_sched_handle.isra.14+0x60/0x6c
[  226.327818] [c000002fc621f2e0] [c0000000000b2ccc] tick_sched_timer+0x4c/0x84
[  226.327820] [c000002fc621f320] [c0000000000a4d84] __hrtimer_run_queues+0xd0/0x188
[  226.327822] [c000002fc621f3a0] [c0000000000a53f0] hrtimer_interrupt+0xb0/0x1fc
[  226.327824] [c000002fc621f460] [c0000000000162c0] __timer_interrupt+0x7c/0xe0
[  226.327826] [c000002fc621f4a0] [c000000000016468] timer_interrupt+0xac/0xc4
[  226.327828] [c000002fc621f4d0] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  226.327832] --- interrupt: 901 at arch_local_irq_restore+0x5c/0x80
[  226.327832]     LR = arch_local_irq_restore+0x5c/0x80
[  226.327834] [c000002fc621f7c0] [c00000000007e7e0] vtime_common_account_irq_enter+0x48/0x5c (unreliable)
[  226.327837] [c000002fc621f7e0] [c000000000059534] __do_softirq+0xe0/0x2b4
[  226.327839] [c000002fc621f8d0] [c0000000000599a0] irq_exit+0x74/0xcc
[  226.327840] [c000002fc621f8f0] [c00000000001646c] timer_interrupt+0xb0/0xc4
[  226.327842] [c000002fc621f920] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  226.327845] --- interrupt: 901 at lock_timer_base.isra.22+0x30/0x84
[  226.327845]     LR = lock_timer_base.isra.22+0x10/0x84
[  226.327848] [c000002fc621fc10] [00000000ffffd2cc] 0xffffd2cc (unreliable)
[  226.327851] [c000002fc621fc60] [c0000000005a57d0] schedule_timeout+0xb4/0x1bc
[  226.327853] [c000002fc621fd30] [c0000000000a3fdc] msleep_interruptible+0x40/0x5c
[  226.327856] [c000002fc621fd50] [c00000000003b6bc] kopald+0x3c/0x64
[  226.327858] [c000002fc621fd80] [c0000000000717a0] kthread+0xfc/0x108
[  226.327860] [c000002fc621fe30] [c0000000000094b4] ret_from_kernel_thread+0x5c/0xa8
[  253.357774] INFO: rcu_sched self-detected stall on CPU
[  253.357780]  52-...: (2 GPs behind) idle=893/140000000000002/0 softirq=3070/3070 fqs=2085 
[  253.357782]   (t=2100 jiffies g=-196 c=-197 q=168)
[  253.357808] Task dump for CPU 52:
[  253.357810] kopald          S 0000000000000000     0   854      2 0x00000804
[  253.357814] Call Trace:
[  253.357819] [c000002fc621f0a0] [c00000000007c6f0] sched_show_task+0x10c/0x118 (unreliable)
[  253.357823] [c000002fc621f110] [c00000000009eda8] rcu_dump_cpu_stacks+0xa0/0xd0
[  253.357828] [c000002fc621f160] [c0000000000a2428] rcu_check_callbacks+0x294/0x734
[  253.357832] [c000002fc621f280] [c0000000000a44e0] update_process_times+0x3c/0x74
[  253.357836] [c000002fc621f2b0] [c0000000000b2c74] tick_sched_handle.isra.14+0x60/0x6c
[  253.357839] [c000002fc621f2e0] [c0000000000b2ccc] tick_sched_timer+0x4c/0x84
[  253.357842] [c000002fc621f320] [c0000000000a4d84] __hrtimer_run_queues+0xd0/0x188
[  253.357846] [c000002fc621f3a0] [c0000000000a53f0] hrtimer_interrupt+0xb0/0x1fc
[  253.357849] [c000002fc621f460] [c0000000000162c0] __timer_interrupt+0x7c/0xe0
[  253.357852] [c000002fc621f4a0] [c000000000016468] timer_interrupt+0xac/0xc4
[  253.357856] [c000002fc621f4d0] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  253.357863] --- interrupt: 901 at arch_local_irq_restore+0x5c/0x80
[  253.357863]     LR = arch_local_irq_restore+0x5c/0x80
[  253.357867] [c000002fc621f7c0] [c00000000007e7e0] vtime_common_account_irq_enter+0x48/0x5c (unreliable)
[  253.357872] [c000002fc621f7e0] [c000000000059534] __do_softirq+0xe0/0x2b4
[  253.357875] [c000002fc621f8d0] [c0000000000599a0] irq_exit+0x74/0xcc
[  253.357879] [c000002fc621f8f0] [c00000000001646c] timer_interrupt+0xb0/0xc4
[  253.357883] [c000002fc621f920] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  253.357888] --- interrupt: 901 at _savegpr0_29+0x0/0x4
[  253.357888]     LR = _raw_spin_lock_irqsave+0x10/0x90
[  253.357892] [c000002fc621fc10] [00000000ffffe593] 0xffffe593 (unreliable)
[  253.357896] [c000002fc621fc60] [c0000000005a57d0] schedule_timeout+0xb4/0x1bc
[  253.357901] [c000002fc621fd30] [c0000000000a3fdc] msleep_interruptible+0x40/0x5c
[  253.357904] [c000002fc621fd50] [c00000000003b6bc] kopald+0x3c/0x64
[  253.357908] [c000002fc621fd80] [c0000000000717a0] kthread+0xfc/0x108
[  253.357912] [c000002fc621fe30] [c0000000000094b4] ret_from_kernel_thread+0x5c/0xa8
[  277.167770] INFO: rcu_sched self-detected stall on CPU
[  277.167775]  53-...: (1 GPs behind) idle=7af/140000000000002/0 softirq=4500/4501 fqs=1832 
[  277.167776]   (t=2100 jiffies g=-194 c=-195 q=452)
[  277.167802] Task dump for CPU 53:
[  277.167804] kopald          R  running task        0   854      2 0x00000804
[  277.167807] Call Trace:
[  277.167810] [c000002fc621f190] [c00000000007c6f0] sched_show_task+0x10c/0x118 (unreliable)
[  277.167814] [c000002fc621f200] [c00000000009eda8] rcu_dump_cpu_stacks+0xa0/0xd0
[  277.167817] [c000002fc621f250] [c0000000000a2428] rcu_check_callbacks+0x294/0x734
[  277.167820] [c000002fc621f370] [c0000000000a44e0] update_process_times+0x3c/0x74
[  277.167822] [c000002fc621f3a0] [c0000000000b2c74] tick_sched_handle.isra.14+0x60/0x6c
[  277.167825] [c000002fc621f3d0] [c0000000000b2ccc] tick_sched_timer+0x4c/0x84
[  277.167827] [c000002fc621f410] [c0000000000a4d84] __hrtimer_run_queues+0xd0/0x188
[  277.167830] [c000002fc621f490] [c0000000000a53f0] hrtimer_interrupt+0xb0/0x1fc
[  277.167833] [c000002fc621f550] [c0000000000162c0] __timer_interrupt+0x7c/0xe0
[  277.167835] [c000002fc621f590] [c000000000016468] timer_interrupt+0xac/0xc4
[  277.167838] [c000002fc621f5c0] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  277.167842] --- interrupt: 901 at arch_local_irq_restore+0x5c/0x80
[  277.167842]     LR = arch_local_irq_restore+0x5c/0x80
[  277.167846] [c000002fc621f8b0] [c00000000007e7e0] vtime_common_account_irq_enter+0x48/0x5c (unreliable)
[  277.167849] [c000002fc621f8d0] [c000000000059534] __do_softirq+0xe0/0x2b4
[  277.167852] [c000002fc621f9c0] [c0000000000599a0] irq_exit+0x74/0xcc
[  277.167854] [c000002fc621f9e0] [c00000000001646c] timer_interrupt+0xb0/0xc4
[  277.167857] [c000002fc621fa10] [c00000000000995c] restore_check_irq_replay+0x54/0x70
[  277.167860] --- interrupt: 901 at irq_work_queue+0xa0/0xa4
[  277.167860]     LR = irq_work_queue+0x78/0xa4
[  277.167864] [c000002fc621fd00] [c00000000003f528] opal_handle_events+0x5c/0xb8 (unreliable)
[  277.167867] [c000002fc621fd50] [c00000000003b6ac] kopald+0x2c/0x64
[  277.167870] [c000002fc621fd80] [c0000000000717a0] kthread+0xfc/0x108
[  277.167873] [c000002fc621fe30] [c0000000000094b4] ret_from_kernel_thread+0x5c/0xa8

/ # 
/ # update_flash -d
Current firwmare version :
  P side    : FW860.12 (SV860_066)
  T side    : FW860.20 (SV860_067)
  Boot side : FW860.20 (SV860_067)
/ # 

/ # cat /sys/firmware/opal/msglog | grep -i skiboot
[   81.163066898,5] SkiBoot f38ed26 starting...
/ #81 
hegdevasant commented 6 years ago

Interesting. Something similar to other bug we saw internally. kopald makes opal call to get outstanding events and then calls our handler in kernel.

-Vasant

hegdevasant commented 6 years ago

Can you please retry with upstream kernel?

-Vasant

npiggin commented 6 years ago

I proposed a patch to process opal events directly by kopald, and to enable and disable interrupts and also cond_resched between processing each event. It might improve this situation.

https://marc.info/?l=linuxppc-embedded&m=150902327030882&w=2

pridhiviraj commented 6 years ago

@hegdevasant Reproduced with both official build and upstream kernel.

<4>[    0.000000] Could not find start_pfn for node 17
<4>[    1.918993] This architecture does not have kernel memory protection.
<4>[    1.997483] Emulex LightPulse Fibre Channel SCSI driver 11.4.0.1
<4>[    1.997487] Copyright (C) 2017 Broadcom. All Rights Reserved. The term "Broadcom" refers to Broadcom Limited and/or its subsidiaries.
<3>[    2.001923] EEH: Frozen PE#fd on PHB#1 detected
<3>[    2.001929] EEH: PE location: U78CB.001.WZS0030-P1-C14, PHB location: N/A
<4>[    2.002752] EEH: This PCI device has failed 1 times in the last hour
<4>[    2.002793] EEH: of node=0001:04:00.0
<4>[    2.002796] EEH: PCI device/vendor: 034a1014
<4>[    2.002799] EEH: PCI cmd/status register: 00100142
<4>[    2.002800] EEH: PCI-E capabilities and status follow:
<4>[    2.002853] EEH: PCI-E 00: 0002a410 10008fc2 0000283e 0003f503 
<4>[    2.002862] EEH: PCI-E 10: 10830000 00000000 00000000 00000000 
<4>[    2.002863] EEH: PCI-E 20: 00000000 
<4>[    2.002864] EEH: PCI-E AER capability register set follows:
<4>[    2.002875] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030 
<4>[    2.002885] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000 
<4>[    2.002894] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
<4>[    2.002897] EEH: PCI-E AER 30: 00000000 00000000 
<3>[    2.211328] ELOG: OPAL log info read failed
<3>[    2.348038] lpfc 0004:01:00.0: 0:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[    2.348043] lpfc 0004:01:00.0: 0:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[    4.578036] lpfc 0004:01:00.1: 1:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[    4.578041] lpfc 0004:01:00.1: 1:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[   10.254238] lpfc 0005:0b:00.0: 2:1303 Link Up Event x1 received Data: x1 x1 x10 x0 x0 x0 0
<3>[   13.159540] lpfc 0005:0b:00.1: 3:1303 Link Up Event x1 received Data: x1 x1 x10 x0 x0 x0 0
<3>[   26.918024] INFO: rcu_sched self-detected stall on CPU
<3>[   26.918033]   35-...: (2 GPs behind) idle=1a2/140000000000002/0 softirq=56/58 fqs=900 
<3>[   26.918033]    (t=2100 jiffies g=-265 c=-266 q=730)
<1>[   26.918048] Call Trace:
<1>[   26.918053] [c000000ff4a0b1c0] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[   26.918057] [c000000ff4a0b230] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[   26.918060] [c000000ff4a0b280] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[   26.918062] [c000000ff4a0b3b0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[   26.918065] [c000000ff4a0b3e0] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[   26.918068] [c000000ff4a0b400] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[   26.918069] [c000000ff4a0b440] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[   26.918071] [c000000ff4a0b4c0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[   26.918074] [c000000ff4a0b580] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[   26.918076] [c000000ff4a0b5c0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[   26.918079] [c000000ff4a0b5f0] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[   26.918082] --- interrupt: 901 at .L144+0x0/0x4
<1>[   26.918082]     LR = arch_local_irq_restore+0x5c/0x80
<1>[   26.918084] [c000000ff4a0b8e0] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[   26.918086] [c000000ff4a0b900] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[   26.918089] [c000000ff4a0b9f0] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[   26.918091] [c000000ff4a0ba10] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[   26.918093] [c000000ff4a0ba40] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[   26.918097] --- interrupt: 901 at _restgpr0_31+0x0/0xba38
<1>[   26.918097]     LR = irq_work_queue+0x74/0xa4
<1>[   26.918099] [c000000ff4a0bd30] [c00000000004b204] opal_handle_events+0x54/0xa8 (unreliable)
<1>[   26.918101] [c000000ff4a0bd80] [c00000000004755c] kopald+0x44/0x70
<1>[   26.918103] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[   26.918106] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<1>[   90.728029] .
<1>[   90.728073] ready
<1>[   90.808024] .
<1>[   90.808025] .
<1>[   90.808027] .
<1>[   90.808029] .
<1>[   90.808105] ready
<1>[   90.808110] ready
<1>[   90.808117] ready
<1>[   90.808128] ready
<1>[   90.888025] .
<1>[   90.888025] .
<1>[   90.888026] .
<1>[   90.888034] .
<1>[   90.888102] ready
<1>[   90.888108] ready
<1>[   90.888119] ready
<1>[   90.888189] ready
<3>[   90.928141] sd 4:2:0:0: [sdg] Assuming drive cache: write through
<1>[   90.968025] .
<1>[   90.968039] .
<1>[   90.968099] ready
<1>[   90.968149] ready
<3>[   91.008129] sd 4:2:1:0: [sdh] Assuming drive cache: write through
<3>[   91.008138] sd 1:2:0:0: [sda] Assuming drive cache: write through
<3>[   91.008145] sd 4:2:2:0: [sdi] Assuming drive cache: write through
<3>[   91.008219] sd 1:2:1:0: [sdb] Assuming drive cache: write through
<3>[   91.088145] sd 4:2:4:0: [sdk] Assuming drive cache: write through
<3>[   91.088152] sd 4:2:3:0: [sdj] Assuming drive cache: write through
<3>[   91.088156] sd 1:2:2:0: [sdc] Assuming drive cache: write through
<3>[   91.088160] sd 1:2:3:0: [sdd] Assuming drive cache: write through
<3>[   91.178159] sd 1:2:4:0: [sde] Assuming drive cache: write through
<3>[   91.178162] sd 1:2:5:0: [sdf] Assuming drive cache: write through
<3>[   93.802071] be2net 0044:01:00.0: opcode 36-1 failed:status 68-0
<3>[   98.362057] be2net 0044:01:00.1: opcode 36-1 failed:status 68-0
<3>[   98.828028] lpfc 0044:01:00.4: 4:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[   98.828033] lpfc 0044:01:00.4: 4:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[  101.698029] lpfc 0044:01:00.5: 5:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[  101.698034] lpfc 0044:01:00.5: 5:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[  103.838062] print_req_error: I/O error, dev sdl, sector 0
<3>[  103.838065] Buffer I/O error on dev sdl, logical block 0, async page read
<3>[  116.318552] print_req_error: I/O error, dev sdl, sector 0
<3>[  116.318555] Buffer I/O error on dev sdl, logical block 0, async page read
<3>[  118.398071] print_req_error: I/O error, dev sdl, sector 0
<3>[  118.914736] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  119.431405] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  119.948084] print_req_error: I/O error, dev sdl, sector 0
<3>[  120.464772] print_req_error: I/O error, dev sdl, sector 0
<3>[  120.981446] print_req_error: I/O error, dev sdl, sector 0
<3>[  121.533444] print_req_error: I/O error, dev sdl, sector 0
<3>[  122.114796] print_req_error: I/O error, dev sdl, sector 0
<3>[  122.631448] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  123.138841] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  123.648150] print_req_error: I/O error, dev sdl, sector 0
<3>[  124.164808] print_req_error: I/O error, dev sdl, sector 0
<3>[  124.681491] print_req_error: I/O error, dev sdl, sector 0
<3>[  125.231501] print_req_error: I/O error, dev sdl, sector 0
<3>[  125.814890] print_req_error: I/O error, dev sdl, sector 0
<3>[  126.331732] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  126.848208] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  127.364870] print_req_error: I/O error, dev sdl, sector 0
<3>[  127.881553] print_req_error: I/O error, dev sdl, sector 0
<3>[  128.398233] print_req_error: I/O error, dev sdl, sector 0
<3>[  128.948240] print_req_error: I/O error, dev sdl, sector 0
<3>[  129.531646] print_req_error: I/O error, dev sdl, sector 0
<3>[  130.048262] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  130.555625] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  131.064950] print_req_error: I/O error, dev sdl, sector 0
<3>[  131.581625] print_req_error: I/O error, dev sdl, sector 0
<3>[  132.098328] print_req_error: I/O error, dev sdl, sector 0
<3>[  132.648309] print_req_error: I/O error, dev sdl, sector 0
<3>[  133.231650] print_req_error: I/O error, dev sdl, sector 0
<3>[  133.748325] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  134.265002] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  134.781690] print_req_error: I/O error, dev sdl, sector 0
<3>[  135.298348] print_req_error: I/O error, dev sdl, sector 0
<3>[  135.815028] print_req_error: I/O error, dev sdl, sector 0
<3>[  136.365049] print_req_error: I/O error, dev sdl, sector 0
<3>[  137.015082] print_req_error: I/O error, dev sdl, sector 0
<3>[  137.531789] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  138.048408] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  138.565080] print_req_error: I/O error, dev sdl, sector 0
<3>[  139.081755] print_req_error: I/O error, dev sdl, sector 0
<3>[  139.418023] INFO: rcu_sched self-detected stall on CPU
<3>[  139.418034]   36-...: (1 GPs behind) idle=db6/140000000000002/0 softirq=92/92 fqs=1049 
<3>[  139.418036]    (t=2100 jiffies g=-238 c=-239 q=8292)
<1>[  139.418056] Call Trace:
<1>[  139.418061] [c000000ff4a0b210] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[  139.418066] [c000000ff4a0b280] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[  139.418069] [c000000ff4a0b2d0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[  139.418071] [c000000ff4a0b400] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[  139.418075] [c000000ff4a0b430] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[  139.418077] [c000000ff4a0b450] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[  139.418080] [c000000ff4a0b490] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[  139.418081] [c000000ff4a0b510] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[  139.418084] [c000000ff4a0b5d0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[  139.418086] [c000000ff4a0b610] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[  139.418089] [c000000ff4a0b640] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  139.418092] --- interrupt: 901 at .L144+0x0/0x4
<1>[  139.418092]     LR = arch_local_irq_restore+0x5c/0x80
<1>[  139.418094] [c000000ff4a0b930] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[  139.418097] [c000000ff4a0b950] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[  139.418099] [c000000ff4a0ba40] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[  139.418101] [c000000ff4a0ba60] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[  139.418103] [c000000ff4a0ba90] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  139.418107] --- interrupt: 901 at schedule_timeout_interruptible+0x10/0x18
<1>[  139.418107]     LR = kopald+0x50/0x70
<1>[  139.418109] [c000000ff4a0bd80] [c00000000004755c] kopald+0x44/0x70 (unreliable)
<1>[  139.418111] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[  139.418113] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[  139.598447] print_req_error: I/O error, dev sdl, sector 0
<3>[  140.148456] print_req_error: I/O error, dev sdl, sector 0
<3>[  149.365309] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  149.881955] print_req_error: I/O error, dev sdl, sector 41942912
<3>[  149.881957] Buffer I/O error on dev sdl, logical block 327679, async page read
<3>[  174.032430] print_req_error: I/O error, dev sdn, sector 0
<3>[  174.032433] Buffer I/O error on dev sdn, logical block 0, async page read
<3>[  178.558233] print_req_error: I/O error, dev sdn, sector 0
<3>[  178.558235] Buffer I/O error on dev sdn, logical block 0, async page read
<3>[  183.038356] print_req_error: I/O error, dev sdn, sector 41942912
<3>[  191.358282] print_req_error: I/O error, dev sdn, sector 41942912
<3>[  191.358284] Buffer I/O error on dev sdn, logical block 327679, async page read
<3>[  212.368021] INFO: rcu_sched self-detected stall on CPU
<3>[  212.368026]   38-...: (1 GPs behind) idle=79a/140000000000002/0 softirq=797/797 fqs=989 
<3>[  212.368026]    (t=2100 jiffies g=-220 c=-221 q=949)
<1>[  212.368038] Call Trace:
<1>[  212.368042] [c000000ff4a0b0e0] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[  212.368046] [c000000ff4a0b150] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[  212.368048] [c000000ff4a0b1a0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[  212.368050] [c000000ff4a0b2d0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[  212.368052] [c000000ff4a0b300] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[  212.368054] [c000000ff4a0b320] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[  212.368056] [c000000ff4a0b360] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[  212.368058] [c000000ff4a0b3e0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[  212.368061] [c000000ff4a0b4a0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[  212.368063] [c000000ff4a0b4e0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[  212.368065] [c000000ff4a0b510] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  212.368068] --- interrupt: 901 at .L144+0x0/0x4
<1>[  212.368068]     LR = arch_local_irq_restore+0x5c/0x80
<1>[  212.368070] [c000000ff4a0b800] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[  212.368073] [c000000ff4a0b820] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[  212.368075] [c000000ff4a0b910] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[  212.368077] [c000000ff4a0b930] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[  212.368079] [c000000ff4a0b960] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  212.368082] --- interrupt: 901 at lock_timer_base+0x44/0xa4
<1>[  212.368082]     LR = lock_timer_base+0x10/0xa4
<1>[  212.368083] [c000000ff4a0bc50] [c000000000879a80] 0xc000000000879a80 (unreliable)
<1>[  212.368086] [c000000ff4a0bca0] [c0000000005f33b8] schedule_timeout+0xc0/0x278
<1>[  212.368088] [c000000ff4a0bd80] [c000000000047568] kopald+0x50/0x70
<1>[  212.368090] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[  212.368092] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[  233.608027] INFO: rcu_sched self-detected stall on CPU
<3>[  233.608031]   38-...: (1 GPs behind) idle=7a6/140000000000002/0 softirq=798/799 fqs=840 
<3>[  233.608032]    (t=2100 jiffies g=-219 c=-220 q=998)
<1>[  233.608041] Call Trace:
<1>[  233.608044] [c000000ff4a0b210] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[  233.608047] [c000000ff4a0b280] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[  233.608049] [c000000ff4a0b2d0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[  233.608051] [c000000ff4a0b400] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[  233.608053] [c000000ff4a0b430] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[  233.608055] [c000000ff4a0b450] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[  233.608057] [c000000ff4a0b490] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[  233.608059] [c000000ff4a0b510] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[  233.608061] [c000000ff4a0b5d0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[  233.608063] [c000000ff4a0b610] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[  233.608065] [c000000ff4a0b640] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  233.608068] --- interrupt: 901 at .L144+0x0/0x4
<1>[  233.608068]     LR = arch_local_irq_restore+0x5c/0x80
<1>[  233.608070] [c000000ff4a0b930] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[  233.608073] [c000000ff4a0b950] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[  233.608075] [c000000ff4a0ba40] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[  233.608077] [c000000ff4a0ba60] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[  233.608079] [c000000ff4a0ba90] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  233.608082] --- interrupt: 901 at schedule_timeout+0x1c/0x278
<1>[  233.608082]     LR = schedule_timeout+0x1c/0x278
<1>[  233.608083] [c000000ff4a0bd80] [c000000000047568] kopald+0x50/0x70 (unreliable)
<1>[  233.608086] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[  233.608088] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[  244.283717] print_req_error: I/O error, dev sdp, sector 0
<3>[  244.283719] Buffer I/O error on dev sdp, logical block 0, async page read
<3>[  244.800392] print_req_error: I/O error, dev sdp, sector 0
<3>[  244.800394] Buffer I/O error on dev sdp, logical block 0, async page read
<3>[  245.317070] print_req_error: I/O error, dev sdp, sector 41942912
<3>[  245.833807] print_req_error: I/O error, dev sdp, sector 41942912
<3>[  245.833808] Buffer I/O error on dev sdp, logical block 327679, async page read
<3>[  264.798018] INFO: rcu_sched detected stalls on CPUs/tasks:
<3>[  264.798027]   (detected by 11, t=2102 jiffies, g=-215, c=-216, q=969)
<3>[  264.798040] All QSes seen, last rcu_sched kthread activity 218 (4294963776-4294963558), jiffies_till_next_fqs=1, root ->qsmask 0x0
<1>[  264.798046] Call Trace:
<1>[  264.798051] [c000000ffa25f700] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[  264.798056] [c000000ffa25f770] [c0000000000bfe38] rcu_check_callbacks+0x6bc/0x788
<1>[  264.798059] [c000000ffa25f8a0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[  264.798063] [c000000ffa25f8d0] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[  264.798067] [c000000ffa25f8f0] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[  264.798070] [c000000ffa25f930] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[  264.798073] [c000000ffa25f9b0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[  264.798077] [c000000ffa25fa70] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[  264.798080] [c000000ffa25fab0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[  264.798084] [c000000ffa25fae0] [c000000000009008] decrementer_common+0x128/0x130
<1>[  264.798089] --- interrupt: 901 at .L144+0x0/0x4
<1>[  264.798089]     LR = arch_local_irq_restore+0x5c/0x80
<1>[  264.798090] [c000000ffa25fdd0] [0000000000000000]           (null) (unreliable)
<1>[  264.798095] [c000000ffa25fdf0] [c00000000050a53c] cpuidle_enter_state+0x1b0/0x214
<1>[  264.798098] [c000000ffa25fe40] [c0000000000a2c50] call_cpuidle+0x6c/0x74
<1>[  264.798101] [c000000ffa25fe60] [c0000000000a2ef8] do_idle+0x1f0/0x204
<1>[  264.798105] [c000000ffa25fec0] [c0000000000a3098] cpu_startup_entry+0x30/0x34
<1>[  264.798109] [c000000ffa25fef0] [c00000000002a3e8] start_secondary+0x4cc/0x4e4
<1>[  264.798112] [c000000ffa25ff90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
<3>[  264.798116] rcu_sched kthread starved for 218 jiffies! g18446744073709551401 c18446744073709551400 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
<1>[  264.798121] Call Trace:
<1>[  264.798123] [c000000ffa2278e0] [0000002400000000] 0x2400000000 (unreliable)
<1>[  264.798127] [c000000ffa227ab0] [c000000000013a18] __switch_to+0x188/0x210
<1>[  264.798130] [c000000ffa227b10] [c0000000005efab4] __schedule+0x464/0x840
<1>[  264.798133] [c000000ffa227be0] [c0000000005eff14] schedule+0x84/0x98
<1>[  264.798136] [c000000ffa227c10] [c0000000005f3500] schedule_timeout+0x208/0x278
<1>[  264.798139] [c000000ffa227cf0] [c0000000000be548] rcu_gp_kthread+0x824/0x830
<1>[  264.798142] [c000000ffa227dc0] [c0000000000880b4] kthread+0x150/0x158
<1>[  264.798146] [c000000ffa227e30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[  289.808023] INFO: rcu_sched self-detected stall on CPU
<3>[  289.808032]   39-...: (1 GPs behind) idle=cb6/140000000000002/0 softirq=3050/3051 fqs=1030 
<3>[  289.808033]    (t=2100 jiffies g=-214 c=-215 q=1359)
<1>[  289.808049] Call Trace:
<1>[  289.808053] [c000000ff4a0b130] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[  289.808056] [c000000ff4a0b1a0] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[  289.808058] [c000000ff4a0b1f0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[  289.808060] [c000000ff4a0b320] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[  289.808063] [c000000ff4a0b350] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[  289.808065] [c000000ff4a0b370] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[  289.808067] [c000000ff4a0b3b0] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[  289.808069] [c000000ff4a0b430] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[  289.808071] [c000000ff4a0b4f0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[  289.808073] [c000000ff4a0b530] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[  289.808075] [c000000ff4a0b560] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  289.808079] --- interrupt: 901 at .L144+0x0/0x4
<1>[  289.808079]     LR = arch_local_irq_restore+0x5c/0x80
<1>[  289.808080] [c000000ff4a0b850] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[  289.808083] [c000000ff4a0b870] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[  289.808085] [c000000ff4a0b960] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[  289.808087] [c000000ff4a0b980] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[  289.808089] [c000000ff4a0b9b0] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  289.808093] --- interrupt: 901 at schedule_timeout+0x88/0x278
<1>[  289.808093]     LR = schedule_timeout+0x1c/0x278
<1>[  289.808094] [c000000ff4a0bca0] [c0000000005f350c] schedule_timeout+0x214/0x278 (unreliable)
<1>[  289.808097] [c000000ff4a0bd80] [c000000000047568] kopald+0x50/0x70
<1>[  289.808099] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[  289.808101] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[  335.728022] INFO: rcu_sched self-detected stall on CPU
<3>[  335.728027]   24-...: (1 GPs behind) idle=c0e/140000000000002/0 softirq=461/461 fqs=1049 
<3>[  335.728027]    (t=2100 jiffies g=-211 c=-212 q=516)
<1>[  335.728044] Call Trace:
<1>[  335.728047] [c000000ff4a0b1c0] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[  335.728050] [c000000ff4a0b230] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[  335.728053] [c000000ff4a0b280] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[  335.728055] [c000000ff4a0b3b0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[  335.728057] [c000000ff4a0b3e0] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[  335.728059] [c000000ff4a0b400] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[  335.728061] [c000000ff4a0b440] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[  335.728063] [c000000ff4a0b4c0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[  335.728065] [c000000ff4a0b580] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[  335.728067] [c000000ff4a0b5c0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[  335.728069] [c000000ff4a0b5f0] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  335.728073] --- interrupt: 901 at .L144+0x0/0x4
<1>[  335.728073]     LR = arch_local_irq_restore+0x5c/0x80
<1>[  335.728074] [c000000ff4a0b8e0] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[  335.728077] [c000000ff4a0b900] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[  335.728079] [c000000ff4a0b9f0] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[  335.728081] [c000000ff4a0ba10] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[  335.728083] [c000000ff4a0ba40] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[  335.728087] --- interrupt: 901 at _restgpr0_31+0x0/0xba38
<1>[  335.728087]     LR = irq_work_queue+0x74/0xa4
<1>[  335.728089] [c000000ff4a0bd30] [c00000000004b204] opal_handle_events+0x54/0xa8 (unreliable)
<1>[  335.728091] [c000000ff4a0bd80] [c00000000004755c] kopald+0x44/0x70
<1>[  335.728093] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[  335.728096] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
/ # 
pridhiviraj commented 6 years ago

@npiggin Thanks, Have you done the tests mention in this issue against your patches?

npiggin commented 6 years ago

No I haven't had a chance to yet, I've been caught up doing P9 work at the moment.

npiggin commented 6 years ago

I brought the patch up to date and resubmitted it, but... I don't think it solves the root cause of the problem, there must be a long running OPAL event that triggers it AFAIKS. Maybe EEH?