Open ghost opened 6 years ago
Interesting. Something similar to other bug we saw internally. kopald makes opal call to get outstanding events and then calls our handler in kernel.
-Vasant
Can you please retry with upstream kernel?
-Vasant
I proposed a patch to process opal events directly by kopald, and to enable and disable interrupts and also cond_resched between processing each event. It might improve this situation.
https://marc.info/?l=linuxppc-embedded&m=150902327030882&w=2
@hegdevasant Reproduced with both official build and upstream kernel.
<4>[ 0.000000] Could not find start_pfn for node 17
<4>[ 1.918993] This architecture does not have kernel memory protection.
<4>[ 1.997483] Emulex LightPulse Fibre Channel SCSI driver 11.4.0.1
<4>[ 1.997487] Copyright (C) 2017 Broadcom. All Rights Reserved. The term "Broadcom" refers to Broadcom Limited and/or its subsidiaries.
<3>[ 2.001923] EEH: Frozen PE#fd on PHB#1 detected
<3>[ 2.001929] EEH: PE location: U78CB.001.WZS0030-P1-C14, PHB location: N/A
<4>[ 2.002752] EEH: This PCI device has failed 1 times in the last hour
<4>[ 2.002793] EEH: of node=0001:04:00.0
<4>[ 2.002796] EEH: PCI device/vendor: 034a1014
<4>[ 2.002799] EEH: PCI cmd/status register: 00100142
<4>[ 2.002800] EEH: PCI-E capabilities and status follow:
<4>[ 2.002853] EEH: PCI-E 00: 0002a410 10008fc2 0000283e 0003f503
<4>[ 2.002862] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
<4>[ 2.002863] EEH: PCI-E 20: 00000000
<4>[ 2.002864] EEH: PCI-E AER capability register set follows:
<4>[ 2.002875] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
<4>[ 2.002885] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
<4>[ 2.002894] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
<4>[ 2.002897] EEH: PCI-E AER 30: 00000000 00000000
<3>[ 2.211328] ELOG: OPAL log info read failed
<3>[ 2.348038] lpfc 0004:01:00.0: 0:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[ 2.348043] lpfc 0004:01:00.0: 0:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[ 4.578036] lpfc 0004:01:00.1: 1:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[ 4.578041] lpfc 0004:01:00.1: 1:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[ 10.254238] lpfc 0005:0b:00.0: 2:1303 Link Up Event x1 received Data: x1 x1 x10 x0 x0 x0 0
<3>[ 13.159540] lpfc 0005:0b:00.1: 3:1303 Link Up Event x1 received Data: x1 x1 x10 x0 x0 x0 0
<3>[ 26.918024] INFO: rcu_sched self-detected stall on CPU
<3>[ 26.918033] 35-...: (2 GPs behind) idle=1a2/140000000000002/0 softirq=56/58 fqs=900
<3>[ 26.918033] (t=2100 jiffies g=-265 c=-266 q=730)
<1>[ 26.918048] Call Trace:
<1>[ 26.918053] [c000000ff4a0b1c0] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[ 26.918057] [c000000ff4a0b230] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[ 26.918060] [c000000ff4a0b280] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[ 26.918062] [c000000ff4a0b3b0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[ 26.918065] [c000000ff4a0b3e0] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[ 26.918068] [c000000ff4a0b400] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[ 26.918069] [c000000ff4a0b440] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[ 26.918071] [c000000ff4a0b4c0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[ 26.918074] [c000000ff4a0b580] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[ 26.918076] [c000000ff4a0b5c0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[ 26.918079] [c000000ff4a0b5f0] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 26.918082] --- interrupt: 901 at .L144+0x0/0x4
<1>[ 26.918082] LR = arch_local_irq_restore+0x5c/0x80
<1>[ 26.918084] [c000000ff4a0b8e0] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[ 26.918086] [c000000ff4a0b900] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[ 26.918089] [c000000ff4a0b9f0] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[ 26.918091] [c000000ff4a0ba10] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[ 26.918093] [c000000ff4a0ba40] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 26.918097] --- interrupt: 901 at _restgpr0_31+0x0/0xba38
<1>[ 26.918097] LR = irq_work_queue+0x74/0xa4
<1>[ 26.918099] [c000000ff4a0bd30] [c00000000004b204] opal_handle_events+0x54/0xa8 (unreliable)
<1>[ 26.918101] [c000000ff4a0bd80] [c00000000004755c] kopald+0x44/0x70
<1>[ 26.918103] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[ 26.918106] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<1>[ 90.728029] .
<1>[ 90.728073] ready
<1>[ 90.808024] .
<1>[ 90.808025] .
<1>[ 90.808027] .
<1>[ 90.808029] .
<1>[ 90.808105] ready
<1>[ 90.808110] ready
<1>[ 90.808117] ready
<1>[ 90.808128] ready
<1>[ 90.888025] .
<1>[ 90.888025] .
<1>[ 90.888026] .
<1>[ 90.888034] .
<1>[ 90.888102] ready
<1>[ 90.888108] ready
<1>[ 90.888119] ready
<1>[ 90.888189] ready
<3>[ 90.928141] sd 4:2:0:0: [sdg] Assuming drive cache: write through
<1>[ 90.968025] .
<1>[ 90.968039] .
<1>[ 90.968099] ready
<1>[ 90.968149] ready
<3>[ 91.008129] sd 4:2:1:0: [sdh] Assuming drive cache: write through
<3>[ 91.008138] sd 1:2:0:0: [sda] Assuming drive cache: write through
<3>[ 91.008145] sd 4:2:2:0: [sdi] Assuming drive cache: write through
<3>[ 91.008219] sd 1:2:1:0: [sdb] Assuming drive cache: write through
<3>[ 91.088145] sd 4:2:4:0: [sdk] Assuming drive cache: write through
<3>[ 91.088152] sd 4:2:3:0: [sdj] Assuming drive cache: write through
<3>[ 91.088156] sd 1:2:2:0: [sdc] Assuming drive cache: write through
<3>[ 91.088160] sd 1:2:3:0: [sdd] Assuming drive cache: write through
<3>[ 91.178159] sd 1:2:4:0: [sde] Assuming drive cache: write through
<3>[ 91.178162] sd 1:2:5:0: [sdf] Assuming drive cache: write through
<3>[ 93.802071] be2net 0044:01:00.0: opcode 36-1 failed:status 68-0
<3>[ 98.362057] be2net 0044:01:00.1: opcode 36-1 failed:status 68-0
<3>[ 98.828028] lpfc 0044:01:00.4: 4:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[ 98.828033] lpfc 0044:01:00.4: 4:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[ 101.698029] lpfc 0044:01:00.5: 5:6101 Disabling NVME support: Not supported by firmware: 0 0
<3>[ 101.698034] lpfc 0044:01:00.5: 5:2574 IO channels: irqs 4 fcp 4 nvme 0 MRQ: 0
<3>[ 103.838062] print_req_error: I/O error, dev sdl, sector 0
<3>[ 103.838065] Buffer I/O error on dev sdl, logical block 0, async page read
<3>[ 116.318552] print_req_error: I/O error, dev sdl, sector 0
<3>[ 116.318555] Buffer I/O error on dev sdl, logical block 0, async page read
<3>[ 118.398071] print_req_error: I/O error, dev sdl, sector 0
<3>[ 118.914736] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 119.431405] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 119.948084] print_req_error: I/O error, dev sdl, sector 0
<3>[ 120.464772] print_req_error: I/O error, dev sdl, sector 0
<3>[ 120.981446] print_req_error: I/O error, dev sdl, sector 0
<3>[ 121.533444] print_req_error: I/O error, dev sdl, sector 0
<3>[ 122.114796] print_req_error: I/O error, dev sdl, sector 0
<3>[ 122.631448] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 123.138841] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 123.648150] print_req_error: I/O error, dev sdl, sector 0
<3>[ 124.164808] print_req_error: I/O error, dev sdl, sector 0
<3>[ 124.681491] print_req_error: I/O error, dev sdl, sector 0
<3>[ 125.231501] print_req_error: I/O error, dev sdl, sector 0
<3>[ 125.814890] print_req_error: I/O error, dev sdl, sector 0
<3>[ 126.331732] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 126.848208] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 127.364870] print_req_error: I/O error, dev sdl, sector 0
<3>[ 127.881553] print_req_error: I/O error, dev sdl, sector 0
<3>[ 128.398233] print_req_error: I/O error, dev sdl, sector 0
<3>[ 128.948240] print_req_error: I/O error, dev sdl, sector 0
<3>[ 129.531646] print_req_error: I/O error, dev sdl, sector 0
<3>[ 130.048262] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 130.555625] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 131.064950] print_req_error: I/O error, dev sdl, sector 0
<3>[ 131.581625] print_req_error: I/O error, dev sdl, sector 0
<3>[ 132.098328] print_req_error: I/O error, dev sdl, sector 0
<3>[ 132.648309] print_req_error: I/O error, dev sdl, sector 0
<3>[ 133.231650] print_req_error: I/O error, dev sdl, sector 0
<3>[ 133.748325] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 134.265002] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 134.781690] print_req_error: I/O error, dev sdl, sector 0
<3>[ 135.298348] print_req_error: I/O error, dev sdl, sector 0
<3>[ 135.815028] print_req_error: I/O error, dev sdl, sector 0
<3>[ 136.365049] print_req_error: I/O error, dev sdl, sector 0
<3>[ 137.015082] print_req_error: I/O error, dev sdl, sector 0
<3>[ 137.531789] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 138.048408] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 138.565080] print_req_error: I/O error, dev sdl, sector 0
<3>[ 139.081755] print_req_error: I/O error, dev sdl, sector 0
<3>[ 139.418023] INFO: rcu_sched self-detected stall on CPU
<3>[ 139.418034] 36-...: (1 GPs behind) idle=db6/140000000000002/0 softirq=92/92 fqs=1049
<3>[ 139.418036] (t=2100 jiffies g=-238 c=-239 q=8292)
<1>[ 139.418056] Call Trace:
<1>[ 139.418061] [c000000ff4a0b210] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[ 139.418066] [c000000ff4a0b280] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[ 139.418069] [c000000ff4a0b2d0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[ 139.418071] [c000000ff4a0b400] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[ 139.418075] [c000000ff4a0b430] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[ 139.418077] [c000000ff4a0b450] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[ 139.418080] [c000000ff4a0b490] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[ 139.418081] [c000000ff4a0b510] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[ 139.418084] [c000000ff4a0b5d0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[ 139.418086] [c000000ff4a0b610] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[ 139.418089] [c000000ff4a0b640] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 139.418092] --- interrupt: 901 at .L144+0x0/0x4
<1>[ 139.418092] LR = arch_local_irq_restore+0x5c/0x80
<1>[ 139.418094] [c000000ff4a0b930] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[ 139.418097] [c000000ff4a0b950] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[ 139.418099] [c000000ff4a0ba40] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[ 139.418101] [c000000ff4a0ba60] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[ 139.418103] [c000000ff4a0ba90] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 139.418107] --- interrupt: 901 at schedule_timeout_interruptible+0x10/0x18
<1>[ 139.418107] LR = kopald+0x50/0x70
<1>[ 139.418109] [c000000ff4a0bd80] [c00000000004755c] kopald+0x44/0x70 (unreliable)
<1>[ 139.418111] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[ 139.418113] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[ 139.598447] print_req_error: I/O error, dev sdl, sector 0
<3>[ 140.148456] print_req_error: I/O error, dev sdl, sector 0
<3>[ 149.365309] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 149.881955] print_req_error: I/O error, dev sdl, sector 41942912
<3>[ 149.881957] Buffer I/O error on dev sdl, logical block 327679, async page read
<3>[ 174.032430] print_req_error: I/O error, dev sdn, sector 0
<3>[ 174.032433] Buffer I/O error on dev sdn, logical block 0, async page read
<3>[ 178.558233] print_req_error: I/O error, dev sdn, sector 0
<3>[ 178.558235] Buffer I/O error on dev sdn, logical block 0, async page read
<3>[ 183.038356] print_req_error: I/O error, dev sdn, sector 41942912
<3>[ 191.358282] print_req_error: I/O error, dev sdn, sector 41942912
<3>[ 191.358284] Buffer I/O error on dev sdn, logical block 327679, async page read
<3>[ 212.368021] INFO: rcu_sched self-detected stall on CPU
<3>[ 212.368026] 38-...: (1 GPs behind) idle=79a/140000000000002/0 softirq=797/797 fqs=989
<3>[ 212.368026] (t=2100 jiffies g=-220 c=-221 q=949)
<1>[ 212.368038] Call Trace:
<1>[ 212.368042] [c000000ff4a0b0e0] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[ 212.368046] [c000000ff4a0b150] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[ 212.368048] [c000000ff4a0b1a0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[ 212.368050] [c000000ff4a0b2d0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[ 212.368052] [c000000ff4a0b300] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[ 212.368054] [c000000ff4a0b320] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[ 212.368056] [c000000ff4a0b360] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[ 212.368058] [c000000ff4a0b3e0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[ 212.368061] [c000000ff4a0b4a0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[ 212.368063] [c000000ff4a0b4e0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[ 212.368065] [c000000ff4a0b510] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 212.368068] --- interrupt: 901 at .L144+0x0/0x4
<1>[ 212.368068] LR = arch_local_irq_restore+0x5c/0x80
<1>[ 212.368070] [c000000ff4a0b800] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[ 212.368073] [c000000ff4a0b820] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[ 212.368075] [c000000ff4a0b910] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[ 212.368077] [c000000ff4a0b930] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[ 212.368079] [c000000ff4a0b960] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 212.368082] --- interrupt: 901 at lock_timer_base+0x44/0xa4
<1>[ 212.368082] LR = lock_timer_base+0x10/0xa4
<1>[ 212.368083] [c000000ff4a0bc50] [c000000000879a80] 0xc000000000879a80 (unreliable)
<1>[ 212.368086] [c000000ff4a0bca0] [c0000000005f33b8] schedule_timeout+0xc0/0x278
<1>[ 212.368088] [c000000ff4a0bd80] [c000000000047568] kopald+0x50/0x70
<1>[ 212.368090] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[ 212.368092] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[ 233.608027] INFO: rcu_sched self-detected stall on CPU
<3>[ 233.608031] 38-...: (1 GPs behind) idle=7a6/140000000000002/0 softirq=798/799 fqs=840
<3>[ 233.608032] (t=2100 jiffies g=-219 c=-220 q=998)
<1>[ 233.608041] Call Trace:
<1>[ 233.608044] [c000000ff4a0b210] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[ 233.608047] [c000000ff4a0b280] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[ 233.608049] [c000000ff4a0b2d0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[ 233.608051] [c000000ff4a0b400] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[ 233.608053] [c000000ff4a0b430] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[ 233.608055] [c000000ff4a0b450] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[ 233.608057] [c000000ff4a0b490] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[ 233.608059] [c000000ff4a0b510] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[ 233.608061] [c000000ff4a0b5d0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[ 233.608063] [c000000ff4a0b610] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[ 233.608065] [c000000ff4a0b640] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 233.608068] --- interrupt: 901 at .L144+0x0/0x4
<1>[ 233.608068] LR = arch_local_irq_restore+0x5c/0x80
<1>[ 233.608070] [c000000ff4a0b930] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[ 233.608073] [c000000ff4a0b950] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[ 233.608075] [c000000ff4a0ba40] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[ 233.608077] [c000000ff4a0ba60] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[ 233.608079] [c000000ff4a0ba90] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 233.608082] --- interrupt: 901 at schedule_timeout+0x1c/0x278
<1>[ 233.608082] LR = schedule_timeout+0x1c/0x278
<1>[ 233.608083] [c000000ff4a0bd80] [c000000000047568] kopald+0x50/0x70 (unreliable)
<1>[ 233.608086] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[ 233.608088] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[ 244.283717] print_req_error: I/O error, dev sdp, sector 0
<3>[ 244.283719] Buffer I/O error on dev sdp, logical block 0, async page read
<3>[ 244.800392] print_req_error: I/O error, dev sdp, sector 0
<3>[ 244.800394] Buffer I/O error on dev sdp, logical block 0, async page read
<3>[ 245.317070] print_req_error: I/O error, dev sdp, sector 41942912
<3>[ 245.833807] print_req_error: I/O error, dev sdp, sector 41942912
<3>[ 245.833808] Buffer I/O error on dev sdp, logical block 327679, async page read
<3>[ 264.798018] INFO: rcu_sched detected stalls on CPUs/tasks:
<3>[ 264.798027] (detected by 11, t=2102 jiffies, g=-215, c=-216, q=969)
<3>[ 264.798040] All QSes seen, last rcu_sched kthread activity 218 (4294963776-4294963558), jiffies_till_next_fqs=1, root ->qsmask 0x0
<1>[ 264.798046] Call Trace:
<1>[ 264.798051] [c000000ffa25f700] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[ 264.798056] [c000000ffa25f770] [c0000000000bfe38] rcu_check_callbacks+0x6bc/0x788
<1>[ 264.798059] [c000000ffa25f8a0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[ 264.798063] [c000000ffa25f8d0] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[ 264.798067] [c000000ffa25f8f0] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[ 264.798070] [c000000ffa25f930] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[ 264.798073] [c000000ffa25f9b0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[ 264.798077] [c000000ffa25fa70] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[ 264.798080] [c000000ffa25fab0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[ 264.798084] [c000000ffa25fae0] [c000000000009008] decrementer_common+0x128/0x130
<1>[ 264.798089] --- interrupt: 901 at .L144+0x0/0x4
<1>[ 264.798089] LR = arch_local_irq_restore+0x5c/0x80
<1>[ 264.798090] [c000000ffa25fdd0] [0000000000000000] (null) (unreliable)
<1>[ 264.798095] [c000000ffa25fdf0] [c00000000050a53c] cpuidle_enter_state+0x1b0/0x214
<1>[ 264.798098] [c000000ffa25fe40] [c0000000000a2c50] call_cpuidle+0x6c/0x74
<1>[ 264.798101] [c000000ffa25fe60] [c0000000000a2ef8] do_idle+0x1f0/0x204
<1>[ 264.798105] [c000000ffa25fec0] [c0000000000a3098] cpu_startup_entry+0x30/0x34
<1>[ 264.798109] [c000000ffa25fef0] [c00000000002a3e8] start_secondary+0x4cc/0x4e4
<1>[ 264.798112] [c000000ffa25ff90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
<3>[ 264.798116] rcu_sched kthread starved for 218 jiffies! g18446744073709551401 c18446744073709551400 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
<1>[ 264.798121] Call Trace:
<1>[ 264.798123] [c000000ffa2278e0] [0000002400000000] 0x2400000000 (unreliable)
<1>[ 264.798127] [c000000ffa227ab0] [c000000000013a18] __switch_to+0x188/0x210
<1>[ 264.798130] [c000000ffa227b10] [c0000000005efab4] __schedule+0x464/0x840
<1>[ 264.798133] [c000000ffa227be0] [c0000000005eff14] schedule+0x84/0x98
<1>[ 264.798136] [c000000ffa227c10] [c0000000005f3500] schedule_timeout+0x208/0x278
<1>[ 264.798139] [c000000ffa227cf0] [c0000000000be548] rcu_gp_kthread+0x824/0x830
<1>[ 264.798142] [c000000ffa227dc0] [c0000000000880b4] kthread+0x150/0x158
<1>[ 264.798146] [c000000ffa227e30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[ 289.808023] INFO: rcu_sched self-detected stall on CPU
<3>[ 289.808032] 39-...: (1 GPs behind) idle=cb6/140000000000002/0 softirq=3050/3051 fqs=1030
<3>[ 289.808033] (t=2100 jiffies g=-214 c=-215 q=1359)
<1>[ 289.808049] Call Trace:
<1>[ 289.808053] [c000000ff4a0b130] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[ 289.808056] [c000000ff4a0b1a0] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[ 289.808058] [c000000ff4a0b1f0] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[ 289.808060] [c000000ff4a0b320] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[ 289.808063] [c000000ff4a0b350] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[ 289.808065] [c000000ff4a0b370] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[ 289.808067] [c000000ff4a0b3b0] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[ 289.808069] [c000000ff4a0b430] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[ 289.808071] [c000000ff4a0b4f0] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[ 289.808073] [c000000ff4a0b530] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[ 289.808075] [c000000ff4a0b560] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 289.808079] --- interrupt: 901 at .L144+0x0/0x4
<1>[ 289.808079] LR = arch_local_irq_restore+0x5c/0x80
<1>[ 289.808080] [c000000ff4a0b850] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[ 289.808083] [c000000ff4a0b870] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[ 289.808085] [c000000ff4a0b960] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[ 289.808087] [c000000ff4a0b980] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[ 289.808089] [c000000ff4a0b9b0] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 289.808093] --- interrupt: 901 at schedule_timeout+0x88/0x278
<1>[ 289.808093] LR = schedule_timeout+0x1c/0x278
<1>[ 289.808094] [c000000ff4a0bca0] [c0000000005f350c] schedule_timeout+0x214/0x278 (unreliable)
<1>[ 289.808097] [c000000ff4a0bd80] [c000000000047568] kopald+0x50/0x70
<1>[ 289.808099] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[ 289.808101] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
<3>[ 335.728022] INFO: rcu_sched self-detected stall on CPU
<3>[ 335.728027] 24-...: (1 GPs behind) idle=c0e/140000000000002/0 softirq=461/461 fqs=1049
<3>[ 335.728027] (t=2100 jiffies g=-211 c=-212 q=516)
<1>[ 335.728044] Call Trace:
<1>[ 335.728047] [c000000ff4a0b1c0] [c0000000000926cc] sched_show_task+0xe8/0xf4 (unreliable)
<1>[ 335.728050] [c000000ff4a0b230] [c0000000000c0500] rcu_dump_cpu_stacks+0xd0/0xe0
<1>[ 335.728053] [c000000ff4a0b280] [c0000000000bfa80] rcu_check_callbacks+0x304/0x788
<1>[ 335.728055] [c000000ff4a0b3b0] [c0000000000c3054] update_process_times+0x3c/0x74
<1>[ 335.728057] [c000000ff4a0b3e0] [c0000000000d2330] tick_sched_handle.isra.3+0x50/0x64
<1>[ 335.728059] [c000000ff4a0b400] [c0000000000d2390] tick_sched_timer+0x4c/0x8c
<1>[ 335.728061] [c000000ff4a0b440] [c0000000000c3b1c] __hrtimer_run_queues+0xe0/0x16c
<1>[ 335.728063] [c000000ff4a0b4c0] [c0000000000c3dbc] hrtimer_interrupt+0xac/0x1f0
<1>[ 335.728065] [c000000ff4a0b580] [c000000000019228] __timer_interrupt+0x7c/0xdc
<1>[ 335.728067] [c000000ff4a0b5c0] [c000000000019330] timer_interrupt+0xa8/0xc0
<1>[ 335.728069] [c000000ff4a0b5f0] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 335.728073] --- interrupt: 901 at .L144+0x0/0x4
<1>[ 335.728073] LR = arch_local_irq_restore+0x5c/0x80
<1>[ 335.728074] [c000000ff4a0b8e0] [c000000000093588] vtime_account_irq_enter+0x54/0x5c (unreliable)
<1>[ 335.728077] [c000000ff4a0b900] [c00000000006eb80] __do_softirq+0xd8/0x290
<1>[ 335.728079] [c000000ff4a0b9f0] [c00000000006efe0] irq_exit+0x88/0xe0
<1>[ 335.728081] [c000000ff4a0ba10] [c000000000019334] timer_interrupt+0xac/0xc0
<1>[ 335.728083] [c000000ff4a0ba40] [c00000000000b864] fast_exception_return+0x128/0x150
<1>[ 335.728087] --- interrupt: 901 at _restgpr0_31+0x0/0xba38
<1>[ 335.728087] LR = irq_work_queue+0x74/0xa4
<1>[ 335.728089] [c000000ff4a0bd30] [c00000000004b204] opal_handle_events+0x54/0xa8 (unreliable)
<1>[ 335.728091] [c000000ff4a0bd80] [c00000000004755c] kopald+0x44/0x70
<1>[ 335.728093] [c000000ff4a0bdc0] [c0000000000880b4] kthread+0x150/0x158
<1>[ 335.728096] [c000000ff4a0be30] [c00000000000b464] ret_from_kernel_thread+0x5c/0x78
/ #
@npiggin Thanks, Have you done the tests mention in this issue against your patches?
No I haven't had a chance to yet, I've been caught up doing P9 work at the moment.
I brought the patch up to date and resubmitted it, but... I don't think it solves the root cause of the problem, there must be a long running OPAL event that triggers it AFAIKS. Maybe EEH?
(bringing over from internal bugzilla)
Pridhiviraj Paidipeddi 2016-12-24 00:15:00 AEDT:
Install a 8284-22A PowerNV system with FW860.20 latest firmware build, and install a Ubuntu 16.10. Then below test steps caused the NMI Watchdog: Bug soft lockup issue in petitboot kernel.
Test-steps:
Boot the system to runtime(i,e into petitboot)
Enable the fast-reboot by using below command / # nvram -p ibm,skiboot --update-config experimental-fast-reset=feeling-lucky / # nvram --print-config=experimental-fast-reset -p ibm,skiboot feeling-lucky
Then do a FSP ResetReload test smgr resetReload
Once FSP comes UP, do reboot in petitboot.
After system reaching to peitiboot, check dmesg for softlockup's