open-power / op-build

Buildroot overlay for Open Power
GNU General Public License v2.0
104 stars 183 forks source link

smp_call_function_single/many core hang issue with stop4 alone. #2035

Open pridhiviraj opened 6 years ago

pridhiviraj commented 6 years ago

We have long long ago bug still existed even in latest upstream PNOR as well as in official released PNOR's. @stewart-ibm This issue occurring in both your witherspoon system which is having latest code, and also got system from vaidy with released PNOR, which is also hitting the issue.

 cat /var/log//obmc-console.log 
120
[12469.236833] [c000003fe566b730] [c000000000024d0c] timer_interrupt+0x9c/0xe0
[12469.236836] [c000003fe566b760] [c000000000009014] decrementer_common+0x114/0x120
[12469.236841] --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
[12469.236841]     LR = arch_send_call_function_ipi_mask+0x120/0x130
[12469.236843] [c000003fe566ba50] [c00000000004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
[12469.236846] [c000003fe566ba90] [c0000000001d59f0] smp_call_function_many+0x340/0x450
[12469.236849] [c000003fe566bb00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
[12469.236852] [c000003fe566bb30] [c0000000003a1120] change_huge_pmd+0xe0/0x270
[12469.236855] [c000003fe566bba0] [c000000000349278] change_protection_range+0xb88/0xe40
[12469.236858] [c000003fe566bcf0] [c0000000003496c0] mprotect_fixup+0x140/0x340
[12469.236861] [c000003fe566bdb0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
[12469.236864] [c000003fe566be30] [c00000000000b184] system_call+0x58/0x6c
[12469.236865] Instruction dump:
[12469.236867] 7ca62b78 7c852378 7fe4fb78 4bfffd4d 813f0018 71290001 4182002c 48000014 
[12469.236872] 60000000 60000000 60000000 60420000 <7c210b78> 7c421378 813f0018 71290001 
[12498.816287] systemd[1]: bumblebeed.service: Service hold-off time over, scheduling restart.
[12498.816655] systemd[1]: bumblebeed.service: Scheduled restart job, restart counter is at 207.
[12498.816928] systemd[1]: Stopped Bumblebee C Daemon.
[12498.818324] systemd[1]: Started Bumblebee C Daemon.
[12498.822139] systemd[1]: bumblebeed.service: Main process exited, code=exited, status=1/FAILURE
[12498.822368] systemd[1]: bumblebeed.service: Failed with result 'exit-code'.
[12519.317075] systemd[1]: systemd-journald.service: State 'stop-sigabrt' timed out. Terminating.
[12532.241634] INFO: rcu_sched self-detected stall on CPU
[12532.241695]  54-....: (14 GPs behind) idle=d2a/140000000000002/0 softirq=61680/61681 fqs=30873 
[12532.241803]   (t=68270 jiffies g=52859 c=52858 q=5658)
[12532.241884] Sending NMI from CPU 54 to CPUs 26:
[12532.242074] NMI backtrace for cpu 26
[12532.242076] CPU: 26 PID: 9742 Comm: tlbie_test Tainted: P           OEL   4.15.0-15-generic #16-Ubuntu
[12532.242078] NIP:  c0000000001d5a60 LR: c0000000001d5a0c CTR: c000000000044270
[12532.242079] REGS: c00000003fec3d80 TRAP: 0100   Tainted: P           OEL    (4.15.0-15-generic)
[12532.242080] MSR:  9000000000089033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44024824  XER: 00000000
[12532.242086] CFAR: c0000000001d5a70 SOFTE: 1 
[12532.242086] GPR00: c0000000001d5a0c c0000000032d3a90 c0000000016eb400 0000000000000012 
[12532.242086] GPR04: 0000000000000012 0000000000000012 0000000000000000 c00000000171dd78 
[12532.242086] GPR08: 0000000000000012 0000000000000001 c000003fef41bee0 c000003fef619800 
[12532.242086] GPR12: c000000000044270 c00000000fa91e00 000072fee4800000 000072f5ee280000 
[12532.242086] GPR16: c000003fd5abcb00 000072fee4600000 0000000000000000 000072f5eea8f240 
[12532.242086] GPR20: 4000000000000800 c00000000188e9f0 0000000000000000 0000000000000001 
[12532.242086] GPR24: c0000000011d9600 c000003fef619808 0000000000000000 c0000000000759c0 
[12532.242086] GPR28: c000000001722484 c000003fef619708 c000003fef619700 0000000000000800 
[12532.242111] NIP [c0000000001d5a60] smp_call_function_many+0x3b0/0x450
[12532.242113] LR [c0000000001d5a0c] smp_call_function_many+0x35c/0x450
[12532.242113] Call Trace:
[12532.242116] [c0000000032d3a90] [c0000000001d5a0c] smp_call_function_many+0x35c/0x450 (unreliable)
[12532.242120] [c0000000032d3b00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
[12532.242123] [c0000000032d3b30] [c0000000003a1120] change_huge_pmd+0xe0/0x270
[12532.242126] [c0000000032d3ba0] [c000000000349278] change_protection_range+0xb88/0xe40
[12532.242129] [c0000000032d3cf0] [c0000000003496c0] mprotect_fixup+0x140/0x340
[12532.242132] [c0000000032d3db0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
[12532.242135] [c0000000032d3e30] [c00000000000b184] system_call+0x58/0x6c
[12532.242136] Instruction dump:
[12532.242138] 7d27482a 7d4a4a14 812a0018 71290001 41820034 4800001c 60000000 60000000 
[12532.242143] 60000000 60000000 60000000 60420000 <7c210b78> 7c421378 812a0018 71290001 
[12532.245632] INFO: rcu_sched self-detected stall on CPU
[12532.252139]  26-....: (68252 ticks this GP) idle=22a/140000000000001/0 softirq=81224/81224 fqs=30874 
[12532.252140] NMI backtrace for cpu 54
[12532.252144] CPU: 54 PID: 9796 Comm: tlbie_test Tainted: P           OEL   4.15.0-15-generic #16-Ubuntu
[12532.252145] Call Trace:
[12532.252147] [c000003fe566aac0] [c000000000ceb35c] dump_stack+0xb0/0xf4 (unreliable)
[12532.252151] [c000003fe566ab00] [c000000000cf4d48] nmi_cpu_backtrace+0x1f8/0x200
[12532.252154] [c000003fe566ab90] [c000000000cf4ee8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
[12532.252158] [c000003fe566ac20] [c00000000002f2d8] arch_trigger_cpumask_backtrace+0x28/0x40
[12532.252161] [c000003fe566ac40] [c0000000001a913c] rcu_dump_cpu_stacks+0xf4/0x158
[12532.252164] [c000003fe566ac90] [c0000000001a81e8] rcu_check_callbacks+0x8e8/0xb40
[12532.252168] [c000003fe566adc0] [c0000000001b64a8] update_process_times+0x48/0x90
[12532.252170] [c000003fe566adf0] [c0000000001ce1f4] tick_sched_handle.isra.5+0x34/0xd0
[12532.252173] [c000003fe566ae20] [c0000000001ce2f0] tick_sched_timer+0x60/0xe0
[12532.252176] [c000003fe566ae60] [c0000000001b7054] __hrtimer_run_queues+0x144/0x370
[12532.252179] [c000003fe566aee0] [c0000000001b7fac] hrtimer_interrupt+0xfc/0x350
[12532.252182] [c000003fe566afb0] [c0000000000248f0] __timer_interrupt+0x90/0x260
[12532.252186] [c000003fe566b000] [c000000000024d08] timer_interrupt+0x98/0xe0
[12532.252189] [c000003fe566b030] [c000000000009014] decrementer_common+0x114/0x120
[12532.252193] --- interrupt: 901 at smp_call_function_single+0x138/0x180
[12532.252193]     LR = smp_call_function_single+0x110/0x180
[12532.252195] [c000003fe566b390] [c0000000001d55e0] smp_call_function_any+0x180/0x250
[12532.252198] [c000003fe566b3f0] [c000000000acd3e8] gpstate_timer_handler+0x1e8/0x580
[12532.252201] [c000003fe566b4a0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0
[12532.252204] [c000003fe566b520] [c0000000001b4958] expire_timers+0x138/0x1f0
[12532.252207] [c000003fe566b590] [c0000000001b4bf8] run_timer_softirq+0x1e8/0x270
[12532.252210] [c000003fe566b630] [c000000000d0d6c8] __do_softirq+0x158/0x3e4
[12532.252213] [c000003fe566b710] [c000000000114be8] irq_exit+0xe8/0x120
[12532.252216] [c000003fe566b730] [c000000000024d0c] timer_interrupt+0x9c/0xe0
[12532.252219] [c000003fe566b760] [c000000000009014] decrementer_common+0x114/0x120
[12532.252224] --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
[12532.252224]     LR = arch_send_call_function_ipi_mask+0x120/0x130
[12532.252226] [c000003fe566ba50] [c00000000004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
[12532.252229] [c000003fe566ba90] [c0000000001d59f0] smp_call_function_many+0x340/0x450
[12532.252233] [c000003fe566bb00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
[12532.252238] [c000003fe566bb30] [c0000000003a1120] change_huge_pmd+0xe0/0x270
[12532.252241] [c000003fe566bba0] [c000000000349278] change_protection_range+0xb88/0xe40
[12532.252244] [c000003fe566bcf0] [c0000000003496c0] mprotect_fixup+0x140/0x340
[12532.252247] [c000003fe566bdb0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
[12532.252250] [c000003fe566be30] [c00000000000b184] system_call+0x58/0x6c
[12532.256157]   (t=68273 jiffies g=52859 c=52858 q=5658)
[12532.256205] NMI backtrace for cpu 26
[12532.256282] CPU: 26 PID: 9742 Comm: tlbie_test Tainted: P           OEL   4.15.0-15-generic #16-Ubuntu
[12532.256391] Call Trace:
[12532.256435] [c0000000032d3230] [c000000000ceb35c] dump_stack+0xb0/0xf4 (unreliable)
[12532.256575] [c0000000032d3270] [c000000000cf4d48] nmi_cpu_backtrace+0x1f8/0x200
[12532.256699] [c0000000032d3300] [c000000000cf4ee8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
[12532.256840] [c0000000032d3390] [c00000000002f2d8] arch_trigger_cpumask_backtrace+0x28/0x40
[12532.256969] [c0000000032d33b0] [c0000000001a913c] rcu_dump_cpu_stacks+0xf4/0x158
[12532.257095] [c0000000032d3400] [c0000000001a81e8] rcu_check_callbacks+0x8e8/0xb40
[12532.257233] [c0000000032d3530] [c0000000001b64a8] update_process_times+0x48/0x90
[12532.257382] [c0000000032d3560] [c0000000001ce1f4] tick_sched_handle.isra.5+0x34/0xd0
[12532.257523] [c0000000032d3590] [c0000000001ce2f0] tick_sched_timer+0x60/0xe0
[12532.257667] [c0000000032d35d0] [c0000000001b7054] __hrtimer_run_queues+0x144/0x370
[12532.257778] [c0000000032d3650] [c0000000001b7fac] hrtimer_interrupt+0xfc/0x350
[12532.257887] [c0000000032d3720] [c0000000000248f0] __timer_interrupt+0x90/0x260
[12532.258001] [c0000000032d3770] [c000000000024d08] timer_interrupt+0x98/0xe0
[12532.258101] [c0000000032d37a0] [c000000000009014] decrementer_common+0x114/0x120
[12532.258214] --- interrupt: 901 at smp_call_function_many+0x3b0/0x450
[12532.258214]     LR = smp_call_function_many+0x35c/0x450
[12532.258378] [c0000000032d3b00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
[12532.258474] [c0000000032d3b30] [c0000000003a1120] change_huge_pmd+0xe0/0x270
[12532.258569] [c0000000032d3ba0] [c000000000349278] change_protection_range+0xb88/0xe40
[12532.258684] [c0000000032d3cf0] [c0000000003496c0] mprotect_fixup+0x140/0x340
[12532.258793] [c0000000032d3db0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
[12532.258892] [c0000000032d3e30] [c00000000000b184] system_call+0x58/0x6c
[12532.259037] Sending NMI from CPU 26 to CPUs 54:
[12532.259218] NMI backtrace for cpu 54
[12532.259220] CPU: 54 PID: 9796 Comm: tlbie_test Tainted: P           OEL   4.15.0-15-generic #16-Ubuntu
[12532.259222] NIP:  c0000000001d5360 LR: c0000000001d5340 CTR: c000000000acd200
[12532.259223] REGS: c00000003fd73d80 TRAP: 0100   Tainted: P           OEL    (4.15.0-15-generic)
[12532.259224] MSR:  9000000000089033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48024242  XER: 00000000
[12532.259230] CFAR: c0000000001d5370 SOFTE: 1 
[12532.259230] GPR00: c0000000001d5340 c000003fe566b320 c0000000016eb400 0000000000000000 
[12532.259230] GPR04: c000003fe566b340 c000003fef999600 c000003fe566b450 c000000001721ed8 
[12532.259230] GPR08: c000000001721ed8 0000000000000001 c000003fef99c420 0000000000000000 
[12532.259230] GPR12: c000000000acd200 c00000000faa5200 00007e765bc00000 00007e723ffd0000 
[12532.259230] GPR16: c000003f991b0500 0000000000000100 0000000000000001 0000000000000010 
[12532.259230] GPR20: c000000001712208 0000000000400040 0000000000000000 c000000001713b00 
[12532.259230] GPR24: 00000001002d9e5f 5deadbeef0000200 c000000001a4654c c000003fe20425f8 
[12532.259230] GPR28: 0000000000000001 c000003fe566b450 c000000000acc860 c000003fe566b340 
[12532.259255] NIP [c0000000001d5360] smp_call_function_single+0x130/0x180
[12532.259257] LR [c0000000001d5340] smp_call_function_single+0x110/0x180
[12532.259257] Call Trace:
[12532.259260] [c000003fe566b320] [c0000000001d5340] smp_call_function_single+0x110/0x180 (unreliable)
[12532.259263] [c000003fe566b390] [c0000000001d55e0] smp_call_function_any+0x180/0x250
[12532.259265] [c000003fe566b3f0] [c000000000acd3e8] gpstate_timer_handler+0x1e8/0x580
[12532.259268] [c000003fe566b4a0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0
[12532.259271] [c000003fe566b520] [c0000000001b4958] expire_timers+0x138/0x1f0
[12532.259274] [c000003fe566b590] [c0000000001b4bf8] run_timer_softirq+0x1e8/0x270
[12532.259277] [c000003fe566b630] [c000000000d0d6c8] __do_softirq+0x158/0x3e4
[12532.259280] [c000003fe566b710] [c000000000114be8] irq_exit+0xe8/0x120
[12532.259283] [c000003fe566b730] [c000000000024d0c] timer_interrupt+0x9c/0xe0
[12532.259286] [c000003fe566b760] [c000000000009014] decrementer_common+0x114/0x120
[12532.259290] --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
[12532.259290]     LR = arch_send_call_function_ipi_mask+0x120/0x130
[12532.259292] [c000003fe566ba50] [c00000000004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
[12532.259295] [c000003fe566ba90] [c0000000001d59f0] smp_call_function_many+0x340/0x450
[12532.259299] [c000003fe566bb00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
[12532.259302] [c000003fe566bb30] [c0000000003a1120] change_huge_pmd+0xe0/0x270
[12532.259305] [c000003fe566bba0] [c000000000349278] change_protection_range+0xb88/0xe40
[12532.259308] [c000003fe566bcf0] [c0000000003496c0] mprotect_fixup+0x140/0x340
[12532.259311] [c000003fe566bdb0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
[12532.259314] [c000003fe566be30] [c00000000000b184] system_call+0x58/0x6c
[12532.259315] Instruction dump:
[12532.259316] 7ca62b78 7c852378 7fe4fb78 4bfffd4d 813f0018 71290001 4182002c 48000014 
[12532.259321] 60000000 60000000 60000000 60420000 <7c210b78> 7c421378 813f0018 71290001 
[12559.068601] systemd[1]: bumblebeed.service: Service hold-off time over, scheduling restart.
[12559.068974] systemd[1]: bumblebeed.service: Scheduled restart job, restart counter is at 208.
[12559.069267] systemd[1]: Stopped Bumblebee C Daemon.
[12559.070657] systemd[1]: Started Bumblebee C Daemon.
[12559.074289] systemd[1]: bumblebeed.service: Main process exited, code=exited, status=1/FAILURE
[12559.074550] systemd[1]: bumblebeed.service: Failed with result 'exit-code'.
[12595.260062] INFO: rcu_sched self-detected stall[12735.417103315,3] OPAL exiting with locks held, token=145 retval=0
[12735.417187276,3]   core/lock.c:216
[12735.417212259,3]   core/lock.c:216
[12735.424663057,5] Unable to log error
 on CPU
[12595.260063] INFO: rcu_sched self-detected stall on CPU
[12595.260070]  54-....: (14 GPsroot@witherspoon:~# 
root@witherspoon:~# 

PNOR Level:


cat /var/lib//phosphor-software-manager/pnor//ro//VERSION  
open-power-witherspoon-v1.22-8-g3f0e29c-dirty
    buildroot-2018.02-2-g05e5240
    skiboot-v5.11-30-gcb16e55a234b
    hostboot-1e784c0
    linux-4.15.14-openpower1-p7016d80
    petitboot-v1.7.1-pd695626
    machine-xml-07a5d2c
    occ-6c9f28e
    hostboot-binaries-7628674
    capp-ucode-p9-dd2-v4
    sbe-783b420```
pridhiviraj commented 6 years ago

@stewart-ibm Used below tests reproduce this issue, it is occurring on more systems now.

tlbie_test
============
two instances

CPU hotplug tests;
=========================
cat cpu_hotplug.sh
#!/bin/bash
set -x
for ((i=0; i<100000; i++))
do
    ppc64_cpu --cores-on=1
    ppc64_cpu --cores-on=32
done

cpu frequency read tests:
========================
cat read_frequency.sh
#!/bin/bash
set -x
for ((i=0; i<100000; i++))
do
    ppc64_cpu --frequency
done

CPU Governor change test:
===========================
cat cpu_freq_gov1.sh 
#!/bin/bash
set -x
CPU_SYS_DIR=/sys/devices/system/cpu/
FIRST_AVAIL_CPU=$(cat /sys/devices/system/cpu/present | cut -d'-' -f1)
res=$(cat /sys/devices/system/cpu/cpu$FIRST_AVAIL_CPU/cpufreq/scaling_available_governors)
IFS=' ' read -r -a array <<< "$res"
for ((i=0; i<1000; i++))
do
    for gov in "${array[@]}"
    do
        for j in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo $gov > $j; done
    done
done
shriyak commented 6 years ago

Hitting Hardlock up issue on Witherspoon machine (DD2.2) with 1808D + stop4 enabled using attribute override.

OS : Ubuntu 1804 Kernel : 4.15.0-15-generic

Call trace

[ 2020.139747] Watchdog CPU:71 detected Hard LOCKUP other CPUS:42 [ 2020.139931] Watchdog CPU:42 Hard LOCKUP [ 2020.139933] Modules linked in: ibmpowernv vmx_crypto ofpart idt_89hpesx ipmi_powernv at24 cmdlinepart uio_pdrv_genirq uio opal_prd crct10dif_vpmsum powernv_flash mtd ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sch_fq_codel sunrpc ip_tables x_tables autofs4 ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_vpmsum drm tg3 libahci [ 2020.139965] CPU: 42 PID: 3391 Comm: a.out Not tainted 4.15.0-15-generic #16-Ubuntu [ 2020.139967] NIP: c0000000001d5364 LR: c0000000001d5340 CTR: c000000000acd200 [ 2020.139969] REGS: c000000007da3d80 TRAP: 0100 Not tainted (4.15.0-15-generic) [ 2020.139970] MSR: 9000000000089033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48024242 XER: 00000000 [ 2020.139977] CFAR: c0000000001d5370 SOFTE: 1 [ 2020.139977] GPR00: c0000000001d5340 c000200dc242f360 c0000000016eb400 0000000000000000 [ 2020.139977] GPR04: c000200dc242f380 c000001ff74d9600 c000200dc242f490 c000000001721ed8 [ 2020.139977] GPR08: c000000001721ed8 0000000000000001 c000001ff74dc0e0 0000000000000000 [ 2020.139977] GPR12: c000000000acd200 c000000007a3ce00 00007c761aa00000 00007c73e12e0000 [ 2020.139977] GPR16: c000001fde6e6c00 0000000000000100 0000000000000001 0000000000000010 [ 2020.139977] GPR20: c000000001712208 0000000000400040 0000000000000000 c000000001713b00 [ 2020.139977] GPR24: 00000001000682d1 5deadbeef0000200 c000000001a46528 c000001fea7a9df8 [ 2020.139977] GPR28: 0000000000000001 c000200dc242f490 c000000000acc860 c000200dc242f380 [ 2020.140003] NIP [c0000000001d5364] smp_call_function_single+0x134/0x180 [ 2020.140006] LR [c0000000001d5340] smp_call_function_single+0x110/0x180 [ 2020.140006] Call Trace: [ 2020.140009] [c000200dc242f360] [c0000000001d5340] smp_call_function_single+0x110/0x180 (unreliable) [ 2020.140012] [c000200dc242f3d0] [c0000000001d55e0] smp_call_function_any+0x180/0x250 [ 2020.140016] [c000200dc242f430] [c000000000acd3e8] gpstate_timer_handler+0x1e8/0x580 [ 2020.140019] [c000200dc242f4e0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0 [ 2020.140022] [c000200dc242f560] [c0000000001b4958] expire_timers+0x138/0x1f0 [ 2020.140024] [c000200dc242f5d0] [c0000000001b4bf8] run_timer_softirq+0x1e8/0x270 [ 2020.140028] [c000200dc242f670] [c000000000d0d6c8] do_softirq+0x158/0x3e4 [ 2020.140032] [c000200dc242f750] [c000000000114be8] irq_exit+0xe8/0x120 [ 2020.140036] [c000200dc242f770] [c000000000024d0c] timer_interrupt+0x9c/0xe0 [ 2020.140039] [c000200dc242f7a0] [c000000000009014] decrementer_common+0x114/0x120 [ 2020.140043] --- interrupt: 901 at smp_call_function_many+0x2b8/0x450 [ 2020.140043] LR = smp_call_function_many+0x324/0x450 [ 2020.140047] [c000200dc242fb00] [c000000000075f18] pmdp_invalidate+0x98/0xe0 [ 2020.140051] [c000200dc242fb30] [c0000000003a1120] change_huge_pmd+0xe0/0x270 [ 2020.140055] [c000200dc242fba0] [c000000000349278] change_protection_range+0xb88/0xe40 [ 2020.140058] [c000200dc242fcf0] [c0000000003496c0] mprotect_fixup+0x140/0x340 [ 2020.140061] [c000200dc242fdb0] [c000000000349a74] SyS_mprotect+0x1b4/0x350 [ 2020.140064] [c000200dc242fe30] [c00000000000b184] system_call+0x58/0x6c [ 2020.140065] Instruction dump: [ 2020.140066] 7c852378 7fe4fb78 4bfffd4d 813f0018 71290001 4182002c 48000014 60000000 [ 2020.140071] 60000000 60000000 60420000 7c210b78 <7c421378> 813f0018 71290001 4082fff0 [ 2027.903588] INFO: rcu_sched self-detected stall on CPU [ 2027.903590] INFO: rcu_sched self-detected stall on CPU [ 2027.903596] 24-....: (5248 ticks this GP) idle=832/140000000000001/0 softirq=171758/171758 fqs=2625 [ 2027.903597] LOCK ERROR: Unlocked non-owned lock @0x3045fc08 (state: 0x0000003200000001) [ 2274.374397064,0] Aborting! CPU 0032 Backtrace: S: 0000000031ccb770 R: 000000003001362c .backtrace+0x48 S: 0000000031ccb810 R: 000000003001a0fc ._abort+0x4c S: 0000000031ccb890 R: 0000000030017ae0 .lock_error+0x64 S: 0000000031ccb910 R: 0000000030017610 .unlock+0x60 S: 0000000031ccb980 R: 0000000030036fc0 .lpc_write+0x84 S: 0000000031ccba30 R: 00000000300387b4 .uart_write+0x4c S: 0000000031ccbaa0 R: 0000000030038a40 .uart_con_flush+0xd0 S: 0000000031ccbb30 R: 0000000030039160 .__uart_do_poll+0x4c S: 0000000031ccbc20 R: 000000003001b600 .opal_run_pollers+0x148 S: 0000000031ccbca0 R: 000000003001b68c .opal_poll_events+0x74 S: 0000000031ccbd20 R: 000000003000515c opal_entry+0xac S: 0000000031ccbf00 R: 0000000030002788 secondary_wait+0x8c [ 2274.375983052,6] BT: seq 0x94 netfn 0x0a cmd 0x42: Message sent to host [ 2274.379494165,3] OPAL exiting with locks held, token=145 retval=0 [ 2274.380136507,3] core/lock.c:216 [ 2274.380154420,3] core/lock.c:216 [ 2274.390978370,5] Unable to log error [ 2274.427185749,0] Assert fail: core/mem_region.c:447:lock_held_by_me(&region->free_list_lock) [ 2283.366568213,0] Assert fail: core/mem_region.c:447:lock_held_by_me(&region->free_list_lock)

ghost commented 6 years ago

shriyak notifications@github.com writes:

Hitting Hardlock up issue on Witherspoon machine (DD2.2) with 1808D + stop4 enabled using attribute override.

What code is in 1808D? Pretty much unless everything is with absolutely today's op-build, there's known problems with stop4 and above.

-- Stewart Smith OPAL Architect, IBM.

mzipse commented 6 years ago

skiboot level in 1808D is skiboot-v5.10-55-g603beb4500f5

svaidy commented 6 years ago

What is this bumblebeed.service? This fails/restarts in both this wsp case and in boston case? Can it hold up CPUs in OPAL?

I have updated observations in https://github.com/open-power/boston-openpower/issues/1180.

npiggin commented 6 years ago

This is a deadlock due to a synchronous smp_call_function being called from the timer interrupt:

[12532.259260] [c000003fe566b320] [c0000000001d5340] smp_call_function_single+0x110/0x180 (unreliable) [12532.259263] [c000003fe566b390] [c0000000001d55e0] smp_call_function_any+0x180/0x250 [12532.259265] [c000003fe566b3f0] [c000000000acd3e8] gpstate_timer_handler+0x1e8/0x580 [12532.259268] [c000003fe566b4a0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0 [12532.259271] [c000003fe566b520] [c0000000001b4958] expire_timers+0x138/0x1f0 [12532.259274] [c000003fe566b590] [c0000000001b4bf8] run_timer_softirq+0x1e8/0x270 [12532.259277] [c000003fe566b630] [c000000000d0d6c8] __do_softirq+0x158/0x3e4 [12532.259280] [c000003fe566b710] [c000000000114be8] irq_exit+0xe8/0x120 [12532.259283] [c000003fe566b730] [c000000000024d0c] timer_interrupt+0x9c/0xe0 [12532.259286] [c000003fe566b760] [c000000000009014] decrementer_common+0x114/0x120 [12532.259290] --- interrupt: 901 at doorbell_global_ipi+0x34/0x50 [12532.259290] LR = arch_send_call_function_ipi_mask+0x120/0x130 [12532.259292] [c000003fe566ba50] [c00000000004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable) [12532.259295] [c000003fe566ba90] [c0000000001d59f0] smp_call_function_many+0x340/0x450 [12532.259299] [c000003fe566bb00] [c000000000075f18] pmdp_invalidate+0x98/0xe0 [12532.259302] [c000003fe566bb30] [c0000000003a1120] change_huge_pmd+0xe0/0x270 [12532.259305] [c000003fe566bba0] [c000000000349278] change_protection_range+0xb88/0xe40 [12532.259308] [c000003fe566bcf0] [c0000000003496c0] mprotect_fixup+0x140/0x340 [12532.259311] [c000003fe566bdb0] [c000000000349a74] SyS_mprotect+0x1b4/0x350 [12532.259314] [c000003fe566be30] [c00000000000b184] system_call+0x58/0x6c

svaidy commented 6 years ago

Fix upstream in 4.17-rc3

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=c0f7f5b6c69107ca92909512533e70258ee19188 cpufreq: powernv: Fix hardlockup due to synchronous smp_call in timer interrupt

Posted to stable as well.

ghost commented 6 years ago

Vaidyanathan Srinivasan notifications@github.com writes:

Fix upstream in 4.17-rc3

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=c0f7f5b6c69107ca92909512533e70258ee19188 cpufreq: powernv: Fix hardlockup due to synchronous smp_call in timer interrupt

Posted to stable as well.

https://github.com/open-power/op-build/pull/2083 will bring it into op-build -- Stewart Smith OPAL Architect, IBM.