Open pridhiviraj opened 6 years ago
@stewart-ibm I used the tests below to reproduce this issue; it is now occurring on more systems.
tlbie_test
============
two instances
CPU hotplug tests:
=========================
cat cpu_hotplug.sh
#!/bin/bash
set -x
for ((i=0; i<100000; i++))
do
ppc64_cpu --cores-on=1
ppc64_cpu --cores-on=32
done
CPU frequency read tests:
========================
cat read_frequency.sh
#!/bin/bash
set -x
for ((i=0; i<100000; i++))
do
ppc64_cpu --frequency
done
CPU Governor change test:
===========================
cat cpu_freq_gov1.sh
#!/bin/bash
set -x
CPU_SYS_DIR=/sys/devices/system/cpu/
FIRST_AVAIL_CPU=$(cat /sys/devices/system/cpu/present | cut -d'-' -f1)
res=$(cat /sys/devices/system/cpu/cpu$FIRST_AVAIL_CPU/cpufreq/scaling_available_governors)
IFS=' ' read -r -a array <<< "$res"
for ((i=0; i<1000; i++))
do
for gov in "${array[@]}"
do
for j in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo "$gov" > "$j"; done
done
done
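The report lists the reproducers but does not say how they were combined. A minimal sketch, assuming they were run side by side: `run_all` is a hypothetical helper (not part of the original report) that starts each script in the background and waits for all of them. The `tlbie_test` invocation is not given in the report, so it is left out here.

```shell
#!/bin/bash
# Hypothetical helper, not from the original report: start every script
# passed as an argument in the background, then wait for all of them.
# Returns non-zero if any script exited with a failure.
run_all() {
    local pids="" rc=0 script pid
    for script in "$@"; do
        bash "$script" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return "$rc"
}

# Example (assumes the three scripts above are saved in the current directory):
# run_all cpu_hotplug.sh read_frequency.sh cpu_freq_gov1.sh
```

Running the scripts concurrently is an assumption; if the lockup only needs one of them, it may be simpler to start with a single script and add the others one at a time.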
Hitting a hard lockup issue on a Witherspoon machine (DD2.2) with 1808D and stop4 enabled via attribute override.
OS: Ubuntu 18.04, Kernel: 4.15.0-15-generic
Call trace
[ 2020.139747] Watchdog CPU:71 detected Hard LOCKUP other CPUS:42
[ 2020.139931] Watchdog CPU:42 Hard LOCKUP
[ 2020.139933] Modules linked in: ibmpowernv vmx_crypto ofpart idt_89hpesx ipmi_powernv at24 cmdlinepart uio_pdrv_genirq uio opal_prd crct10dif_vpmsum powernv_flash mtd ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sch_fq_codel sunrpc ip_tables x_tables autofs4 ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_vpmsum drm tg3 libahci
[ 2020.139965] CPU: 42 PID: 3391 Comm: a.out Not tainted 4.15.0-15-generic #16-Ubuntu
[ 2020.139967] NIP: c0000000001d5364 LR: c0000000001d5340 CTR: c000000000acd200
[ 2020.139969] REGS: c000000007da3d80 TRAP: 0100 Not tainted (4.15.0-15-generic)
[ 2020.139970] MSR: 9000000000089033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48024242 XER: 00000000
[ 2020.139977] CFAR: c0000000001d5370 SOFTE: 1
[ 2020.139977] GPR00: c0000000001d5340 c000200dc242f360 c0000000016eb400 0000000000000000
[ 2020.139977] GPR04: c000200dc242f380 c000001ff74d9600 c000200dc242f490 c000000001721ed8
[ 2020.139977] GPR08: c000000001721ed8 0000000000000001 c000001ff74dc0e0 0000000000000000
[ 2020.139977] GPR12: c000000000acd200 c000000007a3ce00 00007c761aa00000 00007c73e12e0000
[ 2020.139977] GPR16: c000001fde6e6c00 0000000000000100 0000000000000001 0000000000000010
[ 2020.139977] GPR20: c000000001712208 0000000000400040 0000000000000000 c000000001713b00
[ 2020.139977] GPR24: 00000001000682d1 5deadbeef0000200 c000000001a46528 c000001fea7a9df8
[ 2020.139977] GPR28: 0000000000000001 c000200dc242f490 c000000000acc860 c000200dc242f380
[ 2020.140003] NIP [c0000000001d5364] smp_call_function_single+0x134/0x180
[ 2020.140006] LR [c0000000001d5340] smp_call_function_single+0x110/0x180
[ 2020.140006] Call Trace:
[ 2020.140009] [c000200dc242f360] [c0000000001d5340] smp_call_function_single+0x110/0x180 (unreliable)
[ 2020.140012] [c000200dc242f3d0] [c0000000001d55e0] smp_call_function_any+0x180/0x250
[ 2020.140016] [c000200dc242f430] [c000000000acd3e8] gpstate_timer_handler+0x1e8/0x580
[ 2020.140019] [c000200dc242f4e0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0
[ 2020.140022] [c000200dc242f560] [c0000000001b4958] expire_timers+0x138/0x1f0
[ 2020.140024] [c000200dc242f5d0] [c0000000001b4bf8] run_timer_softirq+0x1e8/0x270
[ 2020.140028] [c000200dc242f670] [c000000000d0d6c8] __do_softirq+0x158/0x3e4
[ 2020.140032] [c000200dc242f750] [c000000000114be8] irq_exit+0xe8/0x120
[ 2020.140036] [c000200dc242f770] [c000000000024d0c] timer_interrupt+0x9c/0xe0
[ 2020.140039] [c000200dc242f7a0] [c000000000009014] decrementer_common+0x114/0x120
[ 2020.140043] --- interrupt: 901 at smp_call_function_many+0x2b8/0x450
[ 2020.140043] LR = smp_call_function_many+0x324/0x450
[ 2020.140047] [c000200dc242fb00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
[ 2020.140051] [c000200dc242fb30] [c0000000003a1120] change_huge_pmd+0xe0/0x270
[ 2020.140055] [c000200dc242fba0] [c000000000349278] change_protection_range+0xb88/0xe40
[ 2020.140058] [c000200dc242fcf0] [c0000000003496c0] mprotect_fixup+0x140/0x340
[ 2020.140061] [c000200dc242fdb0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
[ 2020.140064] [c000200dc242fe30] [c00000000000b184] system_call+0x58/0x6c
[ 2020.140065] Instruction dump:
[ 2020.140066] 7c852378 7fe4fb78 4bfffd4d 813f0018 71290001 4182002c 48000014 60000000
[ 2020.140071] 60000000 60000000 60420000 7c210b78 <7c421378> 813f0018 71290001 4082fff0
[ 2027.903588] INFO: rcu_sched self-detected stall on CPU
[ 2027.903590] INFO: rcu_sched self-detected stall on CPU
[ 2027.903596] 24-....: (5248 ticks this GP) idle=832/140000000000001/0 softirq=171758/171758 fqs=2625
[ 2027.903597] LOCK ERROR: Unlocked non-owned lock @0x3045fc08 (state: 0x0000003200000001)
[ 2274.374397064,0] Aborting!
CPU 0032 Backtrace:
S: 0000000031ccb770 R: 000000003001362c .backtrace+0x48
S: 0000000031ccb810 R: 000000003001a0fc ._abort+0x4c
S: 0000000031ccb890 R: 0000000030017ae0 .lock_error+0x64
S: 0000000031ccb910 R: 0000000030017610 .unlock+0x60
S: 0000000031ccb980 R: 0000000030036fc0 .lpc_write+0x84
S: 0000000031ccba30 R: 00000000300387b4 .uart_write+0x4c
S: 0000000031ccbaa0 R: 0000000030038a40 .uart_con_flush+0xd0
S: 0000000031ccbb30 R: 0000000030039160 .__uart_do_poll+0x4c
S: 0000000031ccbc20 R: 000000003001b600 .opal_run_pollers+0x148
S: 0000000031ccbca0 R: 000000003001b68c .opal_poll_events+0x74
S: 0000000031ccbd20 R: 000000003000515c opal_entry+0xac
S: 0000000031ccbf00 R: 0000000030002788 secondary_wait+0x8c
[ 2274.375983052,6] BT: seq 0x94 netfn 0x0a cmd 0x42: Message sent to host
[ 2274.379494165,3] OPAL exiting with locks held, token=145 retval=0
[ 2274.380136507,3] core/lock.c:216
[ 2274.380154420,3] core/lock.c:216
[ 2274.390978370,5] Unable to log error
[ 2274.427185749,0] Assert fail: core/mem_region.c:447:lock_held_by_me(&region->free_list_lock)
[ 2283.366568213,0] Assert fail: core/mem_region.c:447:lock_held_by_me(&region->free_list_lock)
shriyak notifications@github.com writes:
Hitting a hard lockup issue on a Witherspoon machine (DD2.2) with 1808D and stop4 enabled via attribute override.
What code is in 1808D? Unless pretty much everything is from today's op-build, there are known problems with stop4 and above.
-- Stewart Smith OPAL Architect, IBM.
skiboot level in 1808D is skiboot-v5.10-55-g603beb4500f5
What is this bumblebeed.service? Does it fail/restart in both this Witherspoon (wsp) case and the Boston case? Can it hold up CPUs in OPAL?
I have updated observations in https://github.com/open-power/boston-openpower/issues/1180.
This is a deadlock due to a synchronous smp_call_function being called from the timer interrupt:
[12532.259260] [c000003fe566b320] [c0000000001d5340] smp_call_function_single+0x110/0x180 (unreliable)
[12532.259263] [c000003fe566b390] [c0000000001d55e0] smp_call_function_any+0x180/0x250
[12532.259265] [c000003fe566b3f0] [c000000000acd3e8] gpstate_timer_handler+0x1e8/0x580
[12532.259268] [c000003fe566b4a0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0
[12532.259271] [c000003fe566b520] [c0000000001b4958] expire_timers+0x138/0x1f0
[12532.259274] [c000003fe566b590] [c0000000001b4bf8] run_timer_softirq+0x1e8/0x270
[12532.259277] [c000003fe566b630] [c000000000d0d6c8] __do_softirq+0x158/0x3e4
[12532.259280] [c000003fe566b710] [c000000000114be8] irq_exit+0xe8/0x120
[12532.259283] [c000003fe566b730] [c000000000024d0c] timer_interrupt+0x9c/0xe0
[12532.259286] [c000003fe566b760] [c000000000009014] decrementer_common+0x114/0x120
[12532.259290] --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
[12532.259290] LR = arch_send_call_function_ipi_mask+0x120/0x130
[12532.259292] [c000003fe566ba50] [c00000000004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
[12532.259295] [c000003fe566ba90] [c0000000001d59f0] smp_call_function_many+0x340/0x450
[12532.259299] [c000003fe566bb00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
[12532.259302] [c000003fe566bb30] [c0000000003a1120] change_huge_pmd+0xe0/0x270
[12532.259305] [c000003fe566bba0] [c000000000349278] change_protection_range+0xb88/0xe40
[12532.259308] [c000003fe566bcf0] [c0000000003496c0] mprotect_fixup+0x140/0x340
[12532.259311] [c000003fe566bdb0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
[12532.259314] [c000003fe566be30] [c00000000000b184] system_call+0x58/0x6c
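To check whether a captured log shows the same signature, a small filter can look for the three markers seen in the traces in this thread: a `Hard LOCKUP` report together with `gpstate_timer_handler` and `smp_call_function_any`. This is only a sketch; the function name `has_gpstate_lockup` is made up for illustration, and a match is a hint rather than proof that the log shows this exact deadlock.

```shell
#!/bin/bash
# Sketch (hypothetical helper): read a kernel log on stdin and succeed only
# if it contains all three strings taken from the traces in this thread.
has_gpstate_lockup() {
    local log pattern
    log=$(cat)
    for pattern in 'Hard LOCKUP' 'gpstate_timer_handler' 'smp_call_function_any'; do
        case $log in
            *"$pattern"*) ;;      # marker present, check the next one
            *) return 1 ;;        # marker missing, not this signature
        esac
    done
    return 0
}

# Example:
# dmesg | has_gpstate_lockup && echo "log matches the gpstate timer lockup signature"
```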
Fix upstream in 4.17-rc3
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=c0f7f5b6c69107ca92909512533e70258ee19188
cpufreq: powernv: Fix hardlockup due to synchronous smp_call in timer interrupt
Posted to stable as well.
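A rough way to tell whether a running kernel is at least as new as the release carrying the fix is a version comparison with `sort -V`. This is a sketch only: `version_at_least` is a hypothetical helper, it relies on GNU `sort`, and it cannot see stable or distro backports (an older kernel may already carry the fix), nor does it distinguish 4.17-rc1/rc2 from rc3.

```shell
#!/bin/bash
# Hypothetical helper: version_at_least MIN VER
# Succeeds if VER sorts at or after MIN in version order (GNU sort -V).
# sort -C exits 0 without output when its input is already sorted.
version_at_least() {
    local min=$1 ver=$2
    printf '%s\n%s\n' "$min" "$ver" | sort -C -V
}

# Example:
# if version_at_least 4.17 "$(uname -r)"; then
#     echo "kernel is 4.17 or newer"
# else
#     echo "kernel predates 4.17; check whether the fix was backported"
# fi
```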
Vaidyanathan Srinivasan notifications@github.com writes:
Fix upstream in 4.17-rc3
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=c0f7f5b6c69107ca92909512533e70258ee19188
cpufreq: powernv: Fix hardlockup due to synchronous smp_call in timer interrupt
Posted to stable as well.
https://github.com/open-power/op-build/pull/2083 will bring it into op-build
-- Stewart Smith OPAL Architect, IBM.
This long-standing bug still exists in the latest upstream PNOR as well as in the officially released PNORs. @stewart-ibm This issue occurs both on your Witherspoon system, which has the latest code, and on a system I got from vaidy with a released PNOR.
PNOR Level: