Closed sathnaga closed 6 years ago
------- Comment From chavez@us.ibm.com 2017-12-08 12:48:35 EDT------- Which firmware version is installed?
------- Comment From seg@us.ibm.com 2018-02-15 00:39:50 EDT------- OK the firmware is there. Please test.
Used latest FW and able to reproduce the issue
ENV:
FW: fips910/b0215b_1808.910
Host Kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le
lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Model: 2.0 (pvr 004e 1200)
Model name: POWER9 (raw), altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
Hypervisor vendor: (null)
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
# cat /proc/cmdline
root=/dev/mapper/host_os_ltczzj2-root ro console=tty0 rd.lvm.lv=host_os_ltczzj2/root rd.lvm.lv=host_os_ltczzj2/swap LANG=en_US.UTF-8
Steps to reproduce: Just run any avocado guest tests.
python avocado-setup.py --run-suite guest_cpu --guest-os HostOS.ppc64le --no-download
00:22:56 INFO : Check for environment
00:22:56 INFO : Creating temporary mux dir
00:23:06 INFO :
00:23:06 INFO : Running Guest Tests Suite cpu
00:23:06 INFO : Running: /usr/bin/avocado run --vt-type libvirt --vt-config /home/sath/avocado-fvt-wrapper/data/avocado-vt/backends/libvirt/cfg/cpu.cfg --force-job-id d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3 --vt-only-filter "HostOS.ppc64le" --job-results-dir /home/sath/avocado-fvt-wrapper/results
JOB ID : d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3
JOB LOG : /home/sath/avocado-fvt-wrapper/results/job-2018-02-19T00.23-d40eec1/job.log
(001/853) guest_import.qemu.qcow2.virtio_scsi.smp2.virtio_net.Guest.HostOS.ppc64le.powerkvm-qemu.unattended_install.import.import.default_install.aio_native: -
Message from syslogd@ltczzj2 at Feb 19 00:24:14 ...
kernel:Watchdog CPU:97 detected Hard LOCKUP other CPUS:38
dmesg:
[ 186.241993] virbr0: port 2(vnet0) entered disabled state
[ 200.167913] Watchdog CPU:97 detected Hard LOCKUP other CPUS:38
[ 246.439590] INFO: rcu_sched self-detected stall on CPU
[ 246.439710] 74-....: (6000 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=2889
[ 246.439795] (t=6000 jiffies g=2031 c=2030 q=30958)
[ 246.439852] NMI backtrace for cpu 74
[ 246.439891] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[ 246.439976] Call Trace:
[ 246.440007] [c000203835602ef0] [c000000000b0c19c] dump_stack+0xb0/0xf4 (unreliable)
[ 246.440082] [c000203835602f30] [c000000000b153c4] nmi_cpu_backtrace+0x1a4/0x210
[ 246.440156] [c000203835602fc0] [c000000000b1560c] nmi_trigger_cpumask_backtrace+0x1dc/0x220
[ 246.440232] [c000203835603060] [c00000000002f448] arch_trigger_cpumask_backtrace+0x28/0x40
[ 246.440307] [c000203835603080] [c000000000197a6c] rcu_dump_cpu_stacks+0xfc/0x160
[ 246.440380] [c0002038356030d0] [c000000000196c90] rcu_check_callbacks+0x8e0/0xaf0
[ 246.440454] [c000203835603200] [c0000000001a3874] update_process_times+0x44/0x90
[ 246.440529] [c000203835603230] [c0000000001bac8c] tick_sched_handle.isra.13+0x4c/0x80
[ 246.440602] [c000203835603250] [c0000000001bad20] tick_sched_timer+0x60/0xc0
[ 246.440675] [c000203835603290] [c0000000001a44a8] __hrtimer_run_queues+0xf8/0x330
[ 246.440749] [c000203835603310] [c0000000001a524c] hrtimer_interrupt+0xec/0x290
[ 246.440823] [c0002038356033d0] [c000000000024438] __timer_interrupt+0x98/0x280
[ 246.440897] [c000203835603420] [c000000000024b58] timer_interrupt+0xa8/0xe0
[ 246.440959] [c000203835603450] [c0000000000092a8] decrementer_common+0x158/0x160
[ 246.441034] --- interrupt: 901 at smp_call_function_many+0x388/0x420
LR = smp_call_function_many+0x328/0x420
[ 246.441145] [c0002038356037b0] [c000000000072328] serialize_against_pte_lookup+0x38/0x50
[ 246.441219] [c0002038356037d0] [c0000000000744d0] radix__pmdp_huge_get_and_clear+0x60/0x80
[ 246.441294] [c000203835603800] [c00000000034598c] zap_huge_pmd+0x8c/0x540
[ 246.441356] [c0002038356038a0] [c0000000002df120] unmap_page_range+0xe80/0x10a0
[ 246.441430] [c0002038356039e0] [c0000000002df834] unmap_vmas+0x84/0x100
[ 246.441491] [c000203835603a30] [c0000000002edf1c] exit_mmap+0x9c/0x1c0
[ 246.441553] [c000203835603af0] [c0000000000ffd28] mmput+0xb8/0x1f0
[ 246.441614] [c000203835603b20] [c00000000010a1c8] do_exit+0x358/0xcd0
[ 246.441676] [c000203835603be0] [c00000000010ac14] do_group_exit+0x64/0x100
[ 246.441738] [c000203835603c20] [c00000000011aba0] get_signal+0x210/0x700
[ 246.441799] [c000203835603d10] [c00000000001cb00] do_signal+0x80/0x2e0
[ 246.441861] [c000203835603e00] [c00000000001cf14] do_notify_resume+0xd4/0x100
[ 246.441935] [c000203835603e30] [c00000000000bf44] ret_from_except_lite+0x70/0x74
[ 426.487634] INFO: rcu_sched self-detected stall on CPU
[ 426.487722] 74-....: (24004 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=11606
[ 426.487733] (t=24004 jiffies g=2031 c=2030 q=119140)
[ 426.487791] NMI backtrace for cpu 74
[ 426.487830] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[ 426.487915] Call Trace:
P:S: with powersave=off
kernel commandline no issues seen.
------- Comment From seg@us.ibm.com 2018-02-19 08:52:54 EDT------- OPAL team, please have a look.
==== State: Rejected by: dcrowell on 19 February 2018 13:18:55 ====
I have tried to read the pointed-at bug and I can't make any sense of it so I have no idea what the fix is that is being referred to. It looks like there have been several iterations of different problems described. Please summarize what the expected fix is and reopen. Or better yet, just dupe this directly.
------- Comment From satheera@in.ibm.com 2018-02-27 09:57:06 EDT------- I am able to reproduce the similar issue with DD2.2 aswell with the latest FW with powersave enabled and with powersave disabled(i.e powersave=off kernel cmdline) this issue is not seen.
Host Env $cupdcmd -f ...
Current Side Driver:.....fips910/b0223b_1810.910 Non-Current Side Driver:.fips910/b0215b_1808.910
Model: 2.2 (pvr 004e 1202) Model name: POWER9 (raw), altivec supported
Host kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le
kernel cmdline: root=/dev/mapper/host_os_9-root ro console=tty0 rd.lvm.lv=host_os_9/root rd.lvm.lv=host_os_9/swap LANG=en_US.UTF-8
Steps to reproduce.
FSP Event/error logs has this entry then, 5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?
Platform Event Log - 5302D258 Created at : 02/27/2018 14:17:03 Subsystem : CEC Hardware Subsystem Event Severity : Unrecoverable Error Action Flags : Report Externally Action Status : Reported to Opr Sys
Primary System Reference Code
Reference Code : BB82C013 Hex Words 2 - 5 : 00000080 00000000 00000000 00000000 Hex Words 6 - 9 : 00000000 00000000 00000000 00000000
Log Hex Dump
00000000 50480030 01004F43 20180227 14170300 PH.0..OC ..'....
00000010 20180227 14170430 4B000106 00000000 ..'...0K.......
00000020 00000000 00000000 B0000002 5302D258 ............S..X
00000030 55480018 01004F43 50004000 00000000 UH....OCP.@.....
00000040 00002000 01004500 50530050 01004F43 .. ...E.PS.P..OC
00000050 02000008 00000048 00000080 00000000 .......H........
00000060 00000000 00000000 00000000 00000000 ................
00000070 00000000 00000000 42423832 43303133 ........BB82C013
00000080 20202020 20202020 20202020 20202020
00000090 20202020 20202020 4548004C 01004F43 EH.L..OC
000000A0 39303038 2D32324C 31333642 45444100 9008-22L136BEDA.
000000B0 00000000 00000000 00000000 00000000 ................
000000C0 00000000 00000000 00000000 00000000 ................
000000D0 00000000 00000000 20180227 14170300 ........ ..'....
000000E0 00000000 4D54001C 01004F43 39303038 ....MT....OC9008
000000F0 2D32324C 31333642 45444100 00000000 -22L136BEDA.....
00000100 5544004D 01004F43 44455343 00450000 UD.M..OCDESC.E..
00000110 4F43433A 20496E69 7469616C 697A6174 OCC: Initializat
00000120 696F6E20 6F6E2061 6C6C2063 68697073 ion on all chips
00000130 20646964 206E6F74 20636F6D 706C6574 did not complet
00000140 65287469 6D656420 6F757429 0A e(timed out).
Regards, -Satheesh
------- Comment From satheera@in.ibm.com 2018-02-27 10:08:25 EDT------- I just now noticed the issue after a few test runs inside host even with powersave disabled aswell. During the issue occurred, host was idle though, not running any tests/load.
FSP log event/errors histroy: Log ID Time Failing subsystem Severity SRC Reported 5302D602 2018-02-27 15:17:40 CEC Hardware Subsystem Unrecoverable Error BB82C013 ? ------------------------------------------------------------------------------------------------------------------->powersave=off 5002D4A0 2018-02-27 15:11:42 Service Processor Firmware Unrecoverable Error B1813450 ? 5002D295 2018-02-27 14:21:32 OCC Unrecoverable Error, Degraded Performance B18B2616 ? 5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ? 5302D218 2018-02-27 14:12:48 CEC Hardware Subsystem Unrecoverable Error BB82C013 ? 5002D0A4 2018-02-27 14:07:06 Service Processor Firmware Unrecoverable Error B1813450 ? 5002CE92 2018-02-27 13:16:56 OCC Unrecoverable Error, Degraded Performance B18B2616 ?
Regards, -Satheesh
------- Comment From npiggin@au1.ibm.com 2018-02-27 19:27:03 EDT------- Are you getting the same kernel logs with powersave=off? Or is this a new bug?
If the kernel messages include "smp_call_function_many", then we need to get traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic
should crash the system when you get a lockup and give traces.
------- Comment From satheera@in.ibm.com 2018-02-28 06:11:34 EDT-------
(In reply to comment #16)
> Are you getting the same kernel logs with powersave=off? Or is this a new
> bug?
>
> If the kernel messages include "smp_call_function_many", then we need to get
> traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic
should
> crash the system when you get a lockup and give traces.
Looks like this is different issue, am not completely sure though, host does not have any traces at all, it dies suddenly and FSP error log has the below issue reported, looked to checkstop?? , I just wanted reuse the FW bug which got mirrored already,
FSP Event/error logs has this entry then, 5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?
Platform Event Log - 5302D258 Created at : 02/27/2018 14:17:03 Subsystem : CEC Hardware Subsystem Event Severity : Unrecoverable Error Action Flags : Report Externally Action Status : Reported to Opr Sys
Regards, -Satheesh.
------- Comment From bssrikanth@in.ibm.com 2018-03-06 02:27:12 EDT------- This bug is gating us from using the ZZ small core DD 2.2 for KVM testing. Please help resolve asap.
------- Comment From satheera@in.ibm.com 2018-03-08 03:54:01 EDT-------
Am able to reproduce the same(In reply to comment #16)
> Are you getting the same kernel logs with powersave=off? Or is this a new
> bug?
>
> If the kernel messages include "smp_call_function_many", then we need to get
> traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic
should
> crash the system when you get a lockup and give traces.
yes, looks like new bug, I will close this issue and open a new one for the issue, even the latest FW reproduces it.
Regards, -Satheesh
------- Comment From satheera@in.ibm.com 2018-03-08 03:55:46 EDT------- Initial issue reported in the bz is not seen.