Hardlockup during a VM boot on zz with D2.0 with latest devel branch

sathnaga commented 6 years ago

Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=162352 Hitting hardlockup with latest hostos devel branch while it was booting VM(hostos guest and ran `yum update` test inside the guest, I guess test irrespective of test, guest boot caused it.) kernel: 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le ``` Machine used: zz - DD2.0 # lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 4 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Model: 2.0 (pvr 004e 1200) Model name: POWER9 (raw), altivec supported CPU max MHz: 3800.0000 CPU min MHz: 2283.0000 Hypervisor vendor: (null) Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 10240K NUMA node0 CPU(s): 0-63 NUMA node8 CPU(s): 64-127 FW: fips910/b1117d_1748.910 ``` JOB LOG : /home/workspace/runAvocadoFVTTest/avocado-fvt-wrapper/results/job-2017-12-08T08.18-fe6f991/job.log (1/5) guest_import.qemu.qcow2.virtio_scsi.smp2.virtio_net.HostOS.ppc64le.powerkvm-qemu.unattended_install.import.import.default_install.aio_native: PASS (47.01 s) (2/5) guest_update.yum.qemu.qcow2.virtio_scsi.smp2.virtio_net.HostOS.ppc64le.powerkvm-qemu.guest_test.isa_serial_operations: PASS (58.74 s) ... ``` [ 7475.715366] virbr0: topology change detected, propagating [ 7553.518889] virbr0: port 2(vnet0) entered disabled state [ 7553.525649] device vnet0 left promiscuous mode [ 7553.525718] virbr0: port 2(vnet0) entered disabled state [ 7569.408657] Watchdog CPU:32 detected Hard LOCKUP other CPUS:117 Message from syslogd@ltczzj2 at Dec 8 08:21:19 ... kernel:Watchdog CPU:32 detected Hard LOCKUP other CPUS:117 [ 7584.519034] Watchdog CPU:94 detected Hard LOCKUP other CPUS:44 [ 7584.519142] Watchdog CPU:44 Hard LOCKUP [ 7584.519206] Modules linked in: vhost_net vhost tap binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ebtable_nat ebtable_broute ib_ipoib rdma_ucm ib_ucm ib_uverbs bridge ib_umad stp llc rdma_cm ib_cm iw_cm ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat mlx5_ib ib_core nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf powernv_op_panel ipmi_msghandler opal_prd nfsd auth_rpcgss oid_registry [ 7584.519872] nfs_acl lockd grace sunrpc kvm_hv kvm lpfc mlx5_core bnx2x scsi_transport_fc mdio libcrc32c ptp pps_core [ 7584.519983] CPU: 44 PID: 35342 Comm: CPU 30/KVM Not tainted 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le #1 [ 7584.520072] task: c000003f7da1ce00 task.stack: c000003f7dad4000 [ 7584.520135] NIP: c0000000001b9464 LR: c0000000001b9408 CTR: c0000000000420e0 [ 7584.520209] REGS: c000003f7dad74c0 TRAP: 0e81 Not tainted (4.14.0-3.dev.git68b4afb.el7.centos.ppc64le) [ 7584.520296] MSR: 9000000000009033 CR: 44224484 XER: 20040000 [ 7584.520378] CFAR: c0000000001b9470 SOFTE: 1 [ 7584.520378] GPR00: c0000000001b9408 c000003f7dad7740 c000000001403900 0000000000000075 [ 7584.520378] GPR04: 0000000000000400 0000000000000040 0000000000000000 c000000001439d78 [ 7584.520378] GPR08: 0000000000000001 0000000000000003 c000203994b48da0 0000000000000010 [ 7584.520378] GPR12: c0000000000420e0 c00000000fd7ce00 [ 7584.520689] NIP [c0000000001b9464] smp_call_function_many+0x384/0x420 [ 7584.520753] LR [c0000000001b9408] smp_call_function_many+0x328/0x420 [ 7584.520816] Call Trace: [ 7584.520844] [c000003f7dad7740] [c0000000001b9408] smp_call_function_many+0x328/0x420 (unreliable) [ 7584.520934] [c000003f7dad77b0] [c0000000000711c8] serialize_against_pte_lookup+0x38/0x50 [ 7584.521011] [c000003f7dad77d0] [c000000000073360] radix__pmdp_huge_get_and_clear+0x60/0x80 [ 7584.521088] [c000003f7dad7800] [c00000000033ae2c] zap_huge_pmd+0x8c/0x510 [ 7584.521153] [c000003f7dad78a0] [c0000000002d47f0] unmap_page_range+0xe80/0x10a0 [ 7584.521229] [c000003f7dad79e0] [c0000000002d4f04] unmap_vmas+0x84/0xf0 [ 7584.521293] [c000003f7dad7a30] [c0000000002e3508] exit_mmap+0xe8/0x1f0 [ 7584.521376] [c000003f7dad7af0] [c0000000000fa9f8] mmput+0xb8/0x1f0 [ 7584.521494] [c000003f7dad7b20] [c000000000104cf8] do_exit+0x358/0xcd0 [ 7584.521609] [c000003f7dad7be0] [c000000000105744] do_group_exit+0x64/0x100 [ 7584.521725] [c000003f7dad7c20] [c0000000001155d0] get_signal+0x210/0x700 [ 7584.521841] [c000003f7dad7d10] [c00000000001c65c] do_signal+0x6c/0x2d0 [ 7584.521957] [c000003f7dad7e00] [c00000000001ca74] do_notify_resume+0xd4/0x100 [ 7584.522098] [c000003f7dad7e30] [c00000000000bec4] ret_from_except_lite+0x70/0x74 [ 7584.522238] Instruction dump: [ 7584.522310] 7d495214 812a0018 792807e1 41820034 4800001c 60000000 60000000 60000000 [ 7584.522451] 60000000 60000000 60420000 7c210b78 <7c421378> 812a0018 792807e1 4082fff0 Message from syslogd@ltczzj2 at Dec 8 08:21:33 ... kernel:Watchdog CPU:94 detected Hard LOCKUP other CPUS:44 Message from syslogd@ltczzj2 at Dec 8 08:21:33 ... kernel:Watchdog CPU:44 Hard LOCKUP ```

cdeadmin commented 6 years ago

------- Comment From chavez@us.ibm.com 2017-12-08 12:48:35 EDT------- Which firmware version is installed?

cdeadmin commented 6 years ago

------- Comment From seg@us.ibm.com 2018-02-15 00:39:50 EDT------- OK the firmware is there. Please test.

sathnaga commented 6 years ago

Used latest FW and able to reproduce the issue

ENV:

FW: fips910/b0215b_1808.910

Host Kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le
lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Model:                 2.0 (pvr 004e 1200)
Model name:            POWER9 (raw), altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
Hypervisor vendor:     (null)
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-63
NUMA node8 CPU(s):     64-127

# cat /proc/cmdline 
root=/dev/mapper/host_os_ltczzj2-root ro console=tty0 rd.lvm.lv=host_os_ltczzj2/root rd.lvm.lv=host_os_ltczzj2/swap LANG=en_US.UTF-8

Steps to reproduce: Just run any avocado guest tests.

python avocado-setup.py  --run-suite guest_cpu --guest-os HostOS.ppc64le --no-download
00:22:56 INFO    : Check for environment
00:22:56 INFO    : Creating temporary mux dir
00:23:06 INFO    : 
00:23:06 INFO    : Running Guest Tests Suite cpu
00:23:06 INFO    : Running: /usr/bin/avocado run --vt-type libvirt --vt-config /home/sath/avocado-fvt-wrapper/data/avocado-vt/backends/libvirt/cfg/cpu.cfg                 --force-job-id d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3  --vt-only-filter                                             "HostOS.ppc64le" --job-results-dir /home/sath/avocado-fvt-wrapper/results
JOB ID     : d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3
JOB LOG    : /home/sath/avocado-fvt-wrapper/results/job-2018-02-19T00.23-d40eec1/job.log
 (001/853) guest_import.qemu.qcow2.virtio_scsi.smp2.virtio_net.Guest.HostOS.ppc64le.powerkvm-qemu.unattended_install.import.import.default_install.aio_native: -
Message from syslogd@ltczzj2 at Feb 19 00:24:14 ...
 kernel:Watchdog CPU:97 detected Hard LOCKUP other CPUS:38

dmesg:

[  186.241993] virbr0: port 2(vnet0) entered disabled state
[  200.167913] Watchdog CPU:97 detected Hard LOCKUP other CPUS:38
[  246.439590] INFO: rcu_sched self-detected stall on CPU
[  246.439710]  74-....: (6000 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=2889 
[  246.439795]   (t=6000 jiffies g=2031 c=2030 q=30958)
[  246.439852] NMI backtrace for cpu 74
[  246.439891] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[  246.439976] Call Trace:
[  246.440007] [c000203835602ef0] [c000000000b0c19c] dump_stack+0xb0/0xf4 (unreliable)
[  246.440082] [c000203835602f30] [c000000000b153c4] nmi_cpu_backtrace+0x1a4/0x210
[  246.440156] [c000203835602fc0] [c000000000b1560c] nmi_trigger_cpumask_backtrace+0x1dc/0x220
[  246.440232] [c000203835603060] [c00000000002f448] arch_trigger_cpumask_backtrace+0x28/0x40
[  246.440307] [c000203835603080] [c000000000197a6c] rcu_dump_cpu_stacks+0xfc/0x160
[  246.440380] [c0002038356030d0] [c000000000196c90] rcu_check_callbacks+0x8e0/0xaf0
[  246.440454] [c000203835603200] [c0000000001a3874] update_process_times+0x44/0x90
[  246.440529] [c000203835603230] [c0000000001bac8c] tick_sched_handle.isra.13+0x4c/0x80
[  246.440602] [c000203835603250] [c0000000001bad20] tick_sched_timer+0x60/0xc0
[  246.440675] [c000203835603290] [c0000000001a44a8] __hrtimer_run_queues+0xf8/0x330
[  246.440749] [c000203835603310] [c0000000001a524c] hrtimer_interrupt+0xec/0x290
[  246.440823] [c0002038356033d0] [c000000000024438] __timer_interrupt+0x98/0x280
[  246.440897] [c000203835603420] [c000000000024b58] timer_interrupt+0xa8/0xe0
[  246.440959] [c000203835603450] [c0000000000092a8] decrementer_common+0x158/0x160
[  246.441034] --- interrupt: 901 at smp_call_function_many+0x388/0x420
    LR = smp_call_function_many+0x328/0x420
[  246.441145] [c0002038356037b0] [c000000000072328] serialize_against_pte_lookup+0x38/0x50
[  246.441219] [c0002038356037d0] [c0000000000744d0] radix__pmdp_huge_get_and_clear+0x60/0x80
[  246.441294] [c000203835603800] [c00000000034598c] zap_huge_pmd+0x8c/0x540
[  246.441356] [c0002038356038a0] [c0000000002df120] unmap_page_range+0xe80/0x10a0
[  246.441430] [c0002038356039e0] [c0000000002df834] unmap_vmas+0x84/0x100
[  246.441491] [c000203835603a30] [c0000000002edf1c] exit_mmap+0x9c/0x1c0
[  246.441553] [c000203835603af0] [c0000000000ffd28] mmput+0xb8/0x1f0
[  246.441614] [c000203835603b20] [c00000000010a1c8] do_exit+0x358/0xcd0
[  246.441676] [c000203835603be0] [c00000000010ac14] do_group_exit+0x64/0x100
[  246.441738] [c000203835603c20] [c00000000011aba0] get_signal+0x210/0x700
[  246.441799] [c000203835603d10] [c00000000001cb00] do_signal+0x80/0x2e0
[  246.441861] [c000203835603e00] [c00000000001cf14] do_notify_resume+0xd4/0x100
[  246.441935] [c000203835603e30] [c00000000000bf44] ret_from_except_lite+0x70/0x74
[  426.487634] INFO: rcu_sched self-detected stall on CPU
[  426.487722]  74-....: (24004 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=11606 
[  426.487733]   (t=24004 jiffies g=2031 c=2030 q=119140)
[  426.487791] NMI backtrace for cpu 74
[  426.487830] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[  426.487915] Call Trace:

P:S: with powersave=off kernel commandline no issues seen.

cdeadmin commented 6 years ago

------- Comment From seg@us.ibm.com 2018-02-19 08:52:54 EDT------- OPAL team, please have a look.

cdeadmin commented 6 years ago

==== State: Rejected by: dcrowell on 19 February 2018 13:18:55 ====

I have tried to read the pointed-at bug and I can't make any sense of it so I have no idea what the fix is that is being referred to. It looks like there have been several iterations of different problems described. Please summarize what the expected fix is and reopen. Or better yet, just dupe this directly.

cdeadmin commented 6 years ago

------- Comment From satheera@in.ibm.com 2018-02-27 09:57:06 EDT------- I am able to reproduce the similar issue with DD2.2 aswell with the latest FW with powersave enabled and with powersave disabled(i.e powersave=off kernel cmdline) this issue is not seen.

Host Env $cupdcmd -f ...

Current Side Driver:.....fips910/b0223b_1810.910 Non-Current Side Driver:.fips910/b0215b_1808.910

Model: 2.2 (pvr 004e 1202) Model name: POWER9 (raw), altivec supported

Host kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le

kernel cmdline: root=/dev/mapper/host_os_9-root ro console=tty0 rd.lvm.lv=host_os_9/root rd.lvm.lv=host_os_9/swap LANG=en_US.UTF-8

Steps to reproduce.

Boot a kvm Guest and do a simple test(copy a file from host to guest , do an yum update) =======> Host suddenly freezes and reboots

FSP Event/error logs has this entry then, 5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?

Platform Event Log - 5302D258 Created at : 02/27/2018 14:17:03 Subsystem : CEC Hardware Subsystem Event Severity : Unrecoverable Error Action Flags : Report Externally Action Status : Reported to Opr Sys

Primary System Reference Code

Reference Code : BB82C013 Hex Words 2 - 5 : 00000080 00000000 00000000 00000000 Hex Words 6 - 9 : 00000000 00000000 00000000 00000000

Log Hex Dump 00000000 50480030 01004F43 20180227 14170300 00000010 20180227 14170430 4B000106 00000000 00000020 00000000 00000000 B0000002 5302D258 00000030 55480018 01004F43 50004000 00000000 00000040 00002000 01004500 50530050 01004F43 00000050 02000008 00000048 00000080 00000000 00000060 00000000 00000000 00000000 00000000 00000070 00000000 00000000 42423832 43303133 00000080 20202020 20202020 20202020 00000090 20202020 20202020 4548004C 01004F43 000000A0 39303038 2D32324C 31333642 45444100 000000B0 00000000 00000000 00000000 00000000 000000C0 00000000 00000000 00000000 00000000 000000D0 00000000 00000000 20180227 14170300 000000E0 00000000 4D54001C 01004F43 39303038 000000F0 2D32324C 31333642 45444100 00000000 00000100 5544004D 01004F43 44455343 00450000 00000110 4F43433A 20496E69 7469616C 697A6174 00000120 696F6E20 6F6E2061 6C6C2063 68697073 00000130 20646964 206E6F74 20636F6D 706C6574 00000140 65287469 6D656420 6F757429 0A PH.0..OC ..'....
..'...0K.......
............S..X
UH....OCP.@.....
.. ...E.PS.P..OC
.......H........
................
........BB82C013
20202020
EH.L..OC
9008-22L136BEDA.
................
................
........ ..'....
....MT....OC9008
-22L136BEDA.....
UD.M..OCDESC.E..
OCC: Initializat
ion on all chips
did not complet
e(timed out).

Regards, -Satheesh

cdeadmin commented 6 years ago

------- Comment From satheera@in.ibm.com 2018-02-27 10:08:25 EDT------- I just now noticed the issue after a few test runs inside host even with powersave disabled aswell. During the issue occurred, host was idle though, not running any tests/load.

FSP log event/errors histroy: Log ID Time Failing subsystem Severity SRC Reported 5302D602 2018-02-27 15:17:40 CEC Hardware Subsystem Unrecoverable Error BB82C013 ? ------------------------------------------------------------------------------------------------------------------->powersave=off 5002D4A0 2018-02-27 15:11:42 Service Processor Firmware Unrecoverable Error B1813450 ? 5002D295 2018-02-27 14:21:32 OCC Unrecoverable Error, Degraded Performance B18B2616 ? 5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ? 5302D218 2018-02-27 14:12:48 CEC Hardware Subsystem Unrecoverable Error BB82C013 ? 5002D0A4 2018-02-27 14:07:06 Service Processor Firmware Unrecoverable Error B1813450 ? 5002CE92 2018-02-27 13:16:56 OCC Unrecoverable Error, Degraded Performance B18B2616 ?

Regards, -Satheesh

cdeadmin commented 6 years ago

------- Comment From npiggin@au1.ibm.com 2018-02-27 19:27:03 EDT------- Are you getting the same kernel logs with powersave=off? Or is this a new bug?

If the kernel messages include "smp_call_function_many", then we need to get traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic should crash the system when you get a lockup and give traces.

cdeadmin commented 6 years ago

------- Comment From satheera@in.ibm.com 2018-02-28 06:11:34 EDT------- (In reply to comment #16) > Are you getting the same kernel logs with powersave=off? Or is this a new > bug? > > If the kernel messages include "smp_call_function_many", then we need to get > traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic should > crash the system when you get a lockup and give traces.

Looks like this is different issue, am not completely sure though, host does not have any traces at all, it dies suddenly and FSP error log has the below issue reported, looked to checkstop?? , I just wanted reuse the FW bug which got mirrored already,

FSP Event/error logs has this entry then, 5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?

Platform Event Log - 5302D258 Created at : 02/27/2018 14:17:03 Subsystem : CEC Hardware Subsystem Event Severity : Unrecoverable Error Action Flags : Report Externally Action Status : Reported to Opr Sys

Regards, -Satheesh.

cdeadmin commented 6 years ago

------- Comment From bssrikanth@in.ibm.com 2018-03-06 02:27:12 EDT------- This bug is gating us from using the ZZ small core DD 2.2 for KVM testing. Please help resolve asap.

cdeadmin commented 6 years ago

------- Comment From satheera@in.ibm.com 2018-03-08 03:54:01 EDT------- Am able to reproduce the same(In reply to comment #16) > Are you getting the same kernel logs with powersave=off? Or is this a new > bug? > > If the kernel messages include "smp_call_function_many", then we need to get > traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic should > crash the system when you get a lockup and give traces.

yes, looks like new bug, I will close this issue and open a new one for the issue, even the latest FW reproduces it.

Regards, -Satheesh

------- Comment From satheera@in.ibm.com 2018-03-08 03:55:46 EDT------- Initial issue reported in the bz is not seen.

open-power-host-os / linux

Hardlockup during a VM boot on zz with D2.0 with latest devel branch #25