Closed: pridhiviraj closed this issue 6 years ago
@stewart-ibm We are hitting this issue with the latest code.
Product Name : OpenPOWER Firmware
Product Version : open-power-p9dsu-v1.21-28-gcd1f3d0-dirty
Product Extra : buildroot-2017.11.2-8-g4b6188e
Product Extra : skiboot-v5.10
Product Extra : hostboot-fed203b
Product Extra : linux-4.15.6-openpower1-pb642a39
Product Extra : petitboot-v1.7.0-pf2406aa
Product Extra : machine-xml-fb5f933
Product Extra : occ-bf6e716
Product Extra : hostboot-binarie
The system is at DD2.10 level.
Tested on Witherspoon systems with both DD2.1 and DD2.2 variants; it works fine there.
cat /var/lib/phosphor-software-manager/pnor/ro/VERSION
open-power-witherspoon-v1.21-28-gcd1f3d0-dirty
buildroot-2017.11.2-8-g4b6188e
skiboot-v5.10
hostboot-fed203b
linux-4.15.6-openpower1-pd8cd6c0
petitboot-v1.7.0-p7cfd0fc
machine-xml-6ca015d-pcea6bdc
occ-bf6e716
hostboot-binaries-f9351db
capp-ucode-p9-dd2-v3
sbe-9b78381
@pridhiviraj In https://github.com/open-power/op-build/issues/1926#issuecomment-369219777 the hostboot-binaries version is truncated. It should include a commit id. Can you paste it again?
@Over-enthusiastic It's this one (we shorten the git SHAs in VERSION to try to get as much as possible to appear over IPMI):
Merge: 5ae6a9240ae2 945eaa0acf01
Author: Corey Swenson <cswenson@us.ibm.com>
Date: Fri Feb 23 17:17:26 2018 -0600
Merge pull request #63 from cvswen/hcode_update_911
Update HCODE image to hw022318a.911
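A quick way to spot a cut-off entry like the `hostboot-binarie` line discussed above is to check whether a line stops inside a known component name, or right after it with no version. This is only a minimal sketch: the component list is copied from the VERSION output in this thread, the helper name `find_suspect_entries` is hypothetical and not part of op-build, and the check does not detect truncation inside the version or commit id itself.

```python
# Component names as they appear in the VERSION listing in this thread.
KNOWN_COMPONENTS = [
    "open-power", "buildroot", "skiboot", "hostboot", "linux", "petitboot",
    "machine-xml", "occ", "hostboot-binaries", "capp-ucode", "sbe",
]

def find_suspect_entries(lines):
    """Return entries that end inside a component name or have no version suffix."""
    suspects = []
    for raw in lines:
        # Accept both bare VERSION lines and "Product Extra : value" lines.
        entry = raw.split(":", 1)[-1].strip()
        if not entry:
            continue
        # "hostboot-binarie" is a prefix of "hostboot-binaries-", so it is flagged;
        # "hostboot-fed203b" is not a prefix of any "<name>-", so it passes.
        if any((name + "-").startswith(entry) for name in KNOWN_COMPONENTS):
            suspects.append(entry)
    return suspects
```

Called on the `Product Extra` lines from the first comment, this would flag only the `hostboot-binarie` entry.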
Hitting a Hard LOCKUP on a DD2.2 system with the 2/27 PNOR.
Firmware Revision : 01.15
IP address : 009.040.193.153
Firmware Build Time : 20180209
BMC MAC address : 0c:c4:7a:f4:4d:7c
PNOR Build Time : 20180227
CPLD Version : B2.91.00
The following was observed after the machine was booted and a P8 compat-mode guest was started with <vcpu placement='static'>8</vcpu>
and
<cpu mode='host-model' check='partial'>
<model fallback='allow'>power8</model>
<topology sockets='1' cores='2' threads='4'/>
</cpu>
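As a sanity check on the guest definition above, the vCPU count should equal sockets * cores * threads (1 * 2 * 4 = 8 here). A minimal sketch using Python's standard `xml.etree.ElementTree`; the wrapping `<domain>` element and the helper name `topology_matches` are assumptions added for illustration, since only the two fragments were posted.

```python
import xml.etree.ElementTree as ET

# The two fragments from this report, wrapped in an assumed <domain> root.
DOMAIN_XML = """
<domain>
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'>power8</model>
    <topology sockets='1' cores='2' threads='4'/>
  </cpu>
</domain>
"""

def topology_matches(xml_text):
    """Return True if <vcpu> equals sockets * cores * threads from <topology>."""
    root = ET.fromstring(xml_text.strip())
    vcpus = int(root.findtext("vcpu"))
    topo = root.find("cpu/topology")
    product = int(topo.get("sockets")) * int(topo.get("cores")) * int(topo.get("threads"))
    return vcpus == product

print(topology_matches(DOMAIN_XML))  # True
```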
[ 922.638176] virbr0: port 3(vnet1) entered forwarding state
[ 922.638214] virbr0: topology change detected, propagating
[ 922.638254] virbr0: port 4(vnet2) entered forwarding state
[ 922.638290] virbr0: topology change detected, propagating
[ 13.446174948,3] CHIPTOD: Chip to TB timeout
[ 13.446174948,3] CHIPTOD: Resync failed ! TFMR=0x2812200960a24000
[ 13.446174948,3] CHIPTOD: OPAL: Resync timebase failed on CPU 0x0037
[ 1089.494077434,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 1089.494079017,7] HMI: [Loc: UOPWR.BOS0027-Node0-Proc0]: P:0 C:13 T:2: TFMR(2812000870e04000) Timer Facility Error
[ 1089.494083561,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 1089.494085046,7] HMI: [Loc: UOPWR.BOS0027-Node0-Proc0]: P:0 C:13 T:1: TFMR(2812000870e04000) Timer Facility Error
[ 932.743407] Severe Hypervisor Maintenance interrupt [Recovered]
[ 932.743471] Error detail: Timer facility experienced an error
[ 932.743512] HMER: 0840000000000000
[ 932.743531] TFMR: 2812200960a24000
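POWER register documentation numbers bits from the most significant end (bit 0 is the MSB), so it can help to list the set bits of the HMER value in that convention before looking them up. A minimal sketch; the helper name `ibm_bits_set` is hypothetical, and the mapping from bit positions to names should be taken from the skiboot or kernel headers rather than from this snippet.

```python
def ibm_bits_set(value, width=64):
    """Return set bit positions in IBM (MSB-0) numbering for a register value."""
    return [bit for bit in range(width) if value & (1 << (width - 1 - bit))]

hmer = 0x0840000000000000  # HMER value reported in the log above
print(ibm_bits_set(hmer))  # [4, 9]
```

The same helper can be applied to the two TFMR values in the log to see which bits differ between them.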
[ 2013.243221] Watchdog CPU:34 Hard LOCKUP
[ 2013.243222] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun devlink ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6_tables iptable_filter kvm_hv kvm rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm i40iw ib_core ses enclosure scsi_transport_sas sg shpchp ibmpowernv at24 uio_pdrv_genirq ofpart powernv_flash ipmi_powernv uio mtd i2c_opal opal_prd ip_tables xfs libcrc32c sd_mod nvidia_drm(POE) nvidia_modeset(POE)
[ 2013.243271] nvidia(POE) ast i2c_algo_bit drm_kms_helper ttm syscopyarea sysfillrect sysimgblt fb_sys_fops drm i40e ipmi_devintf i2c_core ipmi_msghandler aacraid ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 2013.243286] CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: P OE ------------ 4.14.0-43.el7a.ppc64le #1
[ 2013.243288] task: c000000ff5681700 task.stack: c000000ff5634000
[ 2013.243289] NIP: c0000000009d3a00 LR: c0000000009d0768 CTR: c0000000009d3a00
[ 2013.243291] REGS: c000000007e0fd80 TRAP: 0900 Tainted: P OE ------------ (4.14.0-43.el7a.ppc64le)
[ 2013.243292] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002224 XER: 00000000
[ 2013.243299] CFAR: c0000000009d0764 SOFTE: 0
[ 2013.243299] GPR00: c0000000009d0730 c000000ff5637db0 c0000000014c7d00 c000000ffb4662d8
[ 2013.243299] GPR04: c0000000013d0d20 0000000000000000 0000000000000002 0000000000000000
[ 2013.243299] GPR08: 00000012bf400f81 0000000000000001 c0000000009d3a00 0000000000000f58
[ 2013.243299] GPR12: c0000000009d3a00 c000000007a37600 0000000000000800 c000200fff6e4608
[ 2013.243299] GPR16: 0000000200000000 c000000001045280 0000000000000000 c0000000013d0d20
[ 2013.243299] GPR20: c000000ffb4662d8 0000000000000000 c000000ff5634080 c000000ff5634080
[ 2013.243299] GPR24: c000000ff5634080 0000000000000000 000001d4bea58848 c0000000013d0d20
[ 2013.243299] GPR28: c0000000013d0d38 0000000000000000 c000000001502354 c000000ffb4662d8
[ 2013.243323] NIP [c0000000009d3a00] snooze_loop+0x0/0x1a0
[ 2013.243325] LR [c0000000009d0768] cpuidle_enter_state+0xc8/0x460
[ 2013.243325] Call Trace:
[ 2013.243327] [c000000ff5637db0] [c0000000009d0730] cpuidle_enter_state+0x90/0x460 (unreliable)
[ 2013.243331] [c000000ff5637e10] [c0000000001b5df0] do_idle+0x330/0x3c0
[ 2013.243335] [c000000ff5637ea0] [c0000000001b6078] cpu_startup_entry+0x38/0x40
[ 2013.243338] [c000000ff5637ed0] [c0000000000587c8] start_secondary+0x688/0x710
[ 2013.243341] [c000000ff5637f90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
[ 2013.243342] Instruction dump:
[ 2013.243344] 7fe3fb78 4bffb7b5 60000000 4bffb74d 60000000 38600000 38210030 e8010010
[ 2013.243348] ebe1fff8 7c0803a6 4e800020 60420000 <3c4c00af> 38424300 7c0802a6 fba1ffe8
[ 2013.243356] Sending NMI from CPU 34 to CPUs 0-33,35-159:
[ 937.371728] NMI backtrace for cpu 0
[ 937.371730] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: P OE ------------ 4.14.0-43.el7a.ppc64le #1
[ 937.371732] task: c000000001422280 task.stack: c0000000014c0000
[ 937.371733] NIP: c0000000000bb9c8 LR: c0000000000bb9c8 CTR: c000000000008000
[ 937.371734] REGS: c0000000014c3bf0 TRAP: 0100 Tainted: P OE ------------ (4.14.0-43.el7a.ppc64le)
[ 937.371734] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR: 24002822 XER: 00000000
[ 937.371738] CFAR: c0000000014c3de0 SOFTE: 0
[ 937.371738] GPR00: c0000000000bb9c8 c0000000014c3d50 c0000000014c7d00 c0000000014c3bf0
[ 937.371738] GPR04: b000000000001033 c0000000000bb9ac 0000000024002824 0000000000000000
[ 937.371738] GPR08: 0000000000000000 00000000000000ff 0000000000000010 000000000000a916
[ 937.371738] GPR12: 9000000000121033 c000000007a20000 0000000000000000 0000000000000000
[ 937.371738] GPR16: 0000000000000000 0000000000000000 0000000000000000 c0000000013d0d20
[ 937.371738] GPR20: c000000ffabe62d8 0000000000000000 c0000000014c0080 c0000000014c0080
[ 937.371738] GPR24: c0000000014c0080 0000000000000006 000000da3f8827f8 c0000000013d0d20
[ 937.371738] GPR28: c0000000013d0f78 0000000000000006 0000000000000000 9000000000121033
[ 937.371755] NIP [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[ 937.371756] LR [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[ 937.371757] Call Trace:
[ 937.371758] [c0000000014c3d50] [c0000000000bb9c8] power9_idle_type+0x78/0xa0 (unreliable)
[ 937.371761] [c0000000014c3d80] [c0000000009d3cb0] stop_loop+0x40/0x5c
[ 937.371762] [c0000000014c3db0] [c0000000009d0768] cpuidle_enter_state+0xc8/0x460
[ 937.371765] [c0000000014c3e10] [c0000000001b5df0] do_idle+0x330/0x3c0
[ 937.371767] [c0000000014c3ea0] [c0000000001b607c] cpu_startup_entry+0x3c/0x40
[ 937.371768] [c0000000014c3ed0] [c00000000000d158] rest_init+0xe8/0x100
[ 937.371771] [c0000000014c3f00] [c000000000f643b8] start_kernel+0x554/0x570
[ 937.371773] [c0000000014c3f90] [c00000000000ab7c] start_here_common+0x1c/0x520
[ 937.371774] Instruction dump:
[ 937.371775] 4bf62e89 60000000 7fdffb78 7fe3fb78 4bf7e53d 60000000 7c7f1b78 4bf62e2d
[ 937.371777] 60000000 7fe9fb78 7d234b78 4bf5aee5 <60000000> 38210030 e8010010 ebc1fff0
[ 937.371892] NMI backtrace for cpu 1
[ 937.371895] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: P OE ------------ 4.14.0-43.el7a.ppc64le #1
[ 937.371896] task: c000000ff54d1400 task.stack: c000000ff5530000
[ 937.371897] NIP: c0000000000bb9c8 LR: c0000000000bb9c8 CTR: c000000000008000
[ 937.371898] REGS: c000000ff5533bf0 TRAP: 0100 Tainted: P OE ------------ (4.14.0-43.el7a.ppc64le)
[ 937.371898] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR: 22002222 XER: 00000000
[ 937.371903] CFAR: c000000ff5533de0 SOFTE: 0
[ 937.371903] GPR00: c0000000000bb9c8 c000000ff5533d50 c0000000014c7d00 c000000ff5533bf0
[ 937.371903] GPR04: b000000000001033 c0000000000bb9ac 0000000022002224 0000000000000040
[ 937.371903] GPR08: 0000000000000000 00000000000000ff 0000000000000010 c00800000c140ad0
[ 937.371903] GPR12: 9000000000121033 c000000007a20b00 0000000000000800 c000200fff6daa08
[ 937.371903] GPR16: 0000000000000001 c000000001045280 0000000000000000 c0000000013d0d20
[ 937.371903] GPR20: c000000ffac262d8 0000000000000000 c000000ff5530080 c000000ff5530080
[ 937.371903] GPR24: c000000ff5530080 0000000000000002 000000d9f7661324 c0000000013d0d20
[ 937.371903] GPR28: c0000000013d0df8 0000000000000002 0000000000000000 9000000000121033
[ 937.371922] NIP [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[ 937.371923] LR [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[ 937.371923] Call Trace:
[ 937.371925] [c000000ff5533d50] [c0000000000bb9c8] power9_idle_type+0x78/0xa0 (unreliable)
[ 937.371927] [c000000ff5533d80] [c0000000009d3cb0] stop_loop+0x40/0x5c
[ 937.371929] [c000000ff5533db0] [c0000000009d0768] cpuidle_enter_state+0xc8/0x460
[ 937.371931] [c000000ff5533e10] [c0000000001b5df0] do_idle+0x330/0x3c0
[ 937.371934] [c000000ff5533ea0] [c0000000001b6078] cpu_startup_entry+0x38/0x40
[ 937.371936] [c000000ff5533ed0] [c0000000000587c8] start_secondary+0x688/0x710
[ 937.371938] [c000000ff5533f90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
[ 937.371939] Instruction dump:
[ 937.371940] 4bf62e89 60000000 7fdffb78 7fe3fb78 4bf7e53d 60000000 7c7f1b78 4bf62e2d
[ 937.371943] 60000000 7fe9fb78 7d234b78 4bf5aee5 <60000000> 38210030 e8010010 ebc1fff0
[ 937.372060] NMI backtrace for cpu 2
[ 937.372062] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: P OE ------------ 4.14.0-43.el7a.ppc64le #1
[ 937.372063] task: c000000ff54d2b00 task.stack: c000000ff5534000
[ 937.372064] NIP: c0000000000bb9c8 LR: c0000000000bb9c8 CTR: c000000000008000
[ 937.372065] REGS: c000000ff5537bf0 TRAP: 0100 Tainted: P OE ------------ (4.14.0-43.el7a.ppc64le)
[ 937.372066] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR: 22002222 XER: 00000000
[ 937.372070] CFAR: c000000ff5537de0 SOFTE: 1
[ 937.372070] GPR00: c0000000000bb9c8 c000000ff5537d50 c0000000014c7d00 c000000ff5537bf0
[ 937.372070] GPR04: b000000000001033 c0000000000bb9ac 0000000022002224 0000000000000000
[ 937.372070] GPR08: 0000000000000000 00000000000000ff 0000000000000010 0000000000000017
[ 937.372070] GPR12: 9000000002923033 c000000007a21600 0000000000000800 c000200fff6daa08
[ 937.372070] GPR16: 0000000000000002 c000000001045280 0000000000000000 c0000000013d0d20
[ 937.372070] GPR20: c000000ffac662d8 0000000000000000 c000000ff5534080 c000000ff5534080
[ 937.372070] GPR24: c000000ff5534080 0000000000000006 000000da3f894042 c0000000013d0d20
[ 937.372070] GPR28: c0000000013d0f78 0000000000000006 0000000000000000 9000000002923033
[ 937.372088] NIP [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[ 937.372089] LR [c0000000000bb9c8] power9_idle_type+0x78/0xa0
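With 160 CPUs, the NMI backtrace dump is long, and most CPUs appear to be sitting in the same idle path. A minimal sketch for condensing such a dump into one symbol per CPU, assuming the log format shown above (`NMI backtrace for cpu N` followed later by a `NIP [addr] symbol+0x...` line); the function names are hypothetical.

```python
import re
from collections import Counter

NMI_RE = re.compile(r"NMI backtrace for cpu (\d+)")
# Matches "NIP [c0000000000bb9c8] power9_idle_type+0x78/0xa0",
# but not the register line "NIP: c0000000000bb9c8 LR: ...".
NIP_RE = re.compile(r"NIP \[[0-9a-f]+\] (\w+)\+0x")

def summarize_nmi(log_text):
    """Map each CPU in an NMI backtrace dump to the symbol its NIP points at."""
    cpu_to_symbol = {}
    current = None
    for line in log_text.splitlines():
        match = NMI_RE.search(line)
        if match:
            current = int(match.group(1))
            continue
        match = NIP_RE.search(line)
        if match and current is not None:
            cpu_to_symbol[current] = match.group(1)
            current = None  # ignore further lines until the next CPU header
    return cpu_to_symbol

def count_by_symbol(cpu_to_symbol):
    """Count how many CPUs were found at each symbol."""
    return Counter(cpu_to_symbol.values())
```

Any CPU whose symbol is not an idle routine is then a natural place to start reading the full backtrace.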
@harish-24 I think this issue looks similar to internal BZ 164068. Please see mikey's comments there; the patch below appears to be missing from your distro kernel.
commit d075745d893c78730e4a3b7a60fca23c2f764081
Author: Paul Mackerras <paulus@ozlabs.org>
Date: Wed Jan 17 20:51:13 2018 +1100
KVM: PPC: Book3S HV: Improve handling of debug-trigger HMIs on POWER9
@pridhiviraj Thanks for the pointer.
Also seeing this on DD2.2, Talos platform. Had to disable the problematic stop states in the machine XML as a workaround.
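The machine XML change is a firmware-side workaround. Separately from that, the Linux cpuidle sysfs interface exposes each CPU's idle states with a `name` file and a `disable` file, which can help confirm which stop states the kernel actually sees. A minimal read-only sketch, assuming the standard `/sys/devices/system/cpu/cpuN/cpuidle/stateM/` layout; the helper name `list_idle_states` is hypothetical, and this is not the machine XML method described above.

```python
from pathlib import Path

def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return (state_dir, name, disabled) tuples for one CPU's cpuidle states."""
    cpuidle = Path(cpu_dir) / "cpuidle"
    # Sort numerically so state10 comes after state2.
    state_dirs = sorted(cpuidle.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        disabled = (state / "disable").read_text().strip() != "0"
        states.append((state.name, name, disabled))
    return states
```

On a host, `list_idle_states()` with no arguments reads the states for cpu0; passing another directory makes the function easy to exercise against a copied or mocked sysfs tree.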
With current op-build, we've brought in enough hcode bug fixes that this shouldn't be a problem anymore.
The only known issues are around OCC reset and specific DD2.1 parts (and both should be addressed in the next day or so).
As such, I'll close this issue and we can re-open if observed with current code.
Stop states enabled:
System is a P9DSU system.