open-power / op-build

Buildroot overlay for Open Power
GNU General Public License v2.0

With just stop states 0, 1 and 2 enabled, hitting special wakeup timeouts and hard lockups with upstream code (DD2.1, P9DSU). #1926

Closed pridhiviraj closed 6 years ago

pridhiviraj commented 6 years ago
 Petitboot (v1.7.0-pf2406aa)                  9006-12C         C819UAF32B00499
 ──────────────────────────────────────────────────────────────────────────────

  System information
  System configuration
  System status log
  Language
  Rescan devices
  Retrieve config from URL
  Plugins (0)
 *Exit to shell           

 ──────────────────────────────────────────────────────────────────────────────
 Enter=accept, e=edit, n=new, x=exit, l=language, g=log, h=help
 [enP2p1s0f0] Configuring with DHCP
[   24.632157] Watchdog CPU:18 detected Hard LOCKUP other CPUS:72-79
[   98.157373068,3] Could not set special wakeup on 0:22: timeout waiting for SPECIAL_WKUP_DONE.
[   99.245895158,3] Could not set special wakeup on 0:22: timeout waiting for SPECIAL_WKUP_DONE.
[  100.334073125,3] Could not set special wakeup on 0:22: timeout waiting for SPECIAL_WKUP_DONE.
[  101.423082654,3] Could not set special wakeup on 0:22: timeout waiting for SPECIAL_WKUP_DONE.
[  103.003299454,3] Could not set special wakeup on 0:23: timeout waiting for SPECIAL_WKUP_DONE.
[  104.104799233,3] Could not set special wakeup on 0:23: timeout waiting for SPECIAL_WKUP_DONE.
[  105.207714046,3] Could not set special wakeup on 0:23: timeout waiting for SPECIAL_WKUP_DONE.
[  106.311920491,3] Could not set special wakeup on 0:23: timeout waiting for SPECIAL_WKUP_DONE.
[   34.139206] Kernel panic - not syncing: Hard LOCKUP
[   34.139543] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 4.15.6-openpower1 #3
[   34.139809] AAC0: kernel 2.99-0[16] Apr 13 2017
[   34.139810] AAC0: monitor 0.0-0[0]
[   34.139811] AAC0: bios 0.13-209[32000]
[   34.139812] AAC0: serial 10F447
[   34.139813] AAC0: Non-DASD support enabled.
[   34.139815] AAC0: 64bit support enabled.
[   34.139817] aacraid 0003:01:00.0: Using 64-bit DMA iommu bypass
[   34.139822] aacraid 0003:01:00.0: 64 Bit DAC enabled
[   34.142311] Call Trace:
[   34.142628] [c000000ff62f3610] [c0000000005f5000] dump_stack+0x9c/0xd0 (unreliable)
[   34.143008] [c000000ff62f3650] [c000000000073154] panic+0x124/0x300
[   34.143389] [c000000ff62f36f0] [c000000000072d08] nmi_panic+0x58/0x7c
[   34.143780] [c000000ff62f3750] [c00000000001fa50] wd_timer_fn+0x200/0x2ec
[   34.144179] [c000000ff62f3810] [c0000000000cb0e0] call_timer_fn+0x30/0x90
[   34.144588] [c000000ff62f38a0] [c0000000000cb228] expire_timers+0xe8/0xf8
[   34.144969] scsi host0: aacraid
[   34.145434] [c000000ff62f3900] [c0000000000cb388] run_timer_softirq+0x150/0x1a4
[   34.145907] [c000000ff62f3990] [c00000000060b428] __do_softirq+0x240/0x29c
[   34.146389] [c000000ff62f3a80] [c000000000077020] irq_exit+0x88/0xe0
[   34.146885] [c000000ff62f3aa0] [c000000000018dc0] timer_interrupt+0xac/0xc0
[   34.147387] [c000000ff62f3ad0] [c000000000009078] decrementer_common+0x128/0x130
[   34.147903] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
[   34.147903]     LR = arch_local_irq_restore+0x5c/0x80
[   34.148956] [c000000ff62f3dc0] [7fffffffffffffff] 0x7fffffffffffffff (unreliable)
[   34.149535] [c000000ff62f3de0] [c00000000051cf14] cpuidle_enter_state+0x1a4/0x210
[   34.150136] [c000000ff62f3e30] [c0000000000ab438] call_cpuidle+0x6c/0x74
[   34.150729] [c000000ff62f3e50] [c0000000000ab6e0] do_idle+0x1f0/0x204
[   34.151333] [c000000ff62f3ec0] [c0000000000ab880] cpu_startup_entry+0x30/0x34
[   34.151961] [c000000ff62f3ef0] [c00000000002b1e0] start_secondary+0x364/0x440
[   34.152611] [c000000ff62f3f90] [c00000000000ab6c] start_secondary_prolog+0x10/0x14
[   35.142140] Watchdog CPU:3 Hard LOCKUP
[   35.142141] Modules linked in: ofpart powernv_flash mtd i40e aacraid ast
[   35.142149] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.15.6-openpower1 #3
[   35.142152] NIP:  c000000000029d3c LR: c0000000000e04d0 CTR: c000000000029cf8
[   35.142154] REGS: c00000003ffdbd80 TRAP: 0900   Not tainted  (4.15.6-openpower1)
[   35.142154] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28004228  XER: 00000000
[  119.010314546,5] OPAL: Reboot request...
[   35.142162] CFAR: c000000000029d3c SOFTE: 0 
[   35.142162] GPR00: c00000000002a6e0 c000000ff62b79e0 c000000001c20900 0000000000000000 
[   35.142162] GPR04: 0000000000000003 00000007ef667b39 0000000000573f6e 0000000000000000 
[   35.142162] GPR08: ffff0ffffff4f050 0000000000000000 0000000000000000 0000000000000000 
[   35.142162] GPR12: c000000000029cf8 c00000000fdc0d80 c000000ff62b7f90 0000000000000000 
[   35.142162] GPR16: 0000000000000000 c000000000029c58 c000000000029c30 c000000001c600d4 
[   35.142162] GPR20: 0000000000000000 000000000000000c c000000ff62b4000 c000000001c60044 
[   35.142162] GPR24: 0000000000000008 c000000ff62b4000 0000000000000000 c000000ffbe2fd08 
[   35.142162] GPR28: 0000000ffa2b0000 0000000000000000 0000000000000000 c000000ffbe339a0 
[   35.142192] NIP [c000000000029d3c] stop_this_cpu+0x44/0x48
[   35.142196] LR [c0000000000e04d0] flush_smp_call_function_queue+0x164/0x17c
[   35.142196] Call Trace:
[   35.142198] [c000000ff62b79e0] [c000000ff62b7a40] 0xc000000ff62b7a40 (unreliable)
[   35.142202] [c000000ff62b7a60] [c00000000002a6e0] smp_ipi_demux_relaxed+0x48/0x98
[   35.142205] [c000000ff62b7aa0] [c0000000000287c4] doorbell_exception+0x80/0xa4
[   35.142209] [c000000ff62b7ad0] [c000000000009ad8] h_doorbell_common+0x128/0x130
[   35.142214] --- interrupt: e81 at replay_interrupt_return+0x0/0x4
[   35.142214]     LR = arch_local_irq_restore+0x5c/0x80
[   35.142215] [c000000ff62b7dc0] [7fffffffffffffff] 0x7fffffffffffffff (unreliable)
[   35.142220] [c000000ff62b7de0] [c00000000051cf14] cpuidle_enter_state+0x1a4/0x210
[   35.142222] [c000000ff62b7e30] [c0000000000ab438] call_cpuidle+0x6c/0x74
[   35.142225] [c000000ff62b7e50] [c0000000000ab6e0] do_idle+0x1f0/0x204
[   35.142228] [c000000ff62b7ec0] [c0000000000ab880] cpu_startup_entry+0x30/0x34
[   35.142231] [c000000ff62b7ef0] [c00000000002b1e0] start_secondary+0x364/0x440
[   35.142234] [c000000ff62b7f90] [c00000000000ab6c] start_secondary_prolog+0x10/0x14
[   35.142235] Instruction dump:
[   35.142237] 5549ecf8 7d294214 554806be 39400001 7d4a4036 7d0048a8 7d085078 7d0049ad 
[   35.142242] 40c2fff4 39400000 892d027a 994d027a <48000000> 3c4c01bf 38426bc0 7c0802a6 
[   35.415896] Rebooting in 10 seconds..
[  119.114612384,5] RESET: Initiating fast reboot 1...
[  119.168426283,3] Could not set special wakeup on 0:22: timeout waiting for SPECIAL_WKUP_DONE.
[  119.172214011,5] RESET: Fast reboot failed to prepare secondaries for system reset

--== Welcome to Hostboot hostboot-8ea7d7e/hbicore.bin ==--

  3.01653|secure|SecureROM valid - enabling functionality
 14.79394|secure|Booting in non-secure mode.
 15.50732|Ignoring boot flags, incorrect version 0x0
 15.51402|Booting from SBE side 0 on master proc=00050000
 15.61530|ISTEP  6. 5 - host_init_fsi
 15.70136|ISTEP  6. 6 - host_set_ipl_parms
 15.76915|ISTEP  6. 7 - host_discover_targets
 16.25748|HWAS|PRESENT> DIMM[03]=A0A0000000000000
 16.25748|HWAS|PRESENT> Proc[05]=8800000000000000
 16.25749|HWAS|PRESENT> Core[07]=3F3FFF3F3FFF0000
 16.27655|ISTEP  6. 8 - host_update_master_tpm
 23.71749|SECURE|Security Access Bit> 0x0000000000000000
 23.71750|SECURE|Secure Mode Disable (via Jumper)> 0xC000000000000000
 23.71802|ISTEP  6. 9 - host_gard
 23.74410|HWAS|FUNCTIONAL> DIMM[03]=A0A0000000000000
 23.74411|HWAS|FUNCTIONAL> Proc[05]=8800000000000000

Stop states enabled:

[   50.570588] Watchdog CPU:62 detected Hard LOCKUP other CPUS:120-127
[  124.266115959,3] Could not set special wakeup on 8:14: timeout waiting for SPECIAL_WKUP_DONE.
[  125.356436607,3] Could not set special wakeup on 8:14: timeout waiting for SPECIAL_WKUP_DONE.
/ # cat /sys/firmware/opal/msglog | grep -i enabling
[   60.482968300,5] SLW: Enabling: stop0_lite
[   60.484384023,5] SLW: Enabling: stop0
[   60.486510011,5] SLW: Enabling: stop1_lite
[   60.488639595,5] SLW: Enabling: stop1
[   60.490767713,5] SLW: Enabling: stop2_lite
[   60.492897688,5] SLW: Enabling: stop2

System is a P9DSU system.
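
For triage, the OPAL message log above can be summarized with a few lines of Python. This is a minimal sketch, not part of the original report: the function name `summarize_msglog` is hypothetical, and the two numbers in the "special wakeup" message are assumed (not confirmed in this thread) to identify a chip and a core. It uses `search` rather than `match` so that lines with stray console characters in front of the timestamp are still counted.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this issue.
WAKEUP_RE = re.compile(
    r"Could not set special wakeup on (\d+):(\d+): "
    r"timeout waiting for SPECIAL_WKUP_DONE"
)
ENABLE_RE = re.compile(r"SLW: Enabling: (\S+)")


def summarize_msglog(text):
    """Return (enabled stop states, timeout count per (first, second) ID pair).

    The ID pair is assumed to be (chip, core); that reading is a guess.
    """
    timeouts = Counter()
    enabled = []
    for line in text.splitlines():
        m = WAKEUP_RE.search(line)
        if m:
            timeouts[(int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = ENABLE_RE.search(line)
        if m and m.group(1) not in enabled:
            enabled.append(m.group(1))
    return enabled, timeouts


if __name__ == "__main__":
    # Path taken from the command shown in the report.
    with open("/sys/firmware/opal/msglog") as f:
        states, counts = summarize_msglog(f.read())
    print("enabled:", states)
    for ids, n in sorted(counts.items()):
        print(ids, n)
```

Running it on the host prints the enabled stop states once each, followed by one line per ID pair with the number of timeouts seen.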

pridhiviraj commented 6 years ago

@stewart-ibm We are hitting this issue with the latest code.

 Product Name          : OpenPOWER Firmware
 Product Version       : open-power-p9dsu-v1.21-28-gcd1f3d0-dirty
 Product Extra         :    buildroot-2017.11.2-8-g4b6188e
 Product Extra         :    skiboot-v5.10
 Product Extra         :    hostboot-fed203b
 Product Extra         :    linux-4.15.6-openpower1-pb642a39
 Product Extra         :    petitboot-v1.7.0-pf2406aa
 Product Extra         :    machine-xml-fb5f933
 Product Extra         :    occ-bf6e716
 Product Extra         :    hostboot-binarie

And the system is at DD2.10 level.

pridhiviraj commented 6 years ago

Tested on Witherspoon systems with two variants, DD2.1 and DD2.2; both work fine.

 cat /var/lib/phosphor-software-manager/pnor/ro/VERSION 
open-power-witherspoon-v1.21-28-gcd1f3d0-dirty
    buildroot-2017.11.2-8-g4b6188e
    skiboot-v5.10
    hostboot-fed203b
    linux-4.15.6-openpower1-pd8cd6c0
    petitboot-v1.7.0-p7cfd0fc
    machine-xml-6ca015d-pcea6bdc
    occ-bf6e716
    hostboot-binaries-f9351db
    capp-ucode-p9-dd2-v3
    sbe-9b78381

Over-enthusiastic commented 6 years ago

@pridhiviraj https://github.com/open-power/op-build/issues/1926#issuecomment-369219777 The hostboot-binaries version is truncated; it should include a commit ID. Can you paste it again?

ghost commented 6 years ago

@Over-enthusiastic it's this one (we shorten the git SHAs in VERSION to try and get as much to appear over IPMI as possible)

Merge: 5ae6a9240ae2 945eaa0acf01
Author: Corey Swenson <cswenson@us.ibm.com>
Date:   Fri Feb 23 17:17:26 2018 -0600

    Merge pull request #63 from cvswen/hcode_update_911

    Update HCODE image to hw022318a.911
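
Because the version list can be cut off when read over IPMI, a quick check for cut-off entries can save a round trip. The sketch below is illustrative only: the function name `parse_product_extra` is hypothetical, and the component list is an assumption taken from the entries pasted in this thread, not an official manifest. It only catches entries cut off inside the component name (such as `hostboot-binarie`); it cannot tell whether a SHA itself was shortened further.

```python
# Component names as they appear in the version dumps quoted in this issue
# (an assumed list; adjust for the platform at hand).
COMPONENTS = (
    "buildroot", "skiboot", "hostboot", "linux", "petitboot",
    "machine-xml", "occ", "hostboot-binaries", "capp-ucode", "sbe",
)


def parse_product_extra(lines, components=COMPONENTS):
    """Split entries into {component: version} and a list of truncated entries."""
    versions, truncated = {}, []
    # Try longer names first so "hostboot-binaries" wins over "hostboot".
    longest_first = sorted(components, key=len, reverse=True)
    for raw in lines:
        # Accept both "Product Extra : foo-123" and bare "foo-123" lines.
        entry = raw.rpartition(":")[2].strip()
        if not entry:
            continue
        # An entry that is only a prefix of "<name>-" carries no version.
        if any((name + "-").startswith(entry) for name in components):
            truncated.append(entry)
            continue
        for name in longest_first:
            if entry.startswith(name + "-"):
                versions[name] = entry[len(name) + 1:]
                break
    return versions, truncated
```

Feeding it the `Product Extra` lines from the P9DSU comment above reports `hostboot-binarie` as truncated, while `hostboot-fed203b` is still parsed as the `hostboot` component.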

harish-24 commented 6 years ago

Hitting Hard LOCKUP on a DD 2.2 system with 2/27 PNOR.

Firmware Revision : 01.15     IP address : 009.040.193.153
Firmware Build Time : 20180209     BMC MAC address : 0c:c4:7a:f4:4d:7c
PNOR Build Time : 20180227    
CPLD Version : B2.91.00

The following is observed after the machine was booted and a P8 compat guest was started with <vcpu placement='static'>8</vcpu> and

  <cpu mode='host-model' check='partial'>
    <model fallback='allow'>power8</model>
    <topology sockets='1' cores='2' threads='4'/>
  </cpu>
[  922.638176] virbr0: port 3(vnet1) entered forwarding state
[  922.638214] virbr0: topology change detected, propagating
[  922.638254] virbr0: port 4(vnet2) entered forwarding state
[  922.638290] virbr0: topology change detected, propagating
[   13.446174948,3] CHIPTOD: Chip to TB timeout
[   13.446174948,3] CHIPTOD: Resync failed ! TFMR=0x2812200960a24000
[   13.446174948,3] CHIPTOD: OPAL: Resync timebase failed on CPU 0x0037
[ 1089.494077434,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 1089.494079017,7] HMI: [Loc: UOPWR.BOS0027-Node0-Proc0]: P:0 C:13 T:2: TFMR(2812000870e04000) Timer Facility Error
[ 1089.494083561,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 1089.494085046,7] HMI: [Loc: UOPWR.BOS0027-Node0-Proc0]: P:0 C:13 T:1: TFMR(2812000870e04000) Timer Facility Error
[  932.743407] Severe Hypervisor Maintenance interrupt [Recovered]
[  932.743471]  Error detail: Timer facility experienced an error
[  932.743512]  HMER: 0840000000000000
[  932.743531]  TFMR: 2812200960a24000
[ 2013.243221] Watchdog CPU:34 Hard LOCKUP
[ 2013.243222] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun devlink ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6_tables iptable_filter kvm_hv kvm rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm i40iw ib_core ses enclosure scsi_transport_sas sg shpchp ibmpowernv at24 uio_pdrv_genirq ofpart powernv_flash ipmi_powernv uio mtd i2c_opal opal_prd ip_tables xfs libcrc32c sd_mod nvidia_drm(POE) nvidia_modeset(POE)
[ 2013.243271]  nvidia(POE) ast i2c_algo_bit drm_kms_helper ttm syscopyarea sysfillrect sysimgblt fb_sys_fops drm i40e ipmi_devintf i2c_core ipmi_msghandler aacraid ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 2013.243286] CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: P           OE  ------------   4.14.0-43.el7a.ppc64le #1
[ 2013.243288] task: c000000ff5681700 task.stack: c000000ff5634000
[ 2013.243289] NIP:  c0000000009d3a00 LR: c0000000009d0768 CTR: c0000000009d3a00
[ 2013.243291] REGS: c000000007e0fd80 TRAP: 0900   Tainted: P           OE  ------------    (4.14.0-43.el7a.ppc64le)
[ 2013.243292] MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002224  XER: 00000000
[ 2013.243299] CFAR: c0000000009d0764 SOFTE: 0 
[ 2013.243299] GPR00: c0000000009d0730 c000000ff5637db0 c0000000014c7d00 c000000ffb4662d8 
[ 2013.243299] GPR04: c0000000013d0d20 0000000000000000 0000000000000002 0000000000000000 
[ 2013.243299] GPR08: 00000012bf400f81 0000000000000001 c0000000009d3a00 0000000000000f58 
[ 2013.243299] GPR12: c0000000009d3a00 c000000007a37600 0000000000000800 c000200fff6e4608 
[ 2013.243299] GPR16: 0000000200000000 c000000001045280 0000000000000000 c0000000013d0d20 
[ 2013.243299] GPR20: c000000ffb4662d8 0000000000000000 c000000ff5634080 c000000ff5634080 
[ 2013.243299] GPR24: c000000ff5634080 0000000000000000 000001d4bea58848 c0000000013d0d20 
[ 2013.243299] GPR28: c0000000013d0d38 0000000000000000 c000000001502354 c000000ffb4662d8 
[ 2013.243323] NIP [c0000000009d3a00] snooze_loop+0x0/0x1a0
[ 2013.243325] LR [c0000000009d0768] cpuidle_enter_state+0xc8/0x460
[ 2013.243325] Call Trace:
[ 2013.243327] [c000000ff5637db0] [c0000000009d0730] cpuidle_enter_state+0x90/0x460 (unreliable)
[ 2013.243331] [c000000ff5637e10] [c0000000001b5df0] do_idle+0x330/0x3c0
[ 2013.243335] [c000000ff5637ea0] [c0000000001b6078] cpu_startup_entry+0x38/0x40
[ 2013.243338] [c000000ff5637ed0] [c0000000000587c8] start_secondary+0x688/0x710
[ 2013.243341] [c000000ff5637f90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
[ 2013.243342] Instruction dump:
[ 2013.243344] 7fe3fb78 4bffb7b5 60000000 4bffb74d 60000000 38600000 38210030 e8010010 
[ 2013.243348] ebe1fff8 7c0803a6 4e800020 60420000 <3c4c00af> 38424300 7c0802a6 fba1ffe8 
[ 2013.243356] Sending NMI from CPU 34 to CPUs 0-33,35-159:
[  937.371728] NMI backtrace for cpu 0
[  937.371730] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: P           OE  ------------   4.14.0-43.el7a.ppc64le #1
[  937.371732] task: c000000001422280 task.stack: c0000000014c0000
[  937.371733] NIP:  c0000000000bb9c8 LR: c0000000000bb9c8 CTR: c000000000008000
[  937.371734] REGS: c0000000014c3bf0 TRAP: 0100   Tainted: P           OE  ------------    (4.14.0-43.el7a.ppc64le)
[  937.371734] MSR:  9000000000001033 <SF,HV,ME,IR,DR,RI,LE>  CR: 24002822  XER: 00000000
[  937.371738] CFAR: c0000000014c3de0 SOFTE: 0 
[  937.371738] GPR00: c0000000000bb9c8 c0000000014c3d50 c0000000014c7d00 c0000000014c3bf0 
[  937.371738] GPR04: b000000000001033 c0000000000bb9ac 0000000024002824 0000000000000000 
[  937.371738] GPR08: 0000000000000000 00000000000000ff 0000000000000010 000000000000a916 
[  937.371738] GPR12: 9000000000121033 c000000007a20000 0000000000000000 0000000000000000 
[  937.371738] GPR16: 0000000000000000 0000000000000000 0000000000000000 c0000000013d0d20 
[  937.371738] GPR20: c000000ffabe62d8 0000000000000000 c0000000014c0080 c0000000014c0080 
[  937.371738] GPR24: c0000000014c0080 0000000000000006 000000da3f8827f8 c0000000013d0d20 
[  937.371738] GPR28: c0000000013d0f78 0000000000000006 0000000000000000 9000000000121033 
[  937.371755] NIP [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[  937.371756] LR [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[  937.371757] Call Trace:
[  937.371758] [c0000000014c3d50] [c0000000000bb9c8] power9_idle_type+0x78/0xa0 (unreliable)
[  937.371761] [c0000000014c3d80] [c0000000009d3cb0] stop_loop+0x40/0x5c
[  937.371762] [c0000000014c3db0] [c0000000009d0768] cpuidle_enter_state+0xc8/0x460
[  937.371765] [c0000000014c3e10] [c0000000001b5df0] do_idle+0x330/0x3c0
[  937.371767] [c0000000014c3ea0] [c0000000001b607c] cpu_startup_entry+0x3c/0x40
[  937.371768] [c0000000014c3ed0] [c00000000000d158] rest_init+0xe8/0x100
[  937.371771] [c0000000014c3f00] [c000000000f643b8] start_kernel+0x554/0x570
[  937.371773] [c0000000014c3f90] [c00000000000ab7c] start_here_common+0x1c/0x520
[  937.371774] Instruction dump:
[  937.371775] 4bf62e89 60000000 7fdffb78 7fe3fb78 4bf7e53d 60000000 7c7f1b78 4bf62e2d 
[  937.371777] 60000000 7fe9fb78 7d234b78 4bf5aee5 <60000000> 38210030 e8010010 ebc1fff0 
[  937.371892] NMI backtrace for cpu 1
[  937.371895] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: P           OE  ------------   4.14.0-43.el7a.ppc64le #1
[  937.371896] task: c000000ff54d1400 task.stack: c000000ff5530000
[  937.371897] NIP:  c0000000000bb9c8 LR: c0000000000bb9c8 CTR: c000000000008000
[  937.371898] REGS: c000000ff5533bf0 TRAP: 0100   Tainted: P           OE  ------------    (4.14.0-43.el7a.ppc64le)
[  937.371898] MSR:  9000000000001033 <SF,HV,ME,IR,DR,RI,LE>  CR: 22002222  XER: 00000000
[  937.371903] CFAR: c000000ff5533de0 SOFTE: 0 
[  937.371903] GPR00: c0000000000bb9c8 c000000ff5533d50 c0000000014c7d00 c000000ff5533bf0 
[  937.371903] GPR04: b000000000001033 c0000000000bb9ac 0000000022002224 0000000000000040 
[  937.371903] GPR08: 0000000000000000 00000000000000ff 0000000000000010 c00800000c140ad0 
[  937.371903] GPR12: 9000000000121033 c000000007a20b00 0000000000000800 c000200fff6daa08 
[  937.371903] GPR16: 0000000000000001 c000000001045280 0000000000000000 c0000000013d0d20 
[  937.371903] GPR20: c000000ffac262d8 0000000000000000 c000000ff5530080 c000000ff5530080 
[  937.371903] GPR24: c000000ff5530080 0000000000000002 000000d9f7661324 c0000000013d0d20 
[  937.371903] GPR28: c0000000013d0df8 0000000000000002 0000000000000000 9000000000121033 
[  937.371922] NIP [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[  937.371923] LR [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[  937.371923] Call Trace:
[  937.371925] [c000000ff5533d50] [c0000000000bb9c8] power9_idle_type+0x78/0xa0 (unreliable)
[  937.371927] [c000000ff5533d80] [c0000000009d3cb0] stop_loop+0x40/0x5c
[  937.371929] [c000000ff5533db0] [c0000000009d0768] cpuidle_enter_state+0xc8/0x460
[  937.371931] [c000000ff5533e10] [c0000000001b5df0] do_idle+0x330/0x3c0
[  937.371934] [c000000ff5533ea0] [c0000000001b6078] cpu_startup_entry+0x38/0x40
[  937.371936] [c000000ff5533ed0] [c0000000000587c8] start_secondary+0x688/0x710
[  937.371938] [c000000ff5533f90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
[  937.371939] Instruction dump:
[  937.371940] 4bf62e89 60000000 7fdffb78 7fe3fb78 4bf7e53d 60000000 7c7f1b78 4bf62e2d 
[  937.371943] 60000000 7fe9fb78 7d234b78 4bf5aee5 <60000000> 38210030 e8010010 ebc1fff0 
[  937.372060] NMI backtrace for cpu 2
[  937.372062] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: P           OE  ------------   4.14.0-43.el7a.ppc64le #1
[  937.372063] task: c000000ff54d2b00 task.stack: c000000ff5534000
[  937.372064] NIP:  c0000000000bb9c8 LR: c0000000000bb9c8 CTR: c000000000008000
[  937.372065] REGS: c000000ff5537bf0 TRAP: 0100   Tainted: P           OE  ------------    (4.14.0-43.el7a.ppc64le)
[  937.372066] MSR:  9000000000001033 <SF,HV,ME,IR,DR,RI,LE>  CR: 22002222  XER: 00000000
[  937.372070] CFAR: c000000ff5537de0 SOFTE: 1 
[  937.372070] GPR00: c0000000000bb9c8 c000000ff5537d50 c0000000014c7d00 c000000ff5537bf0 
[  937.372070] GPR04: b000000000001033 c0000000000bb9ac 0000000022002224 0000000000000000 
[  937.372070] GPR08: 0000000000000000 00000000000000ff 0000000000000010 0000000000000017 
[  937.372070] GPR12: 9000000002923033 c000000007a21600 0000000000000800 c000200fff6daa08 
[  937.372070] GPR16: 0000000000000002 c000000001045280 0000000000000000 c0000000013d0d20 
[  937.372070] GPR20: c000000ffac662d8 0000000000000000 c000000ff5534080 c000000ff5534080 
[  937.372070] GPR24: c000000ff5534080 0000000000000006 000000da3f894042 c0000000013d0d20 
[  937.372070] GPR28: c0000000013d0f78 0000000000000006 0000000000000000 9000000002923033 
[  937.372088] NIP [c0000000000bb9c8] power9_idle_type+0x78/0xa0
[  937.372089] LR [c0000000000bb9c8] power9_idle_type+0x78/0xa0

pridhiviraj commented 6 years ago

@harish-24 I think this issue looks similar to the internal BZ 164068. Please see the comments from mikey there; the patch below is missing from your distro kernel.


  commit d075745d893c78730e4a3b7a60fca23c2f764081
  Author: Paul Mackerras <paulus@ozlabs.org>
  Date:   Wed Jan 17 20:51:13 2018 +1100
  KVM: PPC: Book3S HV: Improve handling of debug-trigger HMIs on POWER9

harish-24 commented 6 years ago

@pridhiviraj Thanks for the pointer.

madscientist159 commented 6 years ago

Also seeing this on DD2.2, Talos platform. Had to disable the problematic stop states in the machine XML as a workaround.
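
For anyone who cannot rebuild firmware, a different approach from the machine XML change described above is to turn off idle states at runtime through the Linux cpuidle sysfs files (`cpuN/cpuidle/stateM/name` and `.../disable`). The following is a sketch under assumptions: the function name `disable_idle_states` is hypothetical, and the state names passed in are assumed to match what the `name` files report on the running kernel, so check those first. Writing to `disable` needs root and does not persist across reboots.

```python
from pathlib import Path


def disable_idle_states(names, sysfs_root="/sys/devices/system/cpu"):
    """Write 1 to cpuidle 'disable' for every per-CPU state whose name is in names.

    Returns the list of state directories that were changed.
    """
    changed = []
    pattern = "cpu[0-9]*/cpuidle/state[0-9]*"
    for state_dir in sorted(Path(sysfs_root).glob(pattern)):
        name = (state_dir / "name").read_text().strip()
        if name in names:
            (state_dir / "disable").write_text("1")
            changed.append(str(state_dir))
    return changed


if __name__ == "__main__":
    # Assumed names, taken from the "SLW: Enabling" lines earlier in this issue.
    for path in disable_idle_states({"stop1", "stop2"}):
        print("disabled", path)
```

The `sysfs_root` parameter exists so the function can be pointed at a scratch directory for a dry run before touching a live system.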

ghost commented 6 years ago

With current op-build, we've brought in enough bug fixes in hcode that this shouldn't be a problem anymore.

The only known issues are around OCC reset and specific DD2.1 parts (and these two issues should be addressed in the next day or so).

As such, I'll close this issue and we can re-open if observed with current code.