linuxppc / issues

Issues repository for linuxppc
5 stars 0 forks source link

system hit a checkstop due to MCE during petitboot kernel boot when nr_cpus=3 #148

Open pridhiviraj opened 6 years ago

pridhiviraj commented 6 years ago

--== Welcome to Hostboot  ==--

  2.76615|secure|SecureROM valid - enabling functionality
  5.46073|Ignoring boot flags, incorrect version 0x0
  5.46840|Booting from SBE side 0 on master proc=00050000
  5.66132|ISTEP  6. 5 - host_init_fsi
  5.76601|ISTEP  6. 6 - host_set_ipl_parms
  6.01433|ISTEP  6. 7 - host_discover_targets
  7.80979|HWAS|PRESENT> DIMM[03]=F0F0000000000000
  7.80980|HWAS|PRESENT> Proc[05]=8800000000000000
  7.80982|HWAS|PRESENT> Core[07]=03FFCFCFCF0F0000
  8.00786|ISTEP  6. 8 - host_update_master_tpm
 15.56296|SECURE|Security Access Bit> 0xC000000000000000
 15.56297|SECURE|Secure Mode Disable (via Jumper)> 0x0000000000000000
 15.56311|ISTEP  6. 9 - host_gard
 16.12306|================================================
 16.12745|Error reported by prdf (0xE500) PLID 0x90000032
 16.12746|  PRD Signature            : 0x70025 0xDD3F000E
 16.13031|  Signature Description    : pu.core:k0:n0:s0:p01:c13 (COREFIR[14]) Machine check and ME = 0 Err
 16.13195|  UserData1   : 0x0007002500000101
 16.13196|  UserData2   : 0xdd3f000e00000000
 16.13197|------------------------------------------------
 16.13353|  Callout type             : Procedure Callout
 16.13354|  Procedure                : UNKNOWN: 0x11
 16.13355|  Priority                 : SRCI_PRIORITY_HIGH
 16.13356|------------------------------------------------
 16.13357|  Callout type             : Hardware Callout
 16.13357|  CPU id                   : 24
 16.13359|  Target                   : Physical:/Sys0/Node0/Proc1/EQ3/EX0/Core1
 16.13360|  Deconfig State           : NO_DECONFIG
 16.13361|  GARD Error Type          : GARD_NULL
 16.13362|  Priority                 : SRCI_PRIORITY_MED
 16.13362|------------------------------------------------
 16.13367|  
 16.13367|------------------------------------------------
 16.13368|  System checkstop occurred during runtime on previous boot
 16.13369|------------------------------------------------
 16.13369|  �
 16.13370|------------------------------------------------
 16.13371|  Hostboot Build ID: 
 16.13371|================================================
 16.13378|HWAS|FUNCTIONAL> DIMM[03]=F0F0000000000000
 16.13379|HWAS|FUNCTIONAL> Proc[05]=8800000000000000
 16.13380|HWAS|FUNCTIONAL> Core[07]=03FFCFCFCF0F0000
 16.41318|ISTEP  6.10 - host_revert_sbe_mcs_setup
 16.42341|ISTEP  6.11 - host_start_occ_xstop_handler
 17.51026|ISTEP  6.12 - host_voltage_config
 17.60302|ISTEP  7. 1 - mss_attr_cleanup
 18.01373|ISTEP  7. 2 - mss_volt
 18.10398|ISTEP  7. 3 - mss_freq
 19.38833|ISTEP  7. 4 - mss_eff_config
 21.10017|ISTEP  7. 5 - mss_attr_update
 21.12123|ISTEP  8. 1 - host_slave_sbe_config
 21.40689|ISTEP  8. 2 - host_setup_sbe
 21.41015|ISTEP  8. 3 - host_cbs_start
 21.43300|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
 26.45495|ISTEP  8. 5 - host_attnlisten_proc
 26.45855|ISTEP  8. 6 - host_p9_fbc_eff_config
 26.46474|ISTEP  8. 7 - host_p9_eff_config_links
 26.47603|ISTEP  8. 8 - proc_attr_update
 26.48060|ISTEP  8. 9 - proc_chiplet_fabric_scominit
 26.51916|ISTEP  8.10 - proc_xbus_scominit
 27.56821|ISTEP  8.11 - proc_xbus_enable_ridi
 27.57552|ISTEP  8.12 - host_set_voltages
 27.65234|ISTEP  9. 1 - fabric_erepair
 27.74205|ISTEP  9. 2 - fabric_io_dccal
 28.48444|ISTEP  9. 3 - fabric_pre_trainadv
 28.49189|ISTEP  9. 4 - fabric_io_run_training
 28.64331|ISTEP  9. 5 - fabric_post_trainadv
 28.64944|ISTEP  9. 6 - proc_smp_link_layer
 28.66016|ISTEP  9. 7 - proc_fab_iovalid
 28.96421|ISTEP  9. 8 - host_fbc_eff_config_aggregate
 28.97591|ISTEP 10. 1 - proc_build_smp
 29.11943|ISTEP 10. 2 - host_slave_sbe_update
 31.69852|ISTEP 10. 4 - proc_cen_ref_clk_enable
 31.82063|ISTEP 10. 5 - proc_enable_osclite
 31.82627|ISTEP 10. 6 - proc_chiplet_scominit
 31.89222|ISTEP 10. 7 - proc_abus_scominit
 31.90718|ISTEP 10. 8 - proc_obus_scominit
 31.92951|ISTEP 10. 9 - proc_npu_scominit
 31.93994|ISTEP 10.10 - proc_pcie_scominit
 34.09939|ISTEP 10.11 - proc_scomoverride_chiplets
 34.10498|ISTEP 10.12 - proc_chiplet_enable_ridi
 34.11154|ISTEP 10.13 - host_rng_bist
 34.11805|ISTEP 10.14 - host_update_redundant_tpm
 34.12650|ISTEP 11. 1 - host_prd_hwreconfig
 34.15393|ISTEP 11. 2 - cen_tp_chiplet_init1
 34.18230|ISTEP 11. 3 - cen_pll_initf
 34.18880|ISTEP 11. 4 - cen_pll_setup
 34.19749|ISTEP 11. 5 - cen_tp_chiplet_init2
 34.20466|ISTEP 11. 6 - cen_tp_arrayinit
 34.21128|ISTEP 11. 7 - cen_tp_chiplet_init3
 34.22032|ISTEP 11. 8 - cen_chiplet_init
 34.22730|ISTEP 11. 9 - cen_arrayinit
 34.23551|ISTEP 11.10 - cen_initf
 34.24155|ISTEP 11.11 - cen_do_manual_inits
 34.24999|ISTEP 11.12 - cen_startclocks
 34.25652|ISTEP 11.13 - cen_scominits
 34.26516|ISTEP 12. 1 - mss_getecid
 34.61529|ISTEP 12. 2 - dmi_attr_update
 34.62502|ISTEP 12. 3 - proc_dmi_scominit
 34.63408|ISTEP 12. 4 - cen_dmi_scominit
 34.64087|ISTEP 12. 5 - dmi_erepair
 34.64887|ISTEP 12. 6 - dmi_io_dccal
 34.97767|ISTEP 12. 7 - dmi_pre_trainadv
 34.98467|ISTEP 12. 8 - dmi_io_run_training
 35.00534|ISTEP 12. 9 - dmi_post_trainadv
 35.01277|ISTEP 12.10 - proc_cen_framelock
 35.01997|ISTEP 12.11 - host_startprd_dmi
 35.02640|ISTEP 12.12 - host_attnlisten_memb
 35.04340|ISTEP 12.13 - cen_set_inband_addr
 35.06724|ISTEP 13. 1 - host_disable_memvolt
 35.69214|ISTEP 13. 2 - mem_pll_reset
 35.71877|ISTEP 13. 3 - mem_pll_initf
 35.77015|ISTEP 13. 4 - mem_pll_setup
 35.80080|ISTEP 13. 6 - mem_startclocks
 35.82222|ISTEP 13. 7 - host_enable_memvolt
 35.83065|ISTEP 13. 8 - mss_scominit
 36.18359|ISTEP 13. 9 - mss_ddr_phy_reset
 36.28323|ISTEP 13.10 - mss_draminit
 36.52400|ISTEP 13.11 - mss_draminit_training
 38.17485|ISTEP 13.12 - mss_draminit_trainadv
 38.46931|ISTEP 13.13 - mss_draminit_mc
 38.50024|ISTEP 14. 1 - mss_memdiag
 50.40218|ISTEP 14. 2 - mss_thermal_init
 50.45997|ISTEP 14. 3 - proc_pcie_config
 50.49402|ISTEP 14. 4 - mss_power_cleanup
 50.50144|ISTEP 14. 5 - proc_setup_bars
 50.55536|ISTEP 14. 6 - proc_htm_setup
 50.56959|ISTEP 14. 7 - proc_exit_cache_contained
 50.59354|ISTEP 15. 1 - host_build_stop_image
 53.40735|ISTEP 15. 2 - proc_set_pba_homer_bar
 53.42443|ISTEP 15. 3 - host_establish_ex_chiplet
 53.48407|ISTEP 15. 4 - host_start_stop_engine
 53.51799|ISTEP 16. 1 - host_activate_master
 54.69093|ISTEP 16. 2 - host_activate_slave_cores
 54.91665|ISTEP 16. 3 - host_secure_rng
 54.93196|ISTEP 16. 4 - mss_scrub
 54.96972|ISTEP 16. 5 - host_ipl_complete
 55.07807|ISTEP 16. 6 - host_load_io_ppe
 55.15680|ISTEP 18.11 - proc_tod_setup
 55.25262|ISTEP 18.12 - proc_tod_init
 55.27529|ISTEP 20. 1 - host_load_payload
 55.90239|ISTEP 20. 2 - host_load_hdat
 59.21047|ISTEP 21. 1 - host_runtime_setup
 67.47014|htmgt|OCCs are now running in ACTIVE state
 71.44499|ISTEP 21. 2 - host_verify_hdat
 71.45181|ISTEP 21. 3 - host_start_payload
[   71.417768216,5] OPAL v6.0-rc1-32-gfbca0a0 starting...
[   71.417771727,7] initial console log level: memory 7, driver 5
[   71.417774031,6] CPU: P9 generation processor (max 4 threads/core)
[   71.417775927,7] CPU: Boot CPU PIR is 0x0038 PVR is 0x004e1202
[   71.417778717,7] OPAL table: 0x30103430 .. 0x301039e0, branch table: 0x30002000
[   71.417781746,7] Assigning physical memory map table for nimbus
[   71.417784310,7] Parsing HDAT...
[   71.417785684,5] SPIRA-S found.
[   71.417788227,6] BMC #0: HW version 3, SW version 2, chip DD1.0
[   71.418066972,6] SP Family is ibm,ast2500,supermicro
[   71.418072916,7] LPC: IOPATH chip id = 0
[   71.418074271,7] LPC: FW BAR       = f0000000
[   71.418075808,7] LPC: MEM BAR      = e0000000
[   71.418077283,7] LPC: IO BAR       = d0010000
[   71.418078735,7] LPC: Internal BAR = c0012000
[   71.418090131,7] LPC UART: base addr = 3f8 (3f8) size = 1 clk = 1843200, baud = 115200
[   71.418092995,7] LPC: BT [0, 0] sms_int: 0, bmc_int: 0
[   71.467037986,5] UART: Using UART at 0x60300d00103f8
[   71.472527404,5] P9 DD2.20 detected
[   71.474284391,5] CHIP: Chip ID 0000 type: P9N DD2.2
[   71.476632016,5] P9 DD2.20 detected
[   71.478952733,5] CHIP: Chip ID 0008 type: P9N DD2.2
[   71.481951977,5] PLAT: Using Supe%=��    ��S&���&N�0662,7] LPC: Routing irq 9, policy: 0 (r=1)
[   71.483802473,7] LPC: SerIRQ 9 using route 0 targetted at OPAL
[   71.489623658,5] PLAT: Detected p9dsu platform
[   71.491943520,5] PLAT: Detected BMC platform SMC
[   72.010781309,5] CPU: All 128 processors called in...
[   74.176423316,5] FLASH: Found system flash:  id:0
[   74.178227179,5] BT: Interface initialized, IO 0x00e4
[   74.181044642,7] LPC: Routing irq 10, policy: 0 (r=1)
[   74.181045712,7] LPC: SerIRQ 10 using route 1 targetted at OPAL
[   75.292043373,3] STB: container NOT VERIFIED, resource_id=4 secureboot not yet initialized
[   76.080067212,5] NVRAM: Size is 576 KB
[   76.281790451,5] STB: Found ibm,secureboot-v2
[   76.283223304,5] STB: secure mode on
[   76.285357402,5] STB: trusted mode off
[   76.287477357,5] OPAL: Using OPAL UART console
[   76.287490186,7] LPC-MBOX: Sending BMC interrupt
[   76.291743322,7] LPC: Routing irq 4, policy: 0 (r=1)
[   76.291744458,7] LPC: SerIRQ 4 using route 2 targetted at OPAL
[   76.297575054,5] SLW: Enabling: stop0_lite
[   76.299546680,5] SLW: Enabling: stop0
[   76.301675145,5] SLW: Enabling: stop1_lite
[   76.303100475,5] SLW: Enabling: stop1
[   76.305229101,5] SLW: Enabling: stop2_lite
[   76.307358172,5] SLW: Enabling: stop2
[   76.309484945,5] SLW: Disabling: stop4 in OPAL
[   76.311617268,5] SLW: Disabling: stop5 in OPAL
[   76.314965026,5] VAS: Initialized chip 0
[   76.317756017,5] VAS: Initialized chip 8
[   76.319591732,5] Found ibm,power9-nx
[   76.321021666,5] Found ibm,power9-nx
[   77.010145901,5] STB: IMA_CATALOG verified
[   77.356183539,5] STB: CAPP verified
[   77.383143878,5] PCI: Resetting PHBs and training links...
[   79.393835751,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[   79.393900827,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[   79.393902006,7] PHB#0030[8:0]: LINK: DLP train control: 0x00001c0101000000
[   79.394424743,7] PHB#0030[8:0]: CRESET: Starts
[   79.399599202,7] PHB#0030[8:0]: CRESET: No pending transactions
[   79.404732570,7] PHB#0030[8:0]: CRESET: Reinitialization
[   79.404733832,7] PHB#0030[8:0]: Initializing PHB4...
[   79.404735790,7] PHB#0030[8:0]: Default system config: 0x4410000040400000
[   79.404737573,7] PHB#0030[8:0]: New system config    : 0x4410000040400000
[   79.404739308,7] PHB#0030[8:0]: Initial PHB CRESET is 0xe000000000000000
[   79.404741286,7] PHB#0030[8:0]: Waiting for DLP PG reset to complete...
[   79.405264120,7] PHB#0030[8:0]: Reset state SRC_ID: 0ff8000000000000
[   79.405752916,7] PHB#0030[8:0]: Initialization complete
[   82.081404139,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[   82.120351916,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[   84.288880779,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[   84.322361354,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[   86.490706276,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[   86.493218979,3] PHB#0030[8:0]: Electrical link detected but won't train
[   87.012539973,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[   87.014905200,3] PHB#0000[0:0]: Electrical link detected but won't train
[   87.019683290,5] PCI: Probing slots...
[   87.077628203,5] PHB#0000:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..ff SLOT=UIO Slot1 
[   87.081504252,5] PHB#0001:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..05 SLOT=PLX 
[   87.085058181,5] PHB#0001:01:00.0 [SWUP] 10b5 8725 R:ca C:060400 B:02..05 SLOT=PLX up 
[   87.089319493,5] PHB#0001:02:01.0 [SWDN] 10b5 8725 R:ca C:060400 B:03..03 SLOT=UIO Slot2 
[   87.093571812,5] PHB#0001:03:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=UIO Slot2
[   87.097837634,5] PHB#0001:02:08.0 [SWDN] 10b5 8725 R:ca C:060400 B:04..04 SLOT=PLX switch 
[   87.102098088,5] PHB#0001:02:09.0 [SWDN] 10b5 8725 R:ca C:060400 B:05..05 SLOT=Onboard LAN 
[   87.106359425,5] PHB#0001:05:00.0 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[   87.111323320,5] PHB#0001:05:00.1 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[   87.116292443,5] PHB#0001:05:00.2 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[   87.120557576,5] PHB#0001:05:00.3 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[   87.125523657,5] PHB#0001:01:00.1 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.129785306,5] PHB#0001:01:00.2 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.134750664,5] PHB#0001:01:00.3 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.139014182,5] PHB#0001:01:00.4 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.143981025,5] PHB#0002:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=UIO Slot3 
[   87.147536675,5] PHB#0002:01:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=UIO Slot3
[   87.152504558,5] PHB#0003:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Onboard SAS 
[   87.156763378,5] PHB#0003:01:00.0 [EP  ] 9005 028d R:01 C:010700 (           sas) LOC_CODE=Onboard SAS
[   87.161732274,5] PHB#0004:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..02 SLOT=Onboard BMC 
[   87.165991086,5] PHB#0004:01:00.0 [ETOX] 1a03 1150 R:04 C:060400 B:02..02 LOC_CODE=Onboard BMC
[   87.170257034,5] PHB#0004:02:00.0 [PCID] 1a03 2000 R:41 C:030000 (           vga) LOC_CODE=Onboard BMC
[   87.175222002,5] PHB#0005:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Onboard USB 
[   87.179481797,5] PHB#0005:01:00.0 [EP  ] 104c 8241 R:02 C:0c0330 (      usb-xhci) LOC_CODE=Onboard USB
[   87.183745945,5] PHB#0030:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..ff SLOT=WIO Slot3 
[   87.188006980,5] PHB#0031:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=WIO-R Slot 
[   87.192264389,5] PHB#0031:01:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=WIO-R Slot
[   87.197233087,5] PHB#0032:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..09 SLOT=PLX 
[   87.200788457,5] PHB#0032:01:00.0 [SWUP] 10b5 8725 R:ca C:060400 B:02..09 SLOT=PLX up 
[   87.205049827,5] PHB#0032:02:01.0 [SWDN] 10b5 8725 R:ca C:060400 B:03..03 SLOT=WIO Slot1 
[   87.209302341,5] PHB#0032:03:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=WIO Slot1
[   87.213568605,5] PHB#0032:02:08.0 [SWDN] 10b5 8725 R:ca C:060400 B:04..04 SLOT=PLX switch 
[   87.217828632,5] PHB#0032:02:09.0 [SWDN] 10b5 8725 R:ca C:060400 B:05..09 SLOT=WIO Slot2 
[   87.222084774,5] PHB#0032:01:00.1 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.227048612,5] PHB#0032:01:00.2 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.231312318,5] PHB#0032:01:00.3 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.235574375,5] PHB#0032:01:00.4 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[   87.240541250,5] PHB#0033:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=WIO Slot4 
[   87.244805930,5] PHB#0033:01:00.0 [EP  ] 11f8 f117 R:06 C:010802 (  mass-storage) LOC_CODE=WIO Slot4
[   92.105189705,5] STB: BOOTKERNEL verified
[   92.469235099,5] IPMI: Resetting boot count on successful boot
[   92.471641936,5] INIT: Waiting for kernel...
[   92.473774250,5] INIT: 64-bit LE kernel discovered
[   92.476615512,5] OCC: All Chip Rdy after 0 ms
[   92.505777691,5] INIT: Starting kernel at 0x20011000, fdt at 0x30a96f00 364423 bytes
[   94.174959491,5] OPAL: Switch to little-endian OS

--== Welcome to Hostboot  ==--

  2.76615|secure|SecureROM valid - enabling functionality
  5.44276|Ignoring boot flags, incorrect version 0x0
  5.44635|Booting from SBE side 0 on master proc=00050000
  5.64320|ISTEP  6. 5 - host_init_fsi
  5.74952|ISTEP  6. 6 - host_set_ipl_parms
  6.02270|ISTEP  6. 7 - host_discover_targets
  7.81123|HWAS|PRESENT> DIMM[03]=F0F0000000000000
  7.81124|HWAS|PRESENT> Proc[05]=8800000000000000
  7.81125|HWAS|PRESENT> Core[07]=03FFCFCFCF0F0000
  8.04136|ISTEP  6. 8 - host_update_master_tpm
 15.64617|SECURE|Security Access Bit> 0xC000000000000000
 15.64618|SECURE|Secure Mode Disable (via Jumper)> 0x0000000000000000
 15.64628|ISTEP  6. 9 - host_gard
 16.23569|================================================
 16.23993|Error reported by prdf (0xE500) PLID 0x9000003C
 16.23993|  PRD Signature            : 0x7000E 0xDD3F000E
 16.24269|  Signature Description    : pu.core:k0:n0:s0:p00:c14 (COREFIR[14]) Machine check and ME = 0 Err
 16.24429|  UserData1   : 0x0007000e00000101
 16.24430|  UserData2   : 0xdd3f000e00000000
 16.24430|------------------------------------------------
 16.24616|HWAS|FUNCTIONAL> DIMM[03]=F0F0000000000000
 16.24617|HWAS|FUNCTIONAL> Proc[05]=8800000000000000
 16.24618|HWAS|FUNCTIONAL> Core[07]=03FFCFCFCF0F0000
 16.24737|  Callout type             : Procedure Callout
 16.24737|  Procedure                : UNKNOWN: 0x11
 16.24738|  Priority                 : SRCI_PRIORITY_HIGH
 16.24739|------------------------------------------------
 16.24740|  Callout type             : Hardware Callout
 16.24740|  CPU id                   : 24
 16.24743|  Target                   : Physical:/Sys0/Node0/Proc0/EQ3/EX1/Core0
 16.24743|  Deconfig State           : NO_DECONFIG
 16.24744|  GARD Error Type          : GARD_NULL
 16.24745|  Priority                 : SRCI_PRIORITY_MED
 16.24746|------------------------------------------------
 16.24746|  
 16.24747|------------------------------------------------
 16.24748|  System checkstop occurred during runtime on previous boot
 16.24748|------------------------------------------------
 16.24749|  �
 16.24749|------------------------------------------------
 16.24750|  Hostboot Build ID: 
 16.24751|================================================
 16.55735|ISTEP  6.10 - host_revert_sbe_mcs_setup
 16.56152|ISTEP  6.11 - host_start_occ_xstop_handler
 17.66230|ISTEP  6.12 - host_voltage_config
 17.74889|ISTEP  7. 1 - mss_attr_cleanup
 18.17470|ISTEP  7. 2 - mss_volt
 18.25531|ISTEP  7. 3 - mss_freq
 19.54236|ISTEP  7. 4 - mss_eff_config
 21.24973|ISTEP  7. 5 - mss_attr_update
 21.26906|ISTEP  8. 1 - host_slave_sbe_config
 21.57119|ISTEP  8. 2 - host_setup_sbe
 21.57542|ISTEP  8. 3 - host_cbs_start
 21.59906|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
 26.63456|ISTEP  8. 5 - host_attnlisten_proc
 26.63869|ISTEP  8. 6 - host_p9_fbc_eff_config
 26.64481|ISTEP  8. 7 - host_p9_eff_config_links
 26.65603|ISTEP  8. 8 - proc_attr_update
 26.65956|ISTEP  8. 9 - proc_chiplet_fabric_scominit
 26.69826|ISTEP  8.10 - proc_xbus_scominit
 27.74845|ISTEP  8.11 - proc_xbus_enable_ridi
 27.75591|ISTEP  8.12 - host_set_voltages
 27.83845|ISTEP  9. 1 - fabric_erepair
 27.92621|ISTEP  9. 2 - fabric_io_dccal
 28.66381|ISTEP  9. 3 - fabric_pre_trainadv
 28.67123|ISTEP  9. 4 - fabric_io_run_training
 28.82165|ISTEP  9. 5 - fabric_post_trainadv
 28.83370|ISTEP  9. 6 - proc_smp_link_layer
 28.84574|ISTEP  9. 7 - proc_fab_iovalid
 29.14034|ISTEP  9. 8 - host_fbc_eff_config_aggregate
 29.15201|ISTEP 10. 1 - proc_build_smp
 29.30240|ISTEP 10. 2 - host_slave_sbe_update
 31.95169|ISTEP 10. 4 - proc_cen_ref_clk_enable
 32.07355|ISTEP 10. 5 - proc_enable_osclite
 32.07794|ISTEP 10. 6 - proc_chiplet_scominit
 32.13658|ISTEP 10. 7 - proc_abus_scominit
 32.15334|ISTEP 10. 8 - proc_obus_scominit
 32.15993|ISTEP 10. 9 - proc_npu_scominit
 32.16930|ISTEP 10.10 - proc_pcie_scominit
 34.33582|ISTEP 10.11 - proc_scomoverride_chiplets
 34.34244|ISTEP 10.12 - proc_chiplet_enable_ridi
 34.34892|ISTEP 10.13 - host_rng_bist
 34.35421|ISTEP 10.14 - host_update_redundant_tpm
 34.36154|ISTEP 11. 1 - host_prd_hwreconfig
 34.38880|ISTEP 11. 2 - cen_tp_chiplet_init1
 34.39593|ISTEP 11. 3 - cen_pll_initf
 34.41228|ISTEP 11. 4 - cen_pll_setup
 34.42101|ISTEP 11. 5 - cen_tp_chiplet_init2
 34.42838|ISTEP 11. 6 - cen_tp_arrayinit
 34.43535|ISTEP 11. 7 - cen_tp_chiplet_init3
 34.44235|ISTEP 11. 8 - cen_chiplet_init
 34.45036|ISTEP 11. 9 - cen_arrayinit
 34.45897|ISTEP 11.10 - cen_initf
 34.46497|ISTEP 11.11 - cen_do_manual_inits
 34.47192|ISTEP 11.12 - cen_startclocks
 34.47855|ISTEP 11.13 - cen_scominits
 34.48688|ISTEP 12. 1 - mss_getecid
 34.82699|ISTEP 12. 2 - dmi_attr_update
 34.83562|ISTEP 12. 3 - proc_dmi_scominit
 34.84232|ISTEP 12. 4 - cen_dmi_scominit
 34.84894|ISTEP 12. 5 - dmi_erepair
 34.85692|ISTEP 12. 6 - dmi_io_dccal
 35.18606|ISTEP 12. 7 - dmi_pre_trainadv
 35.19317|ISTEP 12. 8 - dmi_io_run_training
 35.21275|ISTEP 12. 9 - dmi_post_trainadv
 35.21984|ISTEP 12.10 - proc_cen_framelock
 35.22726|ISTEP 12.11 - host_startprd_dmi
 35.23467|ISTEP 12.12 - host_attnlisten_memb
 35.24126|ISTEP 12.13 - cen_set_inband_addr

It is pretty much easily re-producible all the times, and system never boots, it keeps hitting same xstop on every subsequent boot.

As per @maheshsal suggestion tried same boot test with nr_cpus=4 , we are hitting a different checkstop due to core IF Logic Recovery ERR.

[   95.057862234,5] OPAL v6.0-rc1-32-gfbca0a0 starting...
[   95.057865993,7] initial console log level: memory 7, driver 5
[   95.057868152,6] CPU: P9 generation processor (max 4 threads/core)
[   95.057870190,7] CPU: Boot CPU PIR is 0x085c PVR is 0x004e1202
[   95.057872874,7] OPAL table: 0x30103430 .. 0x301039e0, branch table: 0x30002000
[   95.057875706,7] Assigning physical memory map table for nimbus
[   95.057878770,7] Parsing HDAT...
[   95.057880006,5] SPIRA-S found.
[   95.057883229,6] BMC #0: HW version 3, SW version 2, chip DD1.0
[   95.058162248,6] SP Family is ibm,ast2500,supermicro
[   95.058168893,7] LPC: IOPATH chip id = 0
[   95.058170299,7] LPC: FW BAR       = f0000000
[   95.058171840,7] LPC: MEM BAR      = e0000000
[   95.058173356,7] LPC: IO BAR       = d0010000
[   95.058174824,7] LPC: Internal BAR = c0012000
[   95.058186270,7] LPC UART: base addr = 3f8 (3f8) size = 1 clk = 1843200, baud = 115200
[   95.058189138,7] LPC: BT [0, 0] sms_int: 0, bmc_int: 0
[   95.107137046,5] UART: Using UART at 0x60300d00103f8
[   95.110335269,4] CORE[24]: HW_PROC_ID=8 PROC_CHIP_ID=1 EC=0x22 UNAVAILABLE
[   95.115939246,5] P9 DD2.20 detected
[   95.117743264,5] CHIP: Chip ID 0000 type: P9N DD2.2
[   95.120091220,5] P9 DD2.20 detected
[   95.122411855,5] CHIP: Chip ID 0008 type: P9N DD2.2
[   95.125423858,5] PLAT: Using Super����   ��NM�127329678,7] LPC: Routing irq 9, policy: 0 (r=1)
[   95.127331545,7] LPC: SerIRQ 9 using route 0 targetted at OPAL
[   95.133185923,5] PLAT: Detected p9dsu platform
[   95.135506823,5] PLAT: Detected BMC platform SMC
[   95.166128817,5] CPU: All 124 processors called in...
[   97.262675227,5] FLASH: Found system flash:  id:0
[   97.264527993,5] BT: Interface initialized, IO 0x00e4
[   97.267346676,7] LPC: Routing irq 10, policy: 0 (r=1)
[   97.267347940,7] LPC: SerIRQ 10 using route 1 targetted at OPAL
[   98.378361378,3] STB: container NOT VERIFIED, resource_id=4 secureboot not yet initialized
[   99.166388459,5] NVRAM: Size is 576 KB
[   99.378792169,5] STB: Found ibm,secureboot-v2
[   99.380272733,5] STB: secure mode off (FORCED by nvram)
[   99.383117382,5] STB: trusted mode off
[   99.385238454,5] OPAL: Using OPAL UART console
[   99.385250513,7] LPC-MBOX: Sending BMC interrupt
[   99.389503771,7] LPC: Routing irq 4, policy: 0 (r=1)
[   99.389505086,7] LPC: SerIRQ 4 using route 2 targetted at OPAL
[   99.395335909,5] SLW: Enabling: stop0_lite
[   99.397307105,5] SLW: Enabling: stop0
[   99.399434667,5] SLW: Enabling: stop1_lite
[   99.401563056,5] SLW: Enabling: stop1
[   99.402988649,5] SLW: Enabling: stop2_lite
[   99.405118795,5] SLW: Enabling: stop2
[   99.407244817,5] SLW: Disabling: stop4 in OPAL
[   99.409376778,5] SLW: Disabling: stop5 in OPAL
[   99.412718546,5] VAS: Initialized chip 0
[   99.415473533,5] VAS: Initialized chip 8
[   99.417226342,5] Found ibm,power9-nx
[   99.418656779,5] Found ibm,power9-nx
[  100.107935212,5] STB: IMA_CATALOG verified
[  100.453913823,5] STB: CAPP verified
[  100.481115388,5] PCI: Resetting PHBs and training links...
[  102.486595748,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[  102.496927132,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[  105.181938730,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[  105.192191640,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[  107.378499913,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[  107.394496952,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[  110.073627879,3] PHB#0000[0:0]: LINK: Timeout waiting for link up
[  110.076352920,3] PHB#0000[0:0]: Electrical link detected but won't train
[  110.084986117,3] PHB#0030[8:0]: LINK: Timeout waiting for link up
[  110.087574084,3] PHB#0030[8:0]: Electrical link detected but won't train
[  110.091684157,5] PCI: Probing slots...
[  110.149943671,5] PHB#0000:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..ff SLOT=UIO Slot1 
[  110.153339008,5] PHB#0001:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..05 SLOT=PLX 
[  110.157595517,5] PHB#0001:01:00.0 [SWUP] 10b5 8725 R:ca C:060400 B:02..05 SLOT=PLX up 
[  110.161855947,5] PHB#0001:02:01.0 [SWDN] 10b5 8725 R:ca C:060400 B:03..03 SLOT=UIO Slot2 
[  110.166109481,5] PHB#0001:03:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=UIO Slot2
[  110.170375393,5] PHB#0001:02:08.0 [SWDN] 10b5 8725 R:ca C:060400 B:04..04 SLOT=PLX switch 
[  110.174635723,5] PHB#0001:02:09.0 [SWDN] 10b5 8725 R:ca C:060400 B:05..05 SLOT=Onboard LAN 
[  110.178893572,5] PHB#0001:05:00.0 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[  110.183860666,5] PHB#0001:05:00.1 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[  110.188126346,5] PHB#0001:05:00.2 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[  110.193092050,5] PHB#0001:05:00.3 [EP  ] 8086 1589 R:02 C:020000 (      ethernet) LOC_CODE=Onboard LAN
[  110.198060186,5] PHB#0001:01:00.1 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.202321937,5] PHB#0001:01:00.2 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.207287315,5] PHB#0001:01:00.3 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.211550583,5] PHB#0001:01:00.4 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.215815371,5] PHB#0002:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=UIO Slot3 
[  110.220073417,5] PHB#0002:01:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=UIO Slot3
[  110.225039889,5] PHB#0003:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Onboard SAS 
[  110.229299240,5] PHB#0003:01:00.0 [EP  ] 9005 028d R:01 C:010700 (           sas) LOC_CODE=Onboard SAS
[  110.233566083,5] PHB#0004:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..02 SLOT=Onboard BMC 
[  110.237824993,5] PHB#0004:01:00.0 [ETOX] 1a03 1150 R:04 C:060400 B:02..02 LOC_CODE=Onboard BMC
[  110.242793713,5] PHB#0004:02:00.0 [PCID] 1a03 2000 R:41 C:030000 (           vga) LOC_CODE=Onboard BMC
[  110.247053892,5] PHB#0005:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Onboard USB 
[  110.251314805,5] PHB#0005:01:00.0 [EP  ] 104c 8241 R:02 C:0c0330 (      usb-xhci) LOC_CODE=Onboard USB
[  110.256282501,5] PHB#0030:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..ff SLOT=WIO Slot3 
[  110.260543276,5] PHB#0031:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=WIO-R Slot 
[  110.264798709,5] PHB#0031:01:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=WIO-R Slot
[  110.269064452,5] PHB#0032:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..09 SLOT=PLX 
[  110.273323819,5] PHB#0032:01:00.0 [SWUP] 10b5 8725 R:ca C:060400 B:02..09 SLOT=PLX up 
[  110.277585720,5] PHB#0032:02:01.0 [SWDN] 10b5 8725 R:ca C:060400 B:03..03 SLOT=WIO Slot1 
[  110.281134434,5] PHB#0032:03:00.0 [EP  ] 1000 00c9 R:01 C:010700 (           sas) LOC_CODE=WIO Slot1
[  110.286103054,5] PHB#0032:02:08.0 [SWDN] 10b5 8725 R:ca C:060400 B:04..04 SLOT=PLX switch 
[  110.290364523,5] PHB#0032:02:09.0 [SWDN] 10b5 8725 R:ca C:060400 B:05..09 SLOT=WIO Slot2 
[  110.294621238,5] PHB#0032:01:00.1 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.298881628,5] PHB#0032:01:00.2 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.303847066,5] PHB#0032:01:00.3 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.308109936,5] PHB#0032:01:00.4 [EP  ] 10b5 87d0 R:ca C:088000 (system-peripheral) LOC_CODE=PLX
[  110.312374013,5] PHB#0033:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=WIO Slot4 
[  110.316637968,5] PHB#0033:01:00.0 [EP  ] 11f8 f117 R:06 C:010802 (  mass-storage) LOC_CODE=WIO Slot4
[  115.204442675,5] STB: BOOTKERNEL verified
[  116.050292900,5] IPMI: Resetting boot count on successful boot
[  116.052919690,5] INIT: Waiting for kernel...
[  116.055052084,5] INIT: 64-bit LE kernel discovered
[  116.057190055,5] OCC: All Chip Rdy after 0 ms
[  116.086475911,5] INIT: Starting kernel at 0x20011000, fdt at 0x30a96c70 364127 bytes
[  117.271673028,5] OPAL: Switch to little-endian OS

--== Welcome to Hostboot  ==--

  2.76615|secure|SecureROM valid - enabling functionality
  5.41444|Ignoring boot flags, incorrect version 0x0
  5.41805|Booting from SBE side 0 on master proc=00050000
  5.61507|ISTEP  6. 5 - host_init_fsi
  5.71884|ISTEP  6. 6 - host_set_ipl_parms
  5.96470|ISTEP  6. 7 - host_discover_targets
  7.75894|HWAS|PRESENT> DIMM[03]=F0F0000000000000
  7.75895|HWAS|PRESENT> Proc[05]=8800000000000000
  7.75896|HWAS|PRESENT> Core[07]=03FFCFCFCF0F0000
  7.95769|ISTEP  6. 8 - host_update_master_tpm
 15.52130|SECURE|Security Access Bit> 0xC000000000000000
 15.52131|SECURE|Secure Mode Disable (via Jumper)> 0x0000000000000000
 15.52262|ISTEP  6. 9 - host_gard
 16.08002|================================================
 16.08190|Error reported by prdf (0xE500) PLID 0x90000014
 16.08191|  PRD Signature            : 0x7002F 0xDD3F000E
 16.08733|  Signature Description    : pu.core:k0:n0:s0:p01:c23 (COREFIR[14]) Machine check and ME = 0 Err
 16.08908|  UserData1   : 0x0007002f00000101
 16.08908|  UserData2   : 0xdd3f000e00000000
 16.08909|------------------------------------------------
 16.09123|HWAS|Deconfig HUID 0x00070024, Physical:/Sys0/Node0/Proc1/EQ3/EX0/Core0
 16.09273|  Callout type             : Procedure Callout
 16.09274|  Procedure                : UNKNOWN: 0x11
 16.09274|  Priority                 : SRCI_PRIORITY_HIGH
 16.09275|------------------------------------------------
 16.09276|  Callout type             : Hardware Callout
 16.09277|  CPU id                   : 26
 16.09279|  Target                   : Physical:/Sys0/Node0/Proc1/EQ5/EX1/Core1
 16.09280|  Deconfig State           : NO_DECONFIG
 16.09280|  GARD Error Type          : GARD_NULL
 16.09281|  Priority                 : SRCI_PRIORITY_MED
 16.09282|------------------------------------------------
 16.09282|  
 16.09283|------------------------------------------------
 16.09284|  System checkstop occurred during runtime on previous boot
 16.09285|------------------------------------------------
 16.09285|  �
 16.09286|------------------------------------------------
 16.09286|  Hostboot Build ID: 
 16.09287|================================================
 16.09628|HWAS|FUNCTIONAL> DIMM[03]=F0F0000000000000
 16.09629|HWAS|FUNCTIONAL> Proc[05]=8800000000000000
 16.09631|HWAS|FUNCTIONAL> Core[07]=03FFCFCFC70F0000
 16.31779|ISTEP  6.10 - host_revert_sbe_mcs_setup
 16.32930|ISTEP  6.11 - host_start_occ_xstop_handler
 17.39409|ISTEP  6.12 - host_voltage_config
maheshsal commented 6 years ago

16.23569|================================================ 16.23993|Error reported by prdf (0xE500) PLID 0x9000003C 16.23993| PRD Signature : 0x7000E 0xDD3F000E 16.24269| Signature Description : pu.core:k0:n0:s0:p00:c14 (COREFIR[14]) Machine check and ME = 0 Err 16.24429| UserData1 : 0x0007000e00000101 16.24430| UserData2 : 0xdd3f000e00000000 16.24430|------------------------------------------------

This means we got hit two consecutive MCEs during very small window where MCE handler was handling first MCE error. It is expected that we get system xstop when this happens.

We should try following upstream commit in the petitboot kernel:

https://git.kernel.org/powerpc/c/75ecfb49516c53da00c57b9efe48fa

pridhiviraj commented 6 years ago

@maheshsal We have observed that the above patch is already in petitboot kernel.

mpe commented 6 years ago

This is not the right place for petitkernel bugs.

I think you should file them here: https://github.com/open-power/op-build/issues

pridhiviraj commented 6 years ago

@mpe As @mikey suggested this is a kind of upstream kernel issue with nr_cpus(less tested till now), even though it re-created with petitboot kernel. That's why i opened the issue here.

mpe commented 6 years ago

OK that's not very clear from the bug :)

mikey commented 6 years ago

yeah I figured this was likely an upstream bug rather than petitboot specific.

maheshsal commented 6 years ago

With non-SMT nr_cpus I see this with petitboot kernel:

[ 2.826369] Faulting instruction address: 0xc000000000022148 cpu 0x0: Vector: 300 (Data Access) at [c00000000111f7d0] pc: c000000000022148: lwarx_loop_stop+0x0/0x24 lr: c00000000004e1d4: power9_idle_type+0x4c/0x70 sp: c00000000111fa50 msr: 9000000000001033 dar: 0 dsisr: 80000 current = 0xc0000000010e9e80 paca = 0xc00000000fff0000 softe: 3 irq_happened: 0x01 pid = 0, comm = swapper/0 Linux version 4.16.7-openpower1 (jenkins@jenkins-vm) (gcc version 6.4.0 (Buildroot 2018.02.1-00006-ga8d1126)) #1 SMP Tue May 8 17:43:36 UTC 2018 enter ? for help [c00000000111fd40] c00000000004e1d4 power9_idle_type+0x4c/0x70 [c00000000111fd80] c000000000524ee8 stop_loop+0x38/0x48 [c00000000111fdb0] c000000000523058 cpuidle_enter_state+0x14c/0x20c [c00000000111fe00] c0000000000acf40 call_cpuidle+0x6c/0x74 [c00000000111fe20] c0000000000ad1e4 do_idle+0x1ec/0x200 [c00000000111fea0] c0000000000ad384 cpu_startup_entry+0x30/0x34 [c00000000111fed0] c00000000000cf8c rest_init+0xd8/0xe4 [c00000000111ff00] c000000001003d00 start_kernel+0x4fc/0x504 [c00000000111ff90] c00000000000ac70 start_here_common+0x1c/0x4ac 0:mon>

also on upstream kernel:

[ 2.101374] Unable to handle kernel paging request for data at address 0x00000000 [ 2.101413] Unable to handle kernel paging request for data at address 0x00000000 [ 2.101436] Faulting instruction address: 0xc00000000003442c [ 2.101487] Faulting instruction address: 0xc00000000003442c [ 2.101489] Oops: Kernel access of bad area, sig: 7 [#1] [ 2.101605] LE SMP NR_CPUS=2048 NUMA PowerNV [ 2.101652] Modules linked in: [ 2.101690] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.17.0-rc5mahesh #3 [ 2.101779] NIP: c00000000003442c LR: c0000000000a41f0 CTR: c0000000000343e8 [ 2.101858] REGS: c0000000016b77a0 TRAP: 0300 Not tainted (4.17.0-rc5mahesh) [ 2.101909] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR: 44000284 XER: 20040000 [ 2.101966] CFAR: c00000000003441c DAR: 0000000000000000 DSISR: 00080000 SOFTE: 1 [ 2.101966] GPR00: 0000000000000000 c0000000016b7a20 c0000000016b8b00 0000000000000005 [ 2.101966] GPR04: 0000000000000004 c0000000015356d0 0000000000000000 0000000000000000 [ 2.101966] GPR08: 0000000000000000 0000000000000000 c0000000016b4000 c000007fdee175a0 [ 2.101966] GPR12: c000000000ad2930 c000000001a30000 0000000000000000 0000000000000000 [ 2.101966] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2.101966] GPR20: 0000000000000000 0000000000000001 0000000010004d9c 00000000100053ed [ 2.101966] GPR24: 0000000000000008 0000000000000008 0000000000000000 0000000000000008 [ 2.101966] GPR28: c0000000015ece30 00000000003003ff 0000000000300375 0000000000300375 [ 2.102464] NIP [c00000000003442c] lwarx_loop_stop+0x0/0x24 [ 2.102500] LR [c0000000000a41f0] __power9_idle_type+0x80/0xb0 [ 2.102543] Call Trace: [ 2.102563] [c0000000016b7a20] [c0000000016b7a70] init_stack+0x3a70/0x4000 (unreliable) [ 2.102616] [c0000000016b7d10] [c0000000000a41f0] __power9_idle_type+0x80/0xb0 [ 2.102669] [c0000000016b7d60] [c0000000000a4800] power9_idle_type+0x20/0x40 [ 2.102777] [c0000000016b7d80] [c000000000ad2970] stop_loop+0x40/0x5c [ 2.102839] [c0000000016b7db0] [c000000000aced34] cpuidle_enter_state+0xa4/0x400 [ 2.102947] [c0000000016b7e10] [c00000000014961c] call_cpuidle+0x4c/0x90 [ 2.103046] [c0000000016b7e30] [c000000000149c4c] do_idle+0x32c/0x3d0 [ 2.103136] [c0000000016b7ea0] [c000000000149f2c] cpu_startup_entry+0x3c/0x50 [ 2.103243] [c0000000016b7ed0] [c00000000000df90] rest_init+0xe0/0x100 [ 2.103324] [c0000000016b7f00] [c000000001084330] start_kernel+0x614/0x634 [ 2.103414] [c0000000016b7f90] [c00000000000ac7c] start_here_common+0x1c/0x4a0 [ 2.103519] Instruction dump: [ 2.103574] f86d09b8 39800000 4800038c 60000000 60000000 e8a28080 e8850000 7c232000 [ 2.103666] 40800008 4c0002e4 88ed09a9 e9cd09a0 <7de07028> 75e91000 40c2fe35 7def3878

maheshsal commented 6 years ago

Looks like we are hitting a bug in idle code for non-SMT nr_cpus.. in this case lwarx_loop_stop tries to access NULL pointer paca->core_idle_state. This is because with non-SMT number of nr_cpus the pnv_alloc_idle_core_states() function does not allocate core_idle_state for last core.

.Lhandle_deep_stop:
/*
 * Entering deep idle state.
 * Clear thread bit in PACA_CORE_IDLE_STATE, save SPRs to
 * stack and enter stop
 */
        lbz     r7,PACA_THREAD_MASK(r13)
        ld      r14,PACA_CORE_IDLE_STATE_PTR(r13)

lwarx_loop_stop:
        lwarx   r15,0,r14
        andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
        bnel-   core_idle_lock_held
        andc    r15,r15,r7                      /* Clear thread bit */

        stwcx.  r15,0,r14
        bne-    lwarx_loop_stop
        isync

e.g. if nr_cpus=3 then nr_cores are set to 0 and the code never enter into for loop. With this when cpu wakes up from deep sleep, lwarx_loop_stop() tries to access pointer paca->core_idle_state and crashes.

static inline int cpu_nr_cores(void)
{
        return nr_cpu_ids >> threads_shift;
}

static void pnv_alloc_idle_core_states(void)
{
        int i, j;
        int nr_cores = cpu_nr_cores();
        u32 *core_idle_state;
[...]
    for (i = 0; i < nr_cores; i++) {
                int first_cpu = i * threads_per_core;
                int node = cpu_to_node(first_cpu);
                size_t paca_ptr_array_size;

                core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
                *core_idle_state = (1 << threads_per_core) - 1;
                paca_ptr_array_size = (threads_per_core *
                                       sizeof(struct paca_struct *));

                for (j = 0; j < threads_per_core; j++) {
                        int cpu = first_cpu + j;

                        paca_ptrs[cpu]->core_idle_state_ptr = core_idle_state;
                        paca_ptrs[cpu]->thread_idle_state = PNV_THREAD_RUNNING;
                        paca_ptrs[cpu]->thread_mask = 1 << j;
                        if (!cpu_has_feature(CPU_FTR_POWER9_DD1))
                                continue;
                        paca_ptrs[cpu]->thread_sibling_pacas =
                                kmalloc_node(paca_ptr_array_size,
                                             GFP_KERNEL, node);
                }
        }
npiggin commented 1 year ago

The PowerNV idle code was rewritten since then and it doesn't seem to have this problem, although the smt4 catch/release workaround might access sibling pacas without checking they were allocated. Generally this is a bad configuration for PowerNV because the other SMT threads will be left spinning in skiboot and they can't go to 'stop' state because skiboot doesn't own the 0x100 vector, so the core will be forced into SMT modes even when idle.

Things like the smt4 catch/release workaround wouldn't work either, for the same reason.

We probably should just disallow nr_cpus % nr_threads != 0 for PowerNV for this reason. Actually nr_cpus= in any case is a bit risky because we could get MCEs or HMIs or SRESETs on those CPUs, and they'll be spinning burning more power than they need to. If the PowerNV platform implemented nr_cpus= by starting all secondaries but parking the missing ones in an offline state, it would be more realistic to allow the option.

cpu_nr_cores() is probably not a very well implemented API either. That at least should be taken out of cputhreads.h and moved into kvm internals.