open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

SBE update failure #133

Closed: madscientist159 closed 6 years ago

madscientist159 commented 6 years ago

We are in the process of updating our Talos PNOR to use the latest upstream components. The built PNOR is not updating the SBE as expected, and is failing to IPL in hostboot. This is with the latest hostboot Git master.

Still investigating cause.

  3.09910|secure|SecureROM valid - enabling functionality
  5.24891|Booting from SBE side 0 on master proc=00050000
  5.29564|ISTEP  6. 5 - host_init_fsi
  6.06609|ISTEP  6. 6 - host_set_ipl_parms
  6.33933|ISTEP  6. 7 - host_discover_targets
 10.66179|HWAS|PRESENT> DIMM[03]=0020000000000000
 10.66180|HWAS|PRESENT> Proc[05]=8800000000000000
 10.66181|HWAS|PRESENT> Core[07]=9500001848000000
 10.70352|ISTEP  6. 8 - host_update_master_tpm
 10.70476|SECURE|Security Access Bit> 0x0000000000000000
 10.70477|SECURE|Secure Mode Disable (via Jumper)> 0xC000000000000000
 10.70490|ISTEP  6. 9 - host_gard
 10.79127|HWAS|FUNCTIONAL> DIMM[03]=0020000000000000
 10.79129|HWAS|FUNCTIONAL> Proc[05]=8800000000000000
 10.79130|HWAS|FUNCTIONAL> Core[07]=9500001848000000
 10.79607|ISTEP  6.10 - host_revert_sbe_mcs_setup
 10.79711|ISTEP  6.11 - host_start_occ_xstop_handler
 11.56240|ISTEP  6.12 - host_voltage_config
 11.60575|ISTEP  7. 1 - mss_attr_cleanup
 13.29628|ISTEP  7. 2 - mss_volt
 13.35982|ISTEP  7. 3 - mss_freq
 13.72665|ISTEP  7. 4 - mss_eff_config
 14.27008|ISTEP  7. 5 - mss_attr_update
 14.28383|ISTEP  8. 1 - host_slave_sbe_config
 14.51119|ISTEP  8. 2 - host_setup_sbe
 14.51464|ISTEP  8. 3 - host_cbs_start
 14.53188|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
 18.55198|ISTEP  8. 5 - host_attnlisten_proc
 18.55275|ISTEP  8. 6 - host_p9_fbc_eff_config
 18.55867|ISTEP  8. 7 - host_p9_eff_config_links
 18.56507|ISTEP  8. 8 - proc_attr_update
 18.56634|ISTEP  8. 9 - proc_chiplet_fabric_scominit
 18.59373|ISTEP  8.10 - proc_xbus_scominit
 19.57732|ISTEP  8.11 - proc_xbus_enable_ridi
 19.58181|ISTEP  8.12 - host_set_voltages
 19.65429|ISTEP  9. 1 - fabric_erepair
 19.71062|ISTEP  9. 2 - fabric_io_dccal
 20.42056|ISTEP  9. 3 - fabric_pre_trainadv
 20.42308|ISTEP  9. 4 - fabric_io_run_training
 20.55745|ISTEP  9. 5 - fabric_post_trainadv
 20.56016|ISTEP  9. 6 - proc_smp_link_layer
 20.56608|ISTEP  9. 7 - proc_fab_iovalid
 20.83057|ISTEP  9. 8 - host_fbc_eff_config_aggregate
 20.83953|ISTEP 10. 1 - proc_build_smp
 20.98086|ISTEP 10. 2 - host_slave_sbe_update
 21.92908|System shutting down with error status 0x90000012
 21.94443|================================================
 21.94595|Error reported by initservice (0x0500) PLID 0x90000012
 21.94596|  Initialization Service launched a function and the task returned an error.
 21.94597|  ModuleId   0x01 BASE_INITSVC_MOD_ID
 21.94748|  ReasonCode 0x0506 WAIT_FN_FAILED
 21.94749|  UserData1  task id or task return code : 0x00000000000000ec
 21.94750|  UserData2  returned status from task : 0x0000000000000001
 21.94751|------------------------------------------------
 21.95054|  Callout type             : Procedure Callout
 21.95055|  Procedure                : EPUB_PRC_HB_CODE
 21.95056|  Priority                 : SRCI_PRIORITY_HIGH
 21.95057|------------------------------------------------
 21.95057|  host_slave_sbe_update
 21.95058|------------------------------------------------
 21.95058|  Hostboot Build ID:
 21.95059|================================================
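
For anyone triaging a similar console capture, here is a minimal sketch that pulls out the last ISTEP reached before the shutdown, the error status, and the reason code. The regular expressions are assumptions inferred from the line shapes in the log above, not a documented hostboot log format, and `summarize` is a hypothetical helper name.

```python
import re

# Excerpt copied from the console log above (timestamp|message).
SAMPLE_LOG = """\
 20.83953|ISTEP 10. 1 - proc_build_smp
 20.98086|ISTEP 10. 2 - host_slave_sbe_update
 21.92908|System shutting down with error status 0x90000012
 21.94748|  ReasonCode 0x0506 WAIT_FN_FAILED
"""

# Patterns are assumptions based on the lines shown in this issue.
ISTEP_RE = re.compile(r"\|ISTEP\s+(\d+)\.\s*(\d+)\s+-\s+(\S+)")
SHUTDOWN_RE = re.compile(
    r"\|System shutting down with error status (0x[0-9a-fA-F]+)"
)
REASON_RE = re.compile(r"\|\s*ReasonCode\s+(0x[0-9a-fA-F]+)\s+(\S+)")


def summarize(log_text):
    """Return the last ISTEP seen before shutdown plus the error details."""
    last_istep = None
    status = None
    reason = None
    for line in log_text.splitlines():
        m = ISTEP_RE.search(line)
        if m and status is None:
            # Track ISTEP lines only until the shutdown message appears.
            last_istep = (int(m.group(1)), int(m.group(2)), m.group(3))
            continue
        m = SHUTDOWN_RE.search(line)
        if m and status is None:
            status = m.group(1)
            continue
        m = REASON_RE.search(line)
        if m and reason is None:
            reason = (m.group(1), m.group(2))
    return {"last_istep": last_istep, "status": status, "reason": reason}


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Run against the full log, this should point at the same step named in the error block, host_slave_sbe_update.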

dcrowell77 commented 6 years ago

Do you happen to have a Cronus debug connection? If not, we'll have to keep our eyes open for a fail here. Unfortunately this kind of fail is one of the most annoying to debug.

madscientist159 commented 6 years ago

@dcrowell77 Yes, we have Cronus. How should we proceed with debug?

madscientist159 commented 6 years ago

Downgrading hostboot to 1e784c03824d66dd76ee5effe16b55782c703599 appears to bypass this issue.

dcrowell77 commented 6 years ago

> Downgrading hostboot to 1e784c0 appears to bypass this issue.

Which commit did you first notice the fail on?

Using our debug tools on top of Cronus, run all of these from inside your Cronus session:

  export PROJECT_ROOT=/op-build/output/build/hostboot-/
  export HB_DFRAME=$HOSTBOOTROOT/src/build/debug/
  $HB_DFRAME/ecmd-debug-framework.pl --tool=Printk
  $HB_DFRAME/ecmd-debug-framework.pl --tool=Trace
  ...etc...

See all of our debug tools under src/build/debug/Hostboot/

I suspect that you'll see some kind of exception in the Printk output.

madscientist159 commented 6 years ago

Failure is seen on GIT hash 739ec89c67cde105301ab9aa11adf2c420efa6eb

Thanks for the instructions, will see if I can get time to try it out shortly.

ghost commented 6 years ago

git bisect start
# good: [1e784c03824d66dd76ee5effe16b55782c703599] Handle early life PNOR fails in HBRT instead of hanging
git bisect good 1e784c03824d66dd76ee5effe16b55782c703599
# bad: [739ec89c67cde105301ab9aa11adf2c420efa6eb] When FSI initialized by SP only use enable reg for detection
git bisect bad 739ec89c67cde105301ab9aa11adf2c420efa6eb
# good: [744277d9a5c546340a011ea36a18471bd3cdcb85] Enhance p9_extract_sbe_rc
git bisect good 744277d9a5c546340a011ea36a18471bd3cdcb85
# bad: [18dba5172c7d022d5b5b119d758fe167868cb00d] PRD: getConnectedDimm support for MBA/MCA
git bisect bad 18dba5172c7d022d5b5b119d758fe167868cb00d
# good: [e84f5604125d704d098efbea74f8368060be593d] Ensure runtime lib is loaded for IPC_POPULATE_TPM_INFO_BY_NODE
git bisect good e84f5604125d704d098efbea74f8368060be593d
# bad: [cde4990515a7a190fca7a3eb9f722f74c12acdb2] Cleanup the fix for "zero length dump on single node systems".
git bisect bad cde4990515a7a190fca7a3eb9f722f74c12acdb2
# bad: [f5cd23d6c3be17356e0851ec5d5bb65cee48f15f] Mark Read-Only Partitions as Such
git bisect bad f5cd23d6c3be17356e0851ec5d5bb65cee48f15f
# first bad commit: [f5cd23d6c3be17356e0851ec5d5bb65cee48f15f] Mark Read-Only Partitions as Such

Just confirming whether that is indeed the commit that, if reverted, fixes things.
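
In case it helps anyone repeating the bisect, here is a small sketch that reads `git bisect log` output and reports the good/bad marks and the first bad commit. It relies only on the comment lines git writes into the log (`# good:`, `# bad:`, `# first bad commit:`), as seen above; `parse_bisect_log` is a hypothetical helper name and the embedded text is an excerpt of the log above, not the full session.

```python
import re

# Excerpt of the bisect log above (not the full session).
BISECT_LOG = """\
git bisect start
# good: [1e784c03824d66dd76ee5effe16b55782c703599] Handle early life PNOR fails in HBRT instead of hanging
git bisect good 1e784c03824d66dd76ee5effe16b55782c703599
# bad: [739ec89c67cde105301ab9aa11adf2c420efa6eb] When FSI initialized by SP only use enable reg for detection
git bisect bad 739ec89c67cde105301ab9aa11adf2c420efa6eb
# bad: [f5cd23d6c3be17356e0851ec5d5bb65cee48f15f] Mark Read-Only Partitions as Such
git bisect bad f5cd23d6c3be17356e0851ec5d5bb65cee48f15f
# first bad commit: [f5cd23d6c3be17356e0851ec5d5bb65cee48f15f] Mark Read-Only Partitions as Such
"""

# Matches the comment lines in the log: "# <kind>: [<40-hex sha>] <subject>".
MARK_RE = re.compile(
    r"^# (good|bad|first bad commit): \[([0-9a-f]{40})\] (.*)$"
)


def parse_bisect_log(text):
    """Return (marks, first_bad) parsed from `git bisect log` output.

    marks is a list of (kind, sha, subject) tuples in log order;
    first_bad is (sha, subject), or None if the bisect has not finished.
    """
    marks = []
    first_bad = None
    for line in text.splitlines():
        m = MARK_RE.match(line)
        if not m:
            continue  # skip the "git bisect ..." command lines
        kind, sha, subject = m.groups()
        if kind == "first bad commit":
            first_bad = (sha, subject)
        else:
            marks.append((kind, sha, subject))
    return marks, first_bad


if __name__ == "__main__":
    marks, first_bad = parse_bisect_log(BISECT_LOG)
    for kind, sha, subject in marks:
        print(f"{kind:4} {sha[:7]} {subject}")
    if first_bad:
        print("first bad commit:", first_bad[0][:7], first_bad[1])
```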

ghost commented 6 years ago

Yep, it's f5cd23d6c3be17356e0851ec5d5bb65cee48f15f. If I revert that one commit, I can boot. Disable the revert and I cannot.

ghost commented 6 years ago

I posted something in the internal gerrit that backs out that commit... hopefully some Hostboot folk can point out how I'm incredibly wrong somehow. It looks like I'll have to bring this in as a patch in op-build for a bit though, as otherwise it breaks booting my Boston DD2.2 system.