open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0
75 stars 97 forks source link

Hostboot boot fail without inserting TPM card #233

Closed lili-lilili closed 12 months ago

lili-lilili commented 1 year ago

We are testing the tpm less feature. When tpm card removed from slot, hostboot will boot fail.

Here is the log: https://github.com/lili-lilili/share_files/blob/main/tpm-less-halt-in-hostboot.txt

I think this issue is due to hostboot attempting to extend PCR through PSU, but the TPM has been deconfigured by SBE.

lili-lilili commented 1 year ago

Here is the log when PSU timeout:

42.75195|SBE|<<updateProcessorSbeSeeproms() 42.75272|PNOR|>>PNOR::validateAltMaster 42.75343|PNOR|<<PNOR::validateAltMaster : RC=0 42.75344|PNOR|>>PNOR::copyPnorPartitionToAlt(29=EECACHE)> 42.75414|PNOR|I>PNOR::copyPnorPartitionToAlt> There are no alternate PNORs to update 42.75415|PNOR|<<PNOR::copyPnorPartitionToAlt(29=EECACHE)> 42.75486|SBE|<<getSbeBootSeeprom(): o_bootSide=0x0 o_mSide=0x0 (reg=0x0, tgt=0x50000) 42.75487|SBEIO|psudd: Sending Req = 000000000009D703 0000000000000100 00000000E1A4BC80 0000000000000000 to 00050000 42.75491|INTR|IntrRp::msgHandler(): LSI Interrupt Detected 42.75492|INTR|IntrRp::msgHandler() lsiIntStatus 0x0004000000000000 42.75493|INTR|IntrRp::msgHandler() External Interrupt found for pir: 0x0,interrupt type: 13 42.75586|SBEIO|psudd: SbePsu::msgHandler got MSG_INTR message 42.75608|INTR|IntrRp::completeInterruptProcessing() Removing pending interrupt for pir: 0x0,interrupt type: 13 42.76950|SBE|getSeepromSideVersionViaSPI(): eccStatus=0, sizeof o_info/sI=80, sI_ECC=90, origin golden=0 42.76955|SBE|<<getSbeBootSeeprom(): o_bootSide=0x0 o_mSide=0x0 (reg=0x0, tgt=0x50001) 42.76958|SBEIO|sendSetFFDCAddr: >>sending set FFDC address from HB -> SBE for Proc 0x1, FFDC at 0x00000000e0a11000 for 8192, Comm at 0x0000000000000000 for 0 42.76959|SBEIO|psudd: Sending Req = 00000000000AD704 0000000000000000 00000000E0A11000 0000000000000000 to 00050001 151.22952|INTR|---ESB States--- 151.22953|INTR|IC LSI ESB = 0x0000000000000000 151.22953|INTR|Processor 0x50000 151.22954|INTR| SRC: 00 State: OFF 151.22955|INTR| SRC: 01 State: OFF 151.22956|INTR| SRC: 02 State: OFF 151.22956|INTR| SRC: 03 State: OFF 151.22957|INTR| SRC: 04 State: OFF 151.22957|INTR| SRC: 05 State: OFF 151.22958|INTR| SRC: 06 State: OFF 151.22958|INTR| SRC: 07 State: OFF 151.22959|INTR| SRC: 08 State: OFF 151.22959|INTR| SRC: 09 State: OFF 151.22960|INTR| SRC: 10 State: OFF 151.22960|INTR| SRC: 11 State: OFF 151.22961|INTR| SRC: 12 State: OFF 151.22961|INTR| SRC: 13 State: RESET 151.22962|INTR|---LSI Sources--- 151.22962|INTR|Processor 0x50000 151.22963|INTR| lsiIntStatus : vAddr=0x0000021800000080 Value=0x0000000000000000 151.22965|INTR| lsiIntLevel : vAddr=0x0000021800000078 Value=0x0000000000000000 151.22965|INTR|---PSIHB Info--- 151.22966|INTR|Processor 0x50000 151.22967|INTR| PSIHB Ctrl/Status Reg : vAddr=0x0000021800000020 Value=0x10f0100000000000 151.22968|INTR| PSIHB Error/Status Reg : vAddr=0x0000021800000028 Value=0x0000fff0fff0f800 151.22970|INTR| PSIHB Dbg Setting Reg : vAddr=0x0000021800000038 Value=0x0000000000000000 151.22971|INTR| PSIHB Interrupt Control Reg : vAddr=0x0000021800000058 Value=0x8000000000000000 151.24308|SBEIO|psudd: Error: PSU Timeout and no FFDC present

dcrowell77 commented 1 year ago

This behavior is unexpected, we're discussing what might be going wrong. If you want to get past this temporarily you could comment out everything from "// Unlock the TPM SPI engine" through "// 7) Send TPM Extend Mode chip-op with Exit crontol flag" in call_proc_build_smp.C.

What version of Hostboot are you using? If you aren't using something pretty recent from master-p10 branch you should check if there were any changes in this flow. I know it was a post-GA change but didn't look into the details. The same is true for the SBE tree as well. I would suggest retrying with the most recent level of SBE code to see if there was a possible fix.

dcrowell77 commented 1 year ago

Did you have any luck skipping the logic I suggested?

lili-lilili commented 1 year ago

The following modify seems work. But we still works for the root cause.

diff --git a/src/usr/isteps/istep10/call_proc_build_smp.C b/src/usr/isteps/istep10/call_proc_build_smp.C
index a6b2845..ccb6c02 100644
--- a/src/usr/isteps/istep10/call_proc_build_smp.C
+++ b/src/usr/isteps/istep10/call_proc_build_smp.C
@@ -842,7 +842,7 @@ void* call_proc_build_smp (void *io_pArgs)
         }

         // d) Call HWP to enable SMP fabric (pb hotplug) (legacy path)
-        if (!l_sbeExtend)
+        if (!(l_sbeExtend && TRUSTEDBOOT::functionalPrimaryTpmExists() ))
         {
             TRACFCOMP( ISTEPS_TRACE::g_trac_isteps_trace,
                     "Call p10_build_smp(SMP_ACTIVATE_SWITCH)" );
-- 
2.17.1
lili-lilili commented 1 year ago

When TPM less,the spi for tpm unlock fail. "Register 0xC0082" is a hardware lock for SPI controller? It seems that the xscom operation to write to the register has failed.

Any suggestions?

40.91134|XSCOM|xscomDoOp - RESOURCE OCCUPIED LOOP Cntr = 100000: OpType 0x0000000000000001, Address 0xC0082, MMIO Address 0x600410, HMER=0000000000000000
 40.96664|XSCOM|xscomDoOp - RESOURCE OCCUPIED LOOP Cntr = 200000: OpType 0x0000000000000001, Address 0xC0082, MMIO Address 0x600410, HMER=0000000000000000
 41.02193|XSCOM|xscomDoOp - RESOURCE OCCUPIED LOOP Cntr = 300000: OpType 0x0000000000000001, Address 0xC0082, MMIO Address 0x600410, HMER=0000000000000000
 41.07726|XSCOM|xscomDoOp - RESOURCE OCCUPIED LOOP Cntr = 400000: OpType 0x0000000000000001, Address 0xC0082, MMIO Address 0x600410, HMER=0000000000000000
 41.13230|XSCOM|xscomDoOp - RESOURCE OCCUPIED LOOP Cntr = 500000: OpType 0x0000000000000001, Address 0xC0082, MMIO Address 0x600410, HMER=0000000000000000
 41.18760|XSCOM|xscomDoOp - RESOURCE OCCUPIED LOOP Cntr = 600000: OpType 0x0000000000000001, Address 0xC0082, MMIO Address 0x600410, HMER=0000000000000000
 41.18760|XSCOM|Giving up, we're still locked on 0x00000000000C0082...
 41.18761|XSCOM|E>XSCOM status error HMER: 00c0010000000000 ,XSComStatus = 1, Addr=c0082
 41.18777|ERRL|E>Error created : PLID=90000F27, EID=90000F27, RC=0401, Mod=07, Userdata=00C0010000000000 000E03FC00600410, Sev=UNRECOVERABLE
 41.18779|XSCOM|collectXscomFFDC: XSCOM COLLECT FFDC STARTED
 41.18797|XSCOM|>>resetScomEngine
 41.18799|XSCOM|I>resetScomEngine: RESET SUCCEEDED: XscomAddr = 0009001800000000, VAddr=23000000000
 41.18808|XSCOM|I>resetScomEngine: RESET SUCCEEDED: XscomAddr = 0009001200000001, VAddr=20800000000
 41.18809|XSCOM|I>resetScomEngine: RESET SUCCEEDED: XscomAddr = 0009001300000001, VAddr=20800000000
 41.18810|XSCOM|<<resetScomEngine
 41.18811|ERRL|>>addProcedureCallout(0x55, 0x6)
 41.18812|ERRL|>>addHwCallout(0x00050000 0x1 0x2 0x0)
 41.18813|FAPI_I|plat_hw_access.C: platPutScom: deviceWrite returns error!
 41.18814|FAPI_I|plat_hw_access.C: platPutScom failed - Target pu:k0:n0:s0:p00, Addr 00000000000C0082
 41.18816|FAPI_I|p10_sbe_spi_cmd.C: putScom( i_handle.target_chip, i_handle.base_addr | SPIM_CONFIGREG1, data64) failed.
 41.18816|FAPI_I|plat_utils.C: rcToErrl: PLAT error: 0x02000001
 41.18821|ISTEPS_TRACE|call_proc_build_smp: Failure trying to release TPM SPI atomic lock, proc 0x00050000. Returning errorlog, reason=0x401
lili-lilili commented 1 year ago

update more:

  1. firmware version is fw1030
  2. If comment out following code, machine can boot up success too.
            // Unlock the TPM SPI engine
            SpiControlHandle l_spi_handle(l_fapi2_boot_proc,
                                          TPM_SPI_ENGINE,
                                          TPM_SPI_USE_CHIP_SELECT_0,
                                          true); // i_pib_access
            FAPI_INVOKE_HWP( l_errl,
                             spi_master_unlock,
                             l_spi_handle,
                             HOSTBOOT_PIB_MASTER_ID );
            if (l_errl != nullptr)
            {
                TRACFCOMP( ISTEPS_TRACE::g_trac_isteps_trace,
                    "call_proc_build_smp: Failure trying to release TPM SPI atomic lock, proc 0x%.8X. Returning errorlog, reason=0x%x",
                    TARGETING::get_huid(l_bootProc),
                    l_errl->reasonCode() );
                break;
            }
dcrowell77 commented 1 year ago

"Register 0xC0082" is a hardware lock for SPI controller? It seems that the xscom operation to write to the register has failed.

Correct, that is the atomic lock register. The scom failure is how the lock manifests in the hardware. The way the lock works is that if one PIB master has it then all other masters get a scom error if they try to write it. So the issue here is that the SBE has incorrectly held the lock.

It is interesting that you only have to remove the lock call and not the other chipops.

FYI - Some internal testing is happening now with changes to the SBE code to get this path to work. We don't have it completed yet.

IlyaSmirnov91 commented 1 year ago

We will push a fix for this issue in Hostboot code. The fix is to skip attempting to unlock the TPM SPI engine if there is no functional primary TPM on the system. I will update this issue once the fix goes in.

IlyaSmirnov91 commented 1 year ago

@lili-lilili , the fix for this issue went into https://github.com/open-power/op-build/pull/5712

IlyaSmirnov91 commented 1 year ago

Or this commit: https://github.com/open-power/hostboot/commit/db85edd4dabf071e5131369e1f6d24b54406cb0a

lili-lilili commented 1 year ago

@lili-lilili , the fix for this issue went into open-power/op-build#5712

I just came back from the National Day holiday. I will test this commit later.

Can you provide more information on this commit? As Dan said, SBE has incorrectly held the lock; But, i can't find out the src code that SBE try to held the lock.

lili-lilili commented 1 year ago

@lili-lilili , the fix for this issue went into open-power/op-build#5712

I just came back from the National Day holiday. I will test this commit later.

Can you provide more information on this commit? As Dan said, SBE has incorrectly held the lock; But, i can't find out the src code that SBE try to held the lock.

Let me confirm which modify fix the issue? Hostboot? or Hostboot binaries?

lili-lilili commented 1 year ago

Or this commit: db85edd

This commit test pass.

akhis023 commented 12 months ago

SBE has incorrectly held the lock; But, I can't find the src code that SBE tried to hold the lock. SBE had not held the lock. The lock was released properly by SBE.

Below is the explanation of why HB saw a PIB RC while reading the SPI TPM lock: 1) SBE started reading device ID and this operation never got to completion due to TPM non-availability 2) The above operation keeps SPI FSMs indefinitely busy 3) SBE sets TDP, which means Security is enabled for the SPI controller to not allow unsafe access 4) Unsafe access includes writing to Lock/Counter_config1 register while SPI FSMs are busy, which results in PIB resource busy RC (also reflected in p00 status register 0x0004020200000020 )

akhis023 commented 12 months ago

The SBE code that's causing the above issues: https://github.com/open-power/sbe/blob/master-p10/src/sbefw/measurement/sbemthreadroutine.C#L82

If this code is not called in the SBE Measurement flow, the HB issue will not be seen.