open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

HTMGT failed to reload OCC image during runtime for P10. #242

Closed. Theo0208 closed this issue 3 months ago

Theo0208 commented 4 months ago

When OCC encounters an error during runtime, it sends an interrupt to the host to notify HTMGT to handle the error. HTMGT restarts OCC while handling the error, and the OCC image needs to be reloaded before the restart. According to the description in the code, there are three sources of the OCC image:

  1. Inside HBRT reserved memory
  2. Inside the HBRT code load itself
  3. Fetched from the service processor

However, the OCC image cannot be found through any of these three sources. I want to know why, and whether P10 still supports reloading the OCC image during runtime. Thanks a lot.

dcrowell77 commented 4 months ago

OCC reload is definitely supported in P10. The data will come from a lid on the BMC (like all of the images we use). I suspect you are hitting some kind of opal-specific limitation since it is working for the phyp-based systems we use. Does your opal implementation support the mctp interfaces that hbrt relies on? If you show me the hbrt trace log I can probably make a decent guess at the issue.

Theo0208 commented 4 months ago

Thank you for your reply. In the function "loadLid()" of the P10 code, there are three branches to reload OCC, as shown below:

    if(iv_isLidInHbResvMem) {... ...}
    else if(iv_isLidInVFS) {... ...}
    else if( g_hostInterfaces->lid_load ) {... ...}

I would like to confirm which branch OPAL should choose to reload OCC.
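The branch order quoted in this comment can be sketched as a small selection function. This is a minimal illustration, not the hostboot implementation: the `LidSourceState` struct, the `LidSource` enum, and `pickLidSource` are hypothetical names, and only the three flag checks and their order come from the snippet above.

```cpp
#include <cassert>

// Hypothetical stand-in for the state loadLid() consults. The field names
// echo the flags quoted in the thread; the struct itself is illustrative.
struct LidSourceState
{
    bool isLidInHbResvMem; // lid was preloaded into HB reserved memory
    bool isLidInVFS;       // lid is bundled in the HBRT code load
    bool hostHasLidLoad;   // host supplied a non-null lid_load callback
};

enum class LidSource { HbResvMem, Vfs, HostLidLoad, None };

// Same if / else-if order as the quoted code: the first available source wins.
LidSource pickLidSource(const LidSourceState& s)
{
    if (s.isLidInHbResvMem) return LidSource::HbResvMem;
    if (s.isLidInVFS)       return LidSource::Vfs;
    if (s.hostHasLidLoad)   return LidSource::HostLidLoad;
    return LidSource::None; // no source available, so the lid cannot be found
}
```

If none of the three conditions holds on a given platform, the function falls through to `None`, which is consistent with the "image cannot be found" symptom described at the top of the thread.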

dcrowell77 commented 4 months ago

We haven't done any work with Opal in P10 so I can't really say what the correct option is. I'm pretty sure we are using the lid_load() path under Phyp but I'll need to confirm that tomorrow. I have no idea if Opal supports that interface or not, I don't think they did in P9.

dcrowell77 commented 4 months ago

I confirmed we use the lid_load interface to load the OCC lid in the phyp path.

[TB] 00000006651B0137 UTIL:UtilLidMgr::loadLid
[TB] 00000006651B2203 UTIL:UtilLidMgr::loadLid> Calling lid_load(0x81E00430)
[TB] 0000000665495316 UTIL:UtilLidMgr::loadLid> size=847872, ptr=0xffffff0001e00010

Theo0208 commented 4 months ago

Okay, thanks a lot for your help.

Grubby0624 commented 4 months ago

Excuse me, I understand that your commit (Change-Id: I13c558dc243ef4e3ea8658b6cb820d26c637c6a9) is the reason why OPAL cannot load OCC images:

  1. It removes the rt_pnor module from RUNTIME-MODULES: src/makefile ==> - RUNTIME-MODULES+=pnor_rt #~/src/usr/pnor/runtime/
  2. In loadLid(), it removes the method of obtaining the lid through rt_pnor: src/usr/util/runtime/utilidmgr-rt.C ==> - else if (iv_isLidInPnor)

This approach should be how the OCC image was loaded in P9 OPAL. To solve this problem on the OPAL platform, I have two ideas:

    1. Remove this commit from hostboot
    2. Develop a lid_load function for opal-prd

I prefer the first method, but I'm not sure why you made this change or how difficult it would be to add rt_pnor back in.

dcrowell77 commented 4 months ago

I don't think you will be able to just remove that commit. There have been 4 years of P10 development since then. As stated above, I have no idea what OPAL's functionality is. If it supports pldm file i/o then you would be able to fetch the lid via that path. That is the same path Hostboot uses during the IPL.

> I'm not sure why you made this change and the difficulty of adding rt_pnor back in

The openbmc stack COMPLETELY CHANGED from P9->P10. Every interface that we used in P9 to access "pnor" is gone in P10. Everything uses PLDM now. This commit is basically removing all of the legacy support before we added the required changes. OPAL will need to support PLDM to communicate with the BMC in P10.

Grubby0624 commented 4 months ago

Sorry, my previous judgment was mistaken. I checked, and P9 should have cached the OCC image in "hb reserved memory", but P10 deleted the relevant logic in this commit: Ieaaea2a7fcffbca720b69e8ba9079abb0e1a8865. I would like to ask: what are the benefits of "The only thing we need to load and verify now is the payload itself."? In addition, OPAL can already get the lid from the BMC via pldm-fileio. If necessary, I can implement the lid_load function.
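A host-side callback of the kind proposed here might look like the sketch below. Everything in it is an assumption for illustration: the callback shape `int (uint32_t lid, void** buf, size_t* len)` returning 0 on success is how the runtime interface is commonly described and should be checked against the interface header in the tree being built, and `FetchLidFn`, `g_fetchLid`, `host_lid_load`, and `host_lid_unload` are hypothetical names, not opal-prd or hostboot APIs. The transport (for example PLDM file I/O to the BMC) is left behind the fetch hook.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical transport hook: fetch a whole lid from the BMC, for example
// over PLDM file I/O. Returns true on success. Not a real opal-prd API.
using FetchLidFn = bool (*)(uint32_t lidId, std::vector<uint8_t>& out);

static FetchLidFn g_fetchLid = nullptr; // installed by the host at startup

// Assumed callback shape: returns 0 on success and hands back a heap buffer
// plus its length. Verify against the runtime interface header before use.
extern "C" int host_lid_load(uint32_t lid, void** buf, size_t* len)
{
    if (!buf || !len || !g_fetchLid) return -1;

    std::vector<uint8_t> data;
    if (!g_fetchLid(lid, data) || data.empty()) return -1;

    void* mem = std::malloc(data.size());
    if (!mem) return -1;

    std::memcpy(mem, data.data(), data.size());
    *buf = mem;
    *len = data.size();
    return 0;
}

// Matching release hook, assumed to pair with the load callback above.
extern "C" int host_lid_unload(void* buf)
{
    std::free(buf);
    return 0;
}
```

With a callback like this registered, the third branch of loadLid() would have a non-null `lid_load` pointer to call. Whether opal-prd wires it up this way is a design decision for that project.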

dcrowell77 commented 4 months ago

> What are the benefits of "The only thing we need to load and verify now is the payload itself."

That commit in particular was related to a performance improvement. Hostboot uses a very slow per-page file i/o path to load lids. PHYP has the ability to use dma with the BMC to load lids much quicker. So now we just let PHYP load the lids on-demand when they are needed rather than preloading them into memory during the IPL.
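The difference between preloading during the IPL and loading on demand can be illustrated with a small lazy loader. This is a generic sketch, not code from hostboot or PHYP: `LazyLidCache` and its `Loader` callback are hypothetical, and keeping the loaded lid in a cache after first use is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical on-demand loader: nothing is fetched at construction time,
// and each lid is fetched once, the first time it is requested.
class LazyLidCache
{
public:
    using Loader = std::function<std::vector<uint8_t>(uint32_t)>;

    explicit LazyLidCache(Loader loader) : iv_loader(std::move(loader)) {}

    // Returns the lid contents, invoking the loader only on first access.
    const std::vector<uint8_t>& get(uint32_t lidId)
    {
        auto it = iv_cache.find(lidId);
        if (it == iv_cache.end())
        {
            it = iv_cache.emplace(lidId, iv_loader(lidId)).first;
        }
        return it->second;
    }

    // Number of lids fetched so far.
    size_t loadedCount() const { return iv_cache.size(); }

private:
    Loader iv_loader;
    std::map<uint32_t, std::vector<uint8_t>> iv_cache;
};
```

A preloading design would instead call the loader for every lid up front, paying the slow transfer cost during the IPL even for lids that are never used at runtime.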

Theo0208 commented 4 months ago

Excuse me, in the src/usr/runtime/populate_hbruntime.C file of the P10 Hostboot module, the "populate_HbRsvMem" function contains the following code:

    else if(TARGETING::is_sapphire_load())
    {
        l_hbrtPsuAddr = l_prevDataAddr - SBEIO::SbePsu::MAX_HBRT_PSU_OP_SIZE_BYTES;
    }

I would like to know why "SBEIO::SbePsu::MAX_HBRT_PSU_OP_SIZE_BYTES" is not passed through the "ALIGN_X" function for alignment when calculating l_hbrtPsuAddr. Could I pass "SBEIO::SbePsu::MAX_HBRT_PSU_OP_SIZE_BYTES" through "ALIGN_X" before calculating the value of l_hbrtPsuAddr?

dcrowell77 commented 4 months ago

I don't see any reason why you couldn't go through ALIGN_X, it should just be a NOOP since MAX_HBRT_PSU_OP_SIZE_BYTES is explicitly defined as a single page already.
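The no-op behavior can be checked with a small worked example. The sketch assumes a round-up helper of the usual power-of-two form and a 4 KiB page; `alignUp`, `kPageSize`, `kPsuOpSize`, and `psuAddr` are hypothetical names, and hostboot's own ALIGN_X macro and constants may be defined differently.

```cpp
#include <cassert>
#include <cstdint>

// Assumed round-up helper; 'boundary' must be a power of two.
constexpr uint64_t alignUp(uint64_t value, uint64_t boundary)
{
    return (value + (boundary - 1)) & ~(boundary - 1);
}

constexpr uint64_t kPageSize  = 4096;      // assumed page size
constexpr uint64_t kPsuOpSize = kPageSize; // assumed: exactly one page

// A value that is already a multiple of the boundary comes back unchanged.
static_assert(alignUp(kPsuOpSize, kPageSize) == kPsuOpSize, "already aligned");
// Anything past a boundary rounds up to the next one.
static_assert(alignUp(4097, kPageSize) == 8192, "rounds up");

// Hypothetical form of the address calculation with the alignment applied.
constexpr uint64_t psuAddr(uint64_t prevDataAddr)
{
    return prevDataAddr - alignUp(kPsuOpSize, kPageSize);
}
```

The result is unchanged only when the size is already a multiple of the alignment boundary. If the boundary passed in were larger than one page, the aligned size would round up to that boundary and the computed address would move accordingly.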

Theo0208 commented 4 months ago

Okay, thanks.