ibm-openbmc / openbmc


1030.ips: SMS: Host cannot boot to PHYP after power on #277

Closed lxwinspur closed 1 year ago

lxwinspur commented 1 year ago

Problem Description

With the Hostfw built by IPS, the machine cannot boot to PHYP, and the Host console stays at C7004091.


Host Console:

image

Also, an error is displayed when collecting a system dump and a BMC dump:

If you don't see any dumps, be sure you have the appropriate policies enabled


lxwinspur commented 1 year ago

@Emy-inspur FYI

lxwinspur commented 1 year ago

@mzipse Please ask your host team to take a look at this issue. Thanks!

mzipse commented 1 year ago

@dhruvibm, can you comment on how IPS might debug the dump failures? Perhaps by logging in via the Service Account, and if so, what should they look for?

mzipse commented 1 year ago

To debug the PHYP hang: is the IPS team familiar with using isteps? I believe istep mode is similar to P9. You could stop at the istep just before the hang and then look at what HDAT data is being passed to PHYP.

mzipse commented 1 year ago

Sorry, didn't mean to close this issue.

lxwinspur commented 1 year ago

@dhruvibm The value of hdatSystemVendorName printed before entering PHYP is the combined value of F5 and F6.

lili-lilili commented 1 year ago

@mzipse The problem now is that we cannot build a firmware that boots PHYP successfully, even without the SMS-related modifications. So I hope you can provide a detailed explanation of how to build a firmware from the open source code that can successfully boot PHYP.

edwin-wang commented 1 year ago

@mzipse @dhruvibm From the discussion here, IPS knows how to debug using isteps, but it seems the system hung after control was handed over to PHYP. Could you help confirm whether the value of hdatSystemVendorName is correct if F5 and F6 are spliced together?

lili-lilili commented 1 year ago

@edwin-wang @mzipse @dhruvibm Let's synchronize the information.

  1. We added the SMS modifications to Hostboot and the BMC, and we printed the SMS value in HDAT while Hostboot was executing; it looks good. But when the machine boots to PHYP, the system hangs.
  2. We ran a test: we built op-build 1030 without the SMS modifications and replaced most of the hostboot LIDs (HBB, HBBL, HBEL, HBI, HBICORE_SYMS, HBOTSTINGFILE, HBRT, HBRT_RT, HB_VOLATILE, HBD-4U, HBD_RT-4U, HBD_RT-4U) in the hostfw that IBM sent to IPS. When we boot the machine with this hostfw, the system hangs while PHYP is booting.
  3. We are not familiar with executing a single istep from the BMC, but this does not prevent us from debugging.

I don't think the SMS modification is necessarily what caused the problem. More likely, the method we used to build hostfw is incorrect.

dcrowell77 commented 1 year ago

A true "hang" is rare, so I suspect there is a TI or checkstop happening. Can we get a BMC dump? Or, failing that, at least the peltool output of all visible logs? There should be a log that includes the TI SRC and/or the checkstop reason.

dcrowell77 commented 1 year ago

We noticed A7004714 in the output.

From https://www.ibm.com/docs/en/power8/0000-REF?topic=POWER8_REF/p8eai/A7004714.html

Explanation: Platform LIC has detected a new VPD card.

Response: The new VPD card requires new activation codes. Enter the new activation codes.

This could be preventing PHYP standby. You will need to apply the appropriate license keys on your system.

Emy-inspur commented 1 year ago

@dcrowell77 Thank you for your answer. The event logs and BMC dump we obtained are here; please take a look: https://github.com/Emy-inspur/SMS-Logs.git Also, how can I obtain or generate the appropriate license keys?

mzipse commented 1 year ago

Email sent to Xujin on the procedure for clearing license keys and using IPS activation codes.

Also, per feedback from Uma, you should consider setting the time to aid in future debugging using dumps. And lastly, we noticed some resources have been guarded out. You should consider clearing guard (guard -r).

neslop commented 1 year ago

An A7004714 does NOT necessarily require ANY action. It only means when phyp came up, there was no COD information (activations) found to be stored in the server yet -- at the very WORST, we'd come up with 1 processor and some memory available -- the 4714 is NOT an IPL-blocker.

I'm sure there will be more discussion at the meeting, but likely something else is not satisfied, thus the IPL cannot go from C7004091 to "Standby/Runtime". Absence of COD activations alone will NOT block an IPL from completing.

jaypadath commented 1 year ago

There was a request from Travis from the PHYP team for one HDAT change to enable the System Security Settings flag (bit 2 = 1: Platform security overrides allowed).

Below is the change to be applied for the same:

*** hdatiplparms.C
736 // by a service processor
737 this->iv_hdatIPLParams->iv_sysParms.hdatSysSecuritySetting = 0;
738

---> New two lines to be added
739 // Set the Bit 2 for Platform security overrides
740 this->iv_hdatIPLParams->iv_sysParms.hdatSysSecuritySetting != 0x20000;

lili-lilili commented 1 year ago

Do you mean to add this line?

this->iv_hdatIPLParams->iv_sysParms.hdatSysSecuritySetting |= 0x2000;

If I understand correctly, it seems to have no effect. The result is the same as in previous tests: the host console stops at C7004091, and there is no output on the Hypervisor console.
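
For readers comparing the two statements quoted above, here is a minimal C++ sketch of the difference between `!=` (a comparison whose result is discarded) and `|=` (which actually sets bits). The `SysParms` struct, its 32-bit field width, and the helper names are hypothetical stand-ins, not the real Hostboot definitions. The thread does not settle whether the intended mask is 0x2000 or 0x20000, so the mask is passed in as a parameter.

```cpp
#include <cstdint>

// Hypothetical stand-in for the system-parameters structure. Only the field
// named in the thread is modelled, and its 32-bit width is an assumption.
struct SysParms {
    uint32_t hdatSysSecuritySetting;
};

// Mask for bit `bit` of a `width`-bit field under MSB-0 numbering, where
// bit 0 is the most significant bit. Assumes bit < width <= 32.
constexpr uint32_t msb0Mask(unsigned bit, unsigned width) {
    return uint32_t{1} << (width - 1 - bit);
}

// Mirrors the statement as first posted: `!=` only compares, the boolean
// result is thrown away, and the field keeps its previous value.
inline uint32_t applyCompare(SysParms p) {
    (void)(p.hdatSysSecuritySetting != 0x20000);
    return p.hdatSysSecuritySetting;
}

// Mirrors the follow-up statement: `|=` ORs the mask into the field.
inline uint32_t applyOr(SysParms p, uint32_t mask) {
    p.hdatSysSecuritySetting |= mask;
    return p.hdatSysSecuritySetting;
}
```

Under these assumptions, `applyCompare(SysParms{0})` leaves the field at 0, while `applyOr(SysParms{0}, 0x2000)` returns 0x2000. `msb0Mask(2, 16)` also evaluates to 0x2000, which is one possible reading of "bit 2" if the field is 16 bits wide and numbered from the most significant bit; that reading is an assumption, not something the thread confirms.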

jaypadath commented 1 year ago

I believe IBM team figured out some other issue with the HBRT lids. So doing the above HDAT change makes no sense now. Please ignore my HDAT fix suggestion.

lili-lilili commented 1 year ago

Yes, I got the email. Thank you for your reply anyway.

mzipse commented 1 year ago

@lxwinspur, I think we can close this issue now, correct? With the updated step for handling the LIDs in the Host firmware build process, I think this was resolved.