1030 ips: SBE 1 should be automatically switched if the SBE 0 is broken

lxwinspur commented 1 year ago

The current logic is: if the BMC reboot fails three times, it automatically switches to SBE 1 (this logic assumes that SBE 0 is broken).

In practice, we ran into the following: when the BMC executes host power on, SBE 0 is found to be broken. The expected behavior is that the BMC automatically restarts and retries three times, and if all attempts fail, it automatically switches to SBE 1. Instead, when the first power on fails, the BMC gets stuck after the SBE 0 startup failure and is not automatically restarted. The BMC reboot is never executed, so the automatic switch to SBE 1 never happens.

Is this a problem?
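For reference, the expected flow described above can be sketched as a small shell function. This is only an illustration of the policy, not the actual phosphor-state-manager code: `try_boot`, `boot_with_fallback`, and `MAX_ATTEMPTS` are hypothetical names, the stub makes side 0 always fail, and trying side 1 only once after the switch is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of the retry-then-switch policy from this issue.
# Nothing here touches real hardware; try_boot is a stand-in.

MAX_ATTEMPTS=3   # default host reboot counter mentioned in this thread

# try_boot SIDE: placeholder for a real boot attempt from an SBE side.
# In this stub, side 0 always fails and side 1 always succeeds.
try_boot() {
    [ "$1" -eq 1 ]
}

# boot_with_fallback: retry SBE side 0 up to MAX_ATTEMPTS times, then
# switch to side 1. Prints the side that booted; non-zero if none did.
boot_with_fallback() {
    attempts_left=$MAX_ATTEMPTS
    while [ "$attempts_left" -gt 0 ]; do
        if try_boot 0; then
            echo 0
            return 0
        fi
        attempts_left=$((attempts_left - 1))
    done
    # All attempts on side 0 failed: treat SBE 0 as broken and switch.
    if try_boot 1; then
        echo 1
        return 0
    fi
    return 1
}
```

With the stub above, `boot_with_fallback` prints `1`. The problem reported here is that the system never gets as far as the second attempt, so the counter never reaches zero.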

lxwinspur commented 1 year ago

@mzipse @geissonator @ojayanth FYI

ojayanth commented 1 year ago

Auto reboot is driven by the reboot policy: AutoReboot must be true for an auto reboot to be initiated during the boot window.

root@xxxx:~# busctl get-property `mapper get-service /xyz/openbmc_project/control/host0/auto_reboot` /xyz/openbmc_project/control/host0/auto_reboot xyz.openbmc_project.Control.Boot.RebootPolicy AutoReboot
b true

You also need to look at the host reboot counter value; by default it is three. @geissonator can comment on how this behaves. Upstream has support for updating this via the Redfish API in case the value is not set correctly.
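A rough sketch of how the two values above could be checked together. `busctl_value` and `can_auto_reboot` are hypothetical helpers, and the rule "policy is true and attempts are left" is a simplification of what is described here. The object path and interface for the counter in the comments are an assumption (they are not given in this thread), so verify them on your system.

```shell
#!/bin/sh
# busctl_value LINE: strip the leading D-Bus type signature from
# `busctl get-property` output, e.g. "b true" -> "true", "u 3" -> "3".
busctl_value() {
    echo "${1#* }"
}

# can_auto_reboot POLICY_LINE ATTEMPTS_LINE: succeed only if the policy
# is true and at least one reboot attempt is left (simplified rule).
can_auto_reboot() {
    policy=$(busctl_value "$1")
    attempts=$(busctl_value "$2")
    [ "$policy" = "true" ] && [ "$attempts" -gt 0 ]
}

# Example use on a BMC (not executed here):
#   obj=/xyz/openbmc_project/control/host0/auto_reboot
#   policy=$(busctl get-property "$(mapper get-service $obj)" $obj \
#       xyz.openbmc_project.Control.Boot.RebootPolicy AutoReboot)
#   # Assumed location of the counter; check it on your build:
#   host=/xyz/openbmc_project/state/host0
#   attempts=$(busctl get-property "$(mapper get-service $host)" $host \
#       xyz.openbmc_project.Control.Boot.RebootAttempts AttemptsLeft)
#   can_auto_reboot "$policy" "$attempts" && echo "auto reboot possible"
```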

lxwinspur commented 1 year ago

Auto reboot is driven by the reboot policy: AutoReboot must be true for an auto reboot to be initiated during the boot window.

root@xxxx:~# busctl get-property `mapper get-service /xyz/openbmc_project/control/host0/auto_reboot` /xyz/openbmc_project/control/host0/auto_reboot xyz.openbmc_project.Control.Boot.RebootPolicy AutoReboot
b true

Yes, I enabled auto_reboot and this problem still exists.

You also need to look at the host reboot counter value; by default it is three. @geissonator can comment on how this behaves. Upstream has support for updating this via the Redfish API in case the value is not set correctly.

geissonator commented 1 year ago

Please provide a bmc dump, or at least a journal so we can see what's going on. Reboot policy is only utilized if we get far enough into the boot.

lxwinspur commented 1 year ago

Please provide a bmc dump, or at least a journal so we can see what's going on. Reboot policy is only utilized if we get far enough into the boot.

Related logs and dump files are at https://github.com/ibm-openbmc/openbmc/issues/263

geissonator commented 1 year ago

@lxwinspur I took a look at the logs, and it appears you aren't testing with the latest 1030.ips code? I put up a fix for the "why do we not switch to sbe side 1" issue via https://github.com/ibm-openbmc/phosphor-state-manager/commit/39d5673d6e8bedd12ac34e5b034d7abd2b939e03, and I verified that the bump is in the latest version of meta-phosphor/recipes-phosphor/state/phosphor-state-manager_git.bb in 1030.ips. However, I don't see the new traces I added for that in the journal data from #263.

lxwinspur commented 1 year ago

@geissonator

it appears you aren't testing with the latest 1030.ips code?

No. For this issue, I am testing on the latest 1030.ips branch (9a5e35fe9c1dbe8f278e728819abbf8e9c1f82ef).

geissonator commented 1 year ago

Hmm, I'm not sure what's going on then @lxwinspur. If you look at my commit in https://github.com/ibm-openbmc/phosphor-state-manager/commit/39d5673d6e8bedd12ac34e5b034d7abd2b939e03 you can see the change I made to the log when that script does a quiesce. Your journal showed the older log (without the "and host crashed"). Please double check your firmware level, and maybe look at that script, host-reboot, on your system to ensure it has the new logic.
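One quick way to do that check is to look for the new "and host crashed" text in the script and in the journal. A minimal sketch follows; `has_new_quiesce_trace` is a hypothetical helper, and the script path in the comments is an assumption about where host-reboot is installed, so adjust it for your image.

```shell
#!/bin/sh
# has_new_quiesce_trace FILE: succeed if FILE contains the
# "and host crashed" text that the fix added to the quiesce log.
has_new_quiesce_trace() {
    grep -q "and host crashed" "$1"
}

# Example use on a BMC (not executed here; the path is an assumption):
#   script=/usr/libexec/phosphor-state-manager/host-reboot
#   if has_new_quiesce_trace "$script"; then
#       echo "host-reboot has the new logic"
#   else
#       echo "host-reboot looks like the older version"
#   fi
#
# The same text can be searched for in the journal:
#   journalctl | grep "and host crashed"
```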

lxwinspur commented 1 year ago

@sampmisr FYI

lxwinspur commented 1 year ago

After updating and applying the following fix, the problem is solved:

https://github.com/ibm-openbmc/openbmc/commit/2a0c1837053f01c748d838b72185073dd75baf07

1030 ips: SBE 1 should be automatically switched if the SBE 0 is broken #265