open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

i2c error leads to unbootable machine #71

Open ghost opened 7 years ago

ghost commented 7 years ago

On Habanero (with TPM and workaround), using this build:

FRU Device Description : System Firmware (ID 43)
 Product Name          : OpenPOWER Firmware
 Product Version       : open-power-habanero-eaf699f-dirty
 Product Extra         :        buildroot-b8e3874
 Product Extra         :        skiboot-5.4.0-opdirty
 Product Extra         :        hostboot-94aeacf-opdirty
 Product Extra         :        linux-4.4.30-openpower1-opdirty-ac76873
 Product Extra         :        petitboot-v1.3.1-opdirty-853cc3d
 Product Extra         :        habanero-xml-5565b8f-opdirty-23971ac
 Product Extra         :        oc

After running the OpalDrivers.test_i2c_driver test from op-test-framework (./ci/source/op_opal_fvt.py OpalDrivers.test_i2c_driver), which hits some awful bugs and kills the machine, the machine fails to boot. It never switches to the golden side either. I've done a BMC cold reset and powered everything off and back on again.

The boot log is: hab4-boot.txt

The road to recovery is to erase the HBEL partition.

The HBEL partition is attached as hbel-not-really-png (not really a PNG, but that was literally the only way I could beat GitHub into submission).

dcrowell77 commented 7 years ago

I'm not familiar with that test, how does it inject the i2c errors? Are they being injected constantly as we boot?

The system not getting flipped to the golden side would be a BMC issue. There is nothing Hostboot can do about that.

It seems like we have an ECC error in the HBEL partition, but a quick skim didn't turn up anything. I'll have to rig up a real de-ECC tool (the one I have doesn't tell you where the error is...) to see what is going on. My first guess is that we did an erase but got interrupted before we could rewrite the partition with valid ECC.
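One cheap way to test the "erased but never rewritten" guess, before a full de-ECC tool exists, is to look for regions of the dump that are still all 0xFF. The sketch below does only that. It assumes the partition is laid out as 9-byte groups (8 data bytes followed by 1 ECC byte); it does not implement the ECC code itself, so it cannot say whether a given group is a real ECC error. The function names are made up for illustration.

```python
from typing import List

# Assumption: each ECC-protected word is stored as 8 data bytes + 1 ECC byte.
ECC_GROUP = 9


def find_erased_groups(image: bytes, group_size: int = ECC_GROUP) -> List[int]:
    """Return the byte offsets of groups that are entirely 0xFF (still erased)."""
    erased = []
    blank = b"\xff" * group_size
    # Walk the image one whole group at a time; a trailing partial group is ignored.
    for offset in range(0, len(image) - group_size + 1, group_size):
        if image[offset:offset + group_size] == blank:
            erased.append(offset)
    return erased


def summarize(image: bytes) -> str:
    """One-line summary of how much of the image still looks erased."""
    offsets = find_erased_groups(image)
    total = len(image) // ECC_GROUP
    if not offsets:
        return f"0 of {total} groups erased"
    return f"{len(offsets)} of {total} groups erased, first at offset {offsets[0]:#x}"
```

To use it, read the raw partition dump into a `bytes` object and pass it to `summarize`. A long run of erased groups partway through the partition would be consistent with an interrupted erase/rewrite cycle, though it would not prove it.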

ghost commented 7 years ago

Dan notifications@github.com writes:

> I'm not familiar with that test, how does it inject the i2c errors? Are they being injected constantly as we boot?

It runs i2cdetect, which probes the TPM and hits a TPM erratum where the TPM goes off into the weeds and hogs the i2c bus.

To work around this, I've put in a quirk for the TPM in skiboot where it'll not send certain I2C commands to the TPM and instead just fake a response.

At this point, you really do have to power cycle the box to fix it.

The downside is the Hostboot error messages in this case are... not clear at all.
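To make the shape of that skiboot quirk concrete, here is a minimal sketch of the idea in Python rather than skiboot's C. It is not the actual skiboot code: the device address, the set of blocked operations, and the form of the faked response are all hypothetical, since none of them are spelled out in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Hypothetical values for illustration only; the real quirk's address and
# blocked command list are not given in this thread.
QUIRKY_ADDR = 0x20
BLOCKED_OPS: Set[str] = {"quick_write", "probe_read"}


@dataclass
class I2CRequest:
    addr: int        # 7-bit device address
    op: str          # symbolic operation name
    length: int = 0  # number of bytes requested


@dataclass
class I2CResponse:
    ok: bool
    data: bytes = b""
    faked: bool = False


def submit(req: I2CRequest,
           bus_xfer: Callable[[I2CRequest], I2CResponse]) -> I2CResponse:
    """Forward a request to the bus unless it matches the quirk."""
    if req.addr == QUIRKY_ADDR and req.op in BLOCKED_OPS:
        # Never touch the bus for this device/operation pair. The zero-filled
        # "success" reply is an assumption about what a faked response is.
        return I2CResponse(ok=True, data=bytes(req.length), faked=True)
    return bus_xfer(req)
```

The point of filtering at submit time is that the problematic command never reaches the wire, so the device is never put into the state that hangs the bus. Everything else, including other operations to the same address, still goes through the real transfer function.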

-- Stewart Smith OPAL Architect, IBM.

bofferdn commented 6 years ago

Currently there is no Habanero shipping configuration that supports TPM. TPMs are present in some machines because of the P8 secure boot development effort, but in theory they are not supported. P9 B&S systems were supported with TPM, but to my knowledge the RPQ for that never shipped to a customer.

Since P8, we launched into P9 and found a host of issues with the TPM. We had to put many fixes into both the Hostboot and PHYP firmware stacks to deal with this part. None of those fixes have been backported to P8, and they would likely not help in a situation where a testcase is intentionally injecting the errata conditions.

In P9 there is an effort underway to call out all the hardware on a hung bus for replacement. In this situation on P9, the TPM would be called out as a part to be serviced. There is not much else we can do beyond this. I'll get with Stewart to understand exactly what an acceptable solution would be in this case, and weigh whether a P8 solution is worth the time and effort.