Open pridhiviraj opened 6 years ago
Even after a PNOR re-provision, system boot is still failing with:
1.73616|System shutting down with error status 0x60F
1.73719|System shutting down with error status 0x90FF0001
1.49203|ECC error in PNOR flash in section offset 0x00031000
1.49665|System shutting down with error status 0x60F
1.49768|System shutting down with error status 0x90FF0002
PNOR level:
Product Name : OpenPOWER Firmware
Product Version : open-power-habanero-v2.0-33-gb536a49
Product Extra : buildroot-2018.02.2-7-gcb36c6d
Product Extra : skiboot-v6.0.1-27-g34e9c3c1edb3
Product Extra : hostboot-p8-d3025f5-p145344c
Product Extra : occ-p8-28f2cec
Product Extra : linux-4.16.13-openpower1-pa8348b9
Product Extra : petitboot-1.8.0
Product Extra : machine-xml-6a784
1.49203|ECC error in PNOR flash in section offset 0x00031000
If you look at the PNOR xml file you can see which section is corrupted. It appears to be the HBD section. https://github.com/open-power/pnor/blob/master/p8Layouts/defaultPnorLayoutWithGoldenSide.xml#L98
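For anyone who wants to repeat that lookup without reading the XML by eye, here is a minimal sketch. It assumes the layout file describes each partition as a `<section>` with `<eyeCatch>`, `<physicalOffset>` and `<physicalRegionSize>` children, and that the offset in the log is an absolute flash offset; both are assumptions to check against the actual file. The sample layout and the `find_section` helper are hypothetical, and the offsets and sizes in the sample are made up for illustration rather than copied from the linked layout.

```python
import xml.etree.ElementTree as ET

# Illustrative layout only: the element names follow the assumed schema,
# and the offsets/sizes are invented for this sketch.
SAMPLE_LAYOUT = """
<pnor>
  <section>
    <eyeCatch>HBB</eyeCatch>
    <physicalOffset>0x10000</physicalOffset>
    <physicalRegionSize>0x20000</physicalRegionSize>
  </section>
  <section>
    <eyeCatch>HBD</eyeCatch>
    <physicalOffset>0x30000</physicalOffset>
    <physicalRegionSize>0x90000</physicalRegionSize>
  </section>
</pnor>
"""


def find_section(layout_xml, offset):
    """Return the eyeCatch of the section whose range contains offset, or None."""
    root = ET.fromstring(layout_xml)
    for section in root.iter("section"):
        start = int(section.findtext("physicalOffset").strip(), 16)
        size = int(section.findtext("physicalRegionSize").strip(), 16)
        # A section covers the half-open range [start, start + size).
        if start <= offset < start + size:
            return section.findtext("eyeCatch").strip()
    return None


# Offset taken from the ECC error line in the report.
print(find_section(SAMPLE_LAYOUT, 0x00031000))
```

To run it against the real layout, read the XML file's contents and pass them in place of `SAMPLE_LAYOUT`.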
What happens in the "full suite from op-test"? My guess is that power is getting pulled out from underneath us while we're writing an attribute out to PNOR. That is simply an unsupported scenario. The only recovery is a full code update.
"After PNOR re-provision" - Can you be specific about what this means? I think I know, but I don't want to assume. If it is the operation I suspect, it will not fix the issue because it doesn't rewrite any of the code partitions, which this partition is. A full code update should make the problem go away.
@dcrowell77 Okay. There is no such scenario in the full suite; all of the tests are pretty straightforward, and all of them are OPAL/Linux tests. For the list of tests in the full suite, have a look at https://github.com/open-power/op-test-framework/blob/master/op-test . As for PNOR re-provision: the BMC will try to erase some partitions which are marked as
Have you ever been able to run these tests successfully? If so, what has changed since?
There really isn't much to go on here. At the very least I'd like to see the complete SOL output for the tests prior to this failure, as that is likely when the problem happened. Also, any eSELs would be good.
I can't think of any other situation where we'd end up with an ECC error in this partition besides being interrupted in the middle of a write. Note that this write could occur during boot or at runtime (though I can't think of an attribute we'd actually write at runtime...). The only other thought I have is that the BMC corrupted something that we wrote, or corrupted it on the readback.
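When going through a long SOL capture to find where the problem first shows up, a small filter can help. The sketch below assumes console lines have the `timestamp|message` shape seen in the report, and only recognizes the two message types quoted there. The `scan_console` helper is hypothetical, not part of op-test or hostboot.

```python
import re

# Assumed line shape, based on the excerpts in the report: "<seconds>|<message>"
LINE_RE = re.compile(r"^\s*(?P<ts>\d+\.\d+)\|(?P<msg>.*)$")
HEX_RE = re.compile(r"0x[0-9A-Fa-f]+")


def scan_console(text):
    """Return (timestamp, kind, value) for ECC-error and shutdown lines."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        msg = m.group("msg")
        if "ECC error" in msg:
            kind = "ecc"
        elif "shutting down with error status" in msg:
            kind = "shutdown"
        else:
            continue
        value = HEX_RE.search(msg)
        if value is None:
            continue  # no hex value on the line; nothing to record
        events.append((float(m.group("ts")), kind, int(value.group(), 16)))
    return events


# Lines copied from the report above.
SAMPLE = """\
1.49203|ECC error in PNOR flash in section offset 0x00031000
1.49665|System shutting down with error status 0x60F
1.49768|System shutting down with error status 0x90FF0002
"""

for ts, kind, value in scan_console(SAMPLE):
    print(f"{ts:.5f} {kind:8s} {value:#x}")
```

Running this over the captures from the earlier tests would show whether the ECC error first appears before the run that was reported.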