open-power / op-build

Buildroot overlay for Open Power
GNU General Public License v2.0

ECC error corruption results in system shutdown after running full suite from op-test on P8 habanero #2161

Open pridhiviraj opened 6 years ago

pridhiviraj commented 6 years ago
OpTestSystem TRANSITIONED TO: 2
Ubuntu 16.04 . . . .
[ 1262.171350979,5] OPAL: Shutdown request type 0x0...
[ 1154.680029] reboot: Power down
  1.02292|ECC error in PNOR flash in section offset 0x00031000

  1.02453|System shutting down with error status 0x60F
  1.03057|System shutting down with error status 0x90FF0001
  1.49257|ECC error in PNOR flash in section offset 0x00031000

  1.49719|System shutting down with error status 0x60F
  1.50022|System shutting down with error status 0x90FF0002
pridhiviraj commented 6 years ago

Even after a PNOR re-provision, the system still fails to boot with:


  1.73616|System shutting down with error status 0x60F
  1.73719|System shutting down with error status 0x90FF0001
  1.49203|ECC error in PNOR flash in section offset 0x00031000

  1.49665|System shutting down with error status 0x60F
  1.49768|System shutting down with error status 0x90FF0002

PNOR level:

 Product Name          : OpenPOWER Firmware
 Product Version       : open-power-habanero-v2.0-33-gb536a49
 Product Extra         :    buildroot-2018.02.2-7-gcb36c6d
 Product Extra         :    skiboot-v6.0.1-27-g34e9c3c1edb3
 Product Extra         :    hostboot-p8-d3025f5-p145344c
 Product Extra         :    occ-p8-28f2cec
 Product Extra         :    linux-4.16.13-openpower1-pa8348b9
 Product Extra         :    petitboot-1.8.0
 Product Extra         :    machine-xml-6a784
dcrowell77 commented 6 years ago

1.49203|ECC error in PNOR flash in section offset 0x00031000

If you look at the PNOR xml file you can see which section is corrupted. It appears to be the HBD section. https://github.com/open-power/pnor/blob/master/p8Layouts/defaultPnorLayoutWithGoldenSide.xml#L98
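The lookup described above can be sketched in a few lines of Python. This is only a rough illustration: it assumes the offset in the error message is a physical flash offset, that each `<section>` in the layout file carries `eyeCatch`, `physicalOffset` and `physicalRegionSize` elements, and the embedded `SAMPLE_LAYOUT` uses illustrative offsets and sizes rather than values copied from the real layout file. `find_section` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of a PNOR layout file.
# Offsets and sizes are illustrative; check them against the real XML.
SAMPLE_LAYOUT = """
<pnor>
  <section>
    <eyeCatch>HBEL</eyeCatch>
    <physicalOffset>0x8000</physicalOffset>
    <physicalRegionSize>0x24000</physicalRegionSize>
  </section>
  <section>
    <eyeCatch>GUARD</eyeCatch>
    <physicalOffset>0x2C000</physicalOffset>
    <physicalRegionSize>0x5000</physicalRegionSize>
  </section>
  <section>
    <eyeCatch>HBD</eyeCatch>
    <physicalOffset>0x31000</physicalOffset>
    <physicalRegionSize>0x120000</physicalRegionSize>
  </section>
</pnor>
"""


def find_section(layout_xml, offset):
    """Return the eyeCatch of the section containing `offset`, or None."""
    root = ET.fromstring(layout_xml)
    for section in root.iter("section"):
        start = int(section.findtext("physicalOffset"), 16)
        size = int(section.findtext("physicalRegionSize"), 16)
        if start <= offset < start + size:
            return section.findtext("eyeCatch")
    return None


if __name__ == "__main__":
    # Offset taken from the error message in this issue.
    print(find_section(SAMPLE_LAYOUT, 0x00031000))
```

To run it against a real layout, read the XML file from a checkout of the pnor repository and pass its contents in place of `SAMPLE_LAYOUT`.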

What happens in the "full suite from op-test"? My guess is that power is getting pulled out from underneath us while we're writing an attribute out to PNOR. That is simply an unsupported scenario. The only recovery is a full code update.

"After PNOR re-provision" - Can you be specific about what this means? I think I know, but I don't want to assume. If it is the operation I suspect, it will not fix the issue because it doesn't rewrite any of the code partitions, which this partition is. A full code update should make the problem go away.

pridhiviraj commented 6 years ago

@dcrowell77 Okay. There is no such scenario in the full suite; all the tests are pretty straightforward, and all of them are OPAL/Linux tests. For the list of tests in the full suite, have a look at https://github.com/open-power/op-test-framework/blob/master/op-test . As for the PNOR re-provision: the BMC erases only the partitions that are marked as reprovisionable, and HBD is not marked as reprovisionable. That explains why the PNOR re-provision didn't help. I am assuming a full code update will recover it, but I want to understand why this corruption is happening.

dcrowell77 commented 6 years ago

Have you ever been able to run these tests successfully? If so, what has changed since?

There really isn't much to go on here. At the very least I'd like to see the complete SOL output for the tests prior to this failure, as that is likely when the problem happened. Also, any eSELs would be good.
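When going through a long SOL capture, a small script can help locate the first ECC error and the shutdown status codes around it. The following is a minimal sketch, assuming the capture is plain text with the message formats seen in this issue; `scan_sol` and the regular expressions are hypothetical helpers, not part of op-test.

```python
import re

# Patterns based on the message text quoted in this issue.
ECC_RE = re.compile(
    r"ECC error in PNOR flash in section offset (0x[0-9A-Fa-f]+)")
STATUS_RE = re.compile(
    r"System shutting down with error status (0x[0-9A-Fa-f]+)")


def scan_sol(text):
    """Return (line_number, kind, value) tuples for ECC/shutdown messages."""
    events = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = ECC_RE.search(line)
        if match:
            events.append((lineno, "ecc", int(match.group(1), 16)))
            continue
        match = STATUS_RE.search(line)
        if match:
            events.append((lineno, "shutdown", int(match.group(1), 16)))
    return events


if __name__ == "__main__":
    sample = (
        "  1.02292|ECC error in PNOR flash in section offset 0x00031000\n"
        "\n"
        "  1.02453|System shutting down with error status 0x60F\n"
        "  1.03057|System shutting down with error status 0x90FF0001\n"
    )
    for lineno, kind, value in scan_sol(sample):
        print(f"line {lineno}: {kind} {value:#x}")
```

The line number of the earliest `ecc` event gives a starting point for reading backwards through the output of the preceding tests.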

I can't think of any other situation where we'd end up with an ECC error in this partition besides being interrupted in the middle of a write. Note that this write could occur during boot or at runtime (though I can't think of an attribute we'd actually write at runtime...). The only other thought I have is that the BMC corrupted something, either as we wrote it or on the readback.