open-power / op-build

Buildroot overlay for Open Power
GNU General Public License v2.0

ECC error in PNOR flash in section offset 0x00091000 #5823

Open adeelleo opened 11 months ago

adeelleo commented 11 months ago

Hi Gurus,

I am trying to bring an S822LC (8335-GTB) back to life and use it for AI workloads. The system has 2 x POWER8 10-core processors, 512GB RAM and 4 x Nvidia P100 GPUs.

After a power failure, the machine gets stuck at boot with the error message below:

ECC error in PNOR flash in section offset 0x00091000

System shutting down with error status 0x60F

System shutting down with error status 0x90000A79

Can anyone suggest how to recover from this?

I am willing to compensate anyone who puts in the effort to help me resolve this for their time.

[image: screenshot of the boot failure message]

IlyaSmirnov91 commented 11 months ago

There are very few people around the last couple of weeks of the year, and unfortunately nobody who's familiar with Power 8.

I can give you a couple of things to try though:

1. Re-flash the PNOR image. There is a chance that the ECC error will go away.

2. Guard the PNOR device or replace it.

3. Or re-flash the BMC image if you're running with a BMC.

adeelleo commented 11 months ago

Thanks for the reply.

The suggestions you gave should technically resolve the issue.

Any idea how I would re-flash the PNOR image?

Replacement is not an option, since this part is not readily available and the few replacement options I found cost more than the server itself.

I have downloaded the latest firmware package, which contains the PNOR and BMC firmware. Unfortunately, I cannot access the machine through ipmitool to flash the firmware because I don't remember the machine's IP address. Any idea how I can find the IP address so that I can connect through IPMI?

I tried Wireshark to sniff the IP but was not successful.

Thanks for your time,

IlyaSmirnov91 commented 11 months ago

You could ping the machine if you remember the alias - that will give you its IP.

I found this in our P8 documentation to flash the new images:

ipmitool -H <IP> -z 20000 -I lanplus -U <user> -P <password> hpm upgrade <image> component <0|1|2>

0,1 are BMC images, 2 is the PNOR
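
As a rough sketch of how the command above might be assembled for review before flashing, the helper below only prints the command line and rejects any component other than 0, 1 or 2. The function name `hpm_upgrade_cmd`, the address `192.0.2.10`, the credentials and the image name `pnor.hpm` are all placeholders assumed for illustration; none of them come from the P8 documentation.

```shell
#!/bin/sh
# Dry-run helper: prints the hpm upgrade command from this thread instead of running it.
# hpm_upgrade_cmd is a hypothetical name; every argument value below is a placeholder.
hpm_upgrade_cmd() {
    host=$1 user=$2 password=$3 image=$4 component=$5
    case "$component" in
        0|1|2) ;;   # 0 and 1 are BMC images, 2 is the PNOR
        *) echo "component must be 0, 1 or 2" >&2; return 1 ;;
    esac
    printf 'ipmitool -H %s -z 20000 -I lanplus -U %s -P %s hpm upgrade %s component %s\n' \
        "$host" "$user" "$password" "$image" "$component"
}

# Example: print the PNOR flash command (component 2) so it can be checked first.
hpm_upgrade_cmd 192.0.2.10 myuser mypassword pnor.hpm 2
```

Printing rather than executing keeps a mistyped component number from starting a flash of the wrong image; the printed line can be pasted into a terminal once it looks right.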

adeelleo commented 11 months ago

Thanks.

I have the firmware update instructions and the latest firmware.

But I don't remember the IP address or the host name. So I can't ping or access the machine through IPMI.

Any idea how I can access it in this situation?

Best regards,

Adeel Akram

dcrowell77 commented 10 months ago

Without the BMC's IP address your options are pretty limited. The entire service model is based around the BMC. Note that the BMC should have a completely separate ethernet connection compared to the "system" itself. The PDF at https://public.dhe.ibm.com/systems/power/docs/hw/p8/p8eik_install_8335.pdf has a good diagram in Figure 17. Use the left Ethernet port for the BMC/IPMI interface (as eth0). Use the right Ethernet port for any direct OS usage (as eth1). Once you get BMC access again there are a few things you can try.

Do you see multiple failed boot attempts on each power on? There are multiple sides to the PNOR and a golden side fallback that is supposed to kick in to recover from failures like this.
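
If the BMC address has been forgotten, one thing that might be worth trying is a ping sweep from a machine plugged into the same network as the BMC port. The sketch below is only that: it assumes the BMC sits on the same /24 as the scanning machine and answers ICMP, neither of which is guaranteed (a BMC with a static address on a different subnet will stay silent). The function names and the `192.168.1` prefix are placeholders, and `-W 1` is the one-second reply timeout of Linux iputils `ping`.

```shell
#!/bin/sh
# Sketch: list every host address in a /24, then ping each one once.
# subnet_hosts and sweep are hypothetical helper names, not part of any tool.
subnet_hosts() {
    prefix=$1                     # first three octets, e.g. 192.168.1
    i=1
    while [ "$i" -le 254 ]; do
        echo "$prefix.$i"
        i=$((i + 1))
    done
}

sweep() {
    for ip in $(subnet_hosts "$1"); do
        # One probe, one-second timeout (Linux iputils ping semantics).
        ping -c 1 -W 1 "$ip" >/dev/null 2>&1 && echo "$ip answers"
    done
}

# Uncomment and replace the prefix with the subnet the BMC port is plugged into:
# sweep 192.168.1
```

Any address that answers and is not a known device is a candidate to try with `ipmitool -I lanplus -H <IP>`.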

adeelleo commented 10 months ago

Thanks for your time.

I am aware of the separate BMC port, and that is the one I am connected to. I know because this port gives the display output on the serial connection with the machine.

The only issue is that I am unable to establish an IPMI connection since I don't remember the IP address or hostname of the machine. I tried sniffing the network connection with Wireshark but wasn't successful in detecting any IP address.

I only see the same boot failure message shown in the screenshot attached to my first message.

Is there a way to manually switch to the golden side of the PNOR image on this machine?

dcrowell77 commented 10 months ago

The BMC is where all of the control is; there are no other external interfaces. If you can't get into the BMC somehow, there isn't much you can do. Have you gone through all of the service documents at the page I posted? There might be some other way of getting into the BMC. I'm pretty sure there is a raw serial port somewhere that you can use for BMC (vs. host) access.