open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

In the p9 error injection trial, it took 20 minutes to generate a system checkstop #223

Open Amay258 opened 1 year ago

Amay258 commented 1 year ago

In the p9 error injection trial, there is an intermittent problem. When executing an error injection instruction such as 'putscom - c 0x8 0x07010A0D 0x0000000003AF0000', it sometimes took 20 minutes to generate a system checkstop and successfully deconfigure the DIMM that had the error injected.

A preliminary analysis shows that after the input control register (000000 07010A0D) was written, the value of the corresponding fault isolation register (FIR, 000000 07010A00) did not change, so no checkstop occurred.

Directly writing to the corresponding FIR register can trigger a checkstop and successfully deconfigure the DIMM.

May I ask why the FIR value only changed after more than 20 minutes?

dcrowell77 commented 1 year ago

What interface are you using for the putscom? I don't recognize the syntax above.

My understanding of the way 0x07010A0D works is that it places errors into the hardware, but those errors are not surfaced until memory behind that memory controller is actually accessed. Therefore, unless you are explicitly forcing all of mainstore to be accessed (e.g. by running an exerciser of some kind), the results will be non-deterministic.
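To illustrate what "forcing memory to be accessed" could look like, here is a minimal sketch of a memory exerciser in Python. It is an illustration only, not a hostboot or lab tool: `touch_memory` is a hypothetical helper, the 4096-byte page size is an assumption, and nothing here can target the specific DIMM that was armed. Reads may also be served from cache instead of DRAM unless the buffer is much larger than the cache.

```python
def touch_memory(size_bytes, page_size=4096, pattern=0xA5):
    """Write and read back one byte in every page of a freshly allocated buffer.

    Hypothetical helper: page_size and pattern are assumed values.
    Returns (pages_touched, mismatches).
    """
    buf = bytearray(size_bytes)  # zero-filled allocation

    pages_touched = 0
    # Write pass: storing to each page forces it to be backed by real memory.
    for offset in range(0, size_bytes, page_size):
        buf[offset] = pattern
        pages_touched += 1

    mismatches = 0
    # Read pass: load each page again and compare against the pattern.
    for offset in range(0, size_bytes, page_size):
        if buf[offset] != pattern:
            mismatches += 1

    return pages_touched, mismatches


if __name__ == "__main__":
    pages, bad = touch_memory(64 * 1024 * 1024)  # 64 MiB, an arbitrary size
    print(f"touched {pages} pages, {bad} mismatches")
```

The larger the share of installed memory such a loop covers, the sooner a read is likely to land behind the memory controller that holds the injected error.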

I also think you might be missing some bits that have to be set to control the injection.

Bits 0:36 : EICR_ADDRESS: Error is injected when read address matches the EICR address, up to fields masked by the EICR region.
0 = dimm select
1:2 = mrank(0:1)
3:5 = srank(0:2)
6:7 = bank_group(0:1)
8:10 = bank(0:2)
11:28 = row(0:17)
29:36 = col(2:9)

Without those bits set there will never be a match to trigger the inject.
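The EICR_ADDRESS field layout listed in this comment can be sketched as a small pack/unpack helper. This is a sketch under stated assumptions: the function names are hypothetical, bit 0 is taken to be the most significant bit of a 64-bit register (IBM bit numbering), and bits 37:63 are not described in this thread, so they are left untouched. The `col` field here is the 8-bit value holding column bits 2:9.

```python
# (name, first_bit, last_bit) for EICR_ADDRESS, bits 0:36.
# Assumption: IBM bit numbering, bit 0 = most significant bit of 64.
EICR_ADDRESS_FIELDS = [
    ("dimm", 0, 0),
    ("mrank", 1, 2),
    ("srank", 3, 5),
    ("bank_group", 6, 7),
    ("bank", 8, 10),
    ("row", 11, 28),
    ("col", 29, 36),
]


def pack_eicr_address(**values):
    """Build the bits 0:36 portion of the register from named field values."""
    unknown = set(values) - {name for name, _, _ in EICR_ADDRESS_FIELDS}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    reg = 0
    for name, first, last in EICR_ADDRESS_FIELDS:
        width = last - first + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        # The field's last bit sits (63 - last) positions above the LSB.
        reg |= value << (63 - last)
    return reg


def unpack_eicr_address(reg):
    """Extract the named fields of bits 0:36 from a 64-bit register value."""
    fields = {}
    for name, first, last in EICR_ADDRESS_FIELDS:
        width = last - first + 1
        fields[name] = (reg >> (63 - last)) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    # The data value from the original report.
    print(unpack_eicr_address(0x0000000003AF0000))
    print(hex(pack_eicr_address(dimm=1, row=0x1234, col=0x10)))
```

Under these assumptions, decoding the value from the original report (0x0000000003AF0000) gives zero for every address field, since all of its set bits fall outside bits 0:36.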

Amay258 commented 1 year ago

'Putscom - c 0x0 0x07010A0D 0x00000000003AF0000' is the error injection instruction for CPU0_C0D0.
'Putscom - c 0x8 0x07010A0D 0x00000000003AF0000' is the error injection instruction for CPU1_C0D0.

Normally, executing the error injection instruction triggers a checkstop immediately and the injection succeeds. In the current situation, however, after executing the error injection command it sometimes takes 20 minutes for the checkstop to trigger. The injection still succeeds, but why do we need to wait 20 minutes?

dcrowell77 commented 1 year ago

What do you mean by "the normal situation"? Have you seen other behavior with this specific injection? I still am under the belief that it won't fail until the memory is physically accessed, which is non-deterministic.