open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

Why do different L2 caches react differently to the same data in the "L2 Error Injection Register" after error injection? #236

Closed. Theo0208 closed this issue 9 months ago

Theo0208 commented 9 months ago

Writing the same data (A9E0000000000000) to the L2 Error Injection Registers of different L2 caches produces different responses. For some L2 Error Injection Registers (such as "2002800C" on proc0), the corresponding core is deconfigured after error injection, and bit 7 of the corresponding EQ_L2_FIR is 1, which represents "L2 directory read UE".

But for other L2 Error Injection Registers (such as "2602100C" on proc3), the corresponding processor is deconfigured after error injection, and bits 7 and 20 of the corresponding EQ_L2_FIR are both 1. Bit 20 represents "RC incoming Power Bus data had a UE error", and this error resulted in the deconfiguration of the processor.

I want to know why different L2 caches react differently to the same data in the "L2 Error Injection Register" after error injection, and why the above problem occurs, thanks.
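For readers comparing the two cases, here is a minimal sketch of how the FIR bits mentioned above could be checked. It assumes a 64-bit register and IBM-style numbering, where bit 0 is the most significant bit. The two register readings are hypothetical values built for illustration, not readings from the machine in question, and only the two bit labels quoted in this thread are included.

```python
# Bit labels as quoted in this thread; all other FIR bits are left unnamed.
L2_FIR_LABELS = {
    7: "L2 directory read UE",
    20: "RC incoming Power Bus data had a UE error",
}


def fir_bit_is_set(value, bit, width=64):
    """Return True if `bit` is set, counting bit 0 as the most significant bit."""
    return ((value >> (width - 1 - bit)) & 1) == 1


def describe_fir(value):
    """List the labelled FIR bits that are set in `value`."""
    return [
        f"bit {bit}: {name}"
        for bit, name in sorted(L2_FIR_LABELS.items())
        if fir_bit_is_set(value, bit)
    ]


# Hypothetical readings (assumption): only bit 7 set, then bits 7 and 20 set.
core_case = 1 << (63 - 7)
chip_case = core_case | (1 << (63 - 20))

for reading in (core_case, chip_case):
    print(f"{reading:016X}", describe_fir(reading))
```

The only design point is the `width - 1 - bit` shift, which converts the MSB-first bit number into an ordinary right shift.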

dcrowell77 commented 9 months ago

The question is really outside the scope of the Hostboot firmware so I may not be able to answer all the questions.

At what point in the IPL are you doing this injection?

Theo0208 commented 9 months ago

Thank you, dcrowell77. The error injection is executed after the IPL has completed, while the machine is running the OS.

dcrowell77 commented 9 months ago

I wouldn't have expected any L2 error injection to result in a processor chip callout but that is all a knowledge domain I know little about. I'm asking around to see if the behavior makes much sense. One thing to note is that the injection doesn't actually cause an error directly. It basically sabotages the logic to make the next (normally) triggered operation fail. That means that you could get different behavior based on exactly which operation happens to hit the cache next. Maybe sometimes it is a local core and other times it is a remote chip on the SMP. I could see how that might change the behavior.
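The point about ordering can be illustrated with a toy model. This is only a sketch of the idea that an armed injection fires on whichever matching operation arrives next. It is not a model of the real L2 logic, and the operation names (`directory_read`, `powerbus_data_in`, and so on) are hypothetical labels chosen for the example.

```python
import random


def first_detected_error(armed, operations):
    """Return the first operation that trips an armed injection, or None."""
    for op in operations:
        if op in armed:
            return op
    return None


# Hypothetical operation classes (assumption): two of them have an injection armed.
armed = {"directory_read", "powerbus_data_in"}
traffic = ["directory_read", "powerbus_data_in", "store", "load_hit"]

rng = random.Random(0)  # fixed seed so the run is repeatable
for trial in range(4):
    order = traffic[:]
    rng.shuffle(order)  # a different arrival order stands in for different cache traffic
    print(trial, first_detected_error(armed, order))
```

With several injections armed at once, the reported error depends only on the arrival order, which is the kind of variation described above.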

dcrowell77 commented 9 months ago

I conferred with the hardware team and their answer is basically what I said.

It looks like it is doing dw_ue_next, cw_ue_next, stq_pe_every, and cpi_ce_next into the hardware error injection register. This will explain the differences in the reactions on each core. It's basically whichever error gets injected and then detected first, which will be different depending on what's happening at the time on the core. I suggest you only inject one type of error at a time, not all at the same time.
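As a quick check of how much is being armed at once, the sketch below lists which bits are set in the written value A9E0000000000000. It assumes a 64-bit register with IBM-style numbering (bit 0 is the most significant bit). It does not map bits to the injection names above, since that mapping is not given in this thread, and some of the bits may belong to multi-bit fields.

```python
def set_bits_msb0(value, width=64):
    """Return the positions of set bits, counting bit 0 as the most significant bit."""
    return [bit for bit in range(width) if (value >> (width - 1 - bit)) & 1]


injected = 0xA9E0000000000000
print(set_bits_msb0(injected))  # → [0, 2, 4, 7, 8, 9, 10]
```

Seven bits are set in a single write, which is consistent with more than one injection type being enabled together.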

Theo0208 commented 9 months ago

Okay, thanks a lot.