open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

How can I trigger this function #232

Closed Grubby0624 closed 9 months ago

Grubby0624 commented 10 months ago

https://github.com/open-power/hostboot/blob/7d5ac4bb7b94bdfaed85d71395f266a23a304339/src/usr/diag/prdf/common/plat/mem/prdfMemCeTable.C#L48C1-L48C1

Hello, I am currently studying the memory error handling code and have some basic questions. As I understand it, this function adds a CE entry to the MemCeTable, and when CEs accumulate to a certain level, a UE is triggered. May I ask:

  1. When CEs reach the critical value, a UE is triggered, right?
  2. How can I trigger this function at runtime?
  3. At runtime, where are statements such as "TRACFCOMP" in PRD printed?

dcrowell77 commented 10 months ago

@zane131 @cnpalmer - Please help with the details.

  1. It depends on what you mean by "critical value". If the hardware cannot correct the error, then it will escalate to a UE, for example when multiple bits fail at the same time and ECC can't correct them. From a software standpoint, PRD will also keep a count of CEs. If that count passes a threshold (e.g. X CEs in 24 hours) then PRD will create a visible log, predictively guard the DIMM, and ask the hypervisor to vacate the failing memory.

  2. At a high level, you need to cause CEs in the memory. There are FIR injection registers to use for that, as well as potentially ways to physically bug the hardware.

  3. At runtime PRD runs inside of HBRT. If you are using OPAL (as I suspect) then this happens inside the opal-prd application. The traces are then saved into the Linux logs ('/var/log/opal-prd.log' or 'journalctl -u opal-prd' depending on which distro is being used).
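
As an illustration of the "X CEs in 24 hours" idea in point 1, here is a minimal sketch of a sliding-window counter. It is not PRD's actual MemCeTable logic; the class name `CeThreshold`, its members, and the choice of a deque are all assumptions made for this example.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical sliding-window CE counter. This is NOT the real
// MemCeTable; it only illustrates "X CEs within a time window".
class CeThreshold
{
  public:
    using Seconds = std::chrono::seconds;

    CeThreshold(std::size_t i_limit, Seconds i_window)
        : iv_limit(i_limit), iv_window(i_window) {}

    // Records one CE observed at time i_now (seconds since an arbitrary
    // epoch). Returns true when the number of CEs still inside the window
    // exceeds the limit.
    bool addCe(Seconds i_now)
    {
        iv_events.push_back(i_now);

        // Drop events that have aged out of the window.
        while (!iv_events.empty() &&
               (i_now - iv_events.front()) >= iv_window)
        {
            iv_events.pop_front();
        }

        return iv_events.size() > iv_limit;
    }

  private:
    std::size_t iv_limit;          // allowed CEs per window
    Seconds iv_window;             // window length
    std::deque<Seconds> iv_events; // timestamps of recent CEs
};
```

With a limit of 2 and a 24-hour window, the third CE inside the window is the one that makes `addCe` return true. What happens next in real firmware (visible log, predictive guard) is outside this sketch.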

zane131 commented 10 months ago

I think the key here is that hardware, not firmware, detects and reports the CE and UE conditions based on the ECC algorithm. This firmware (PRD) simply responds to reported hardware events and requests service as needed.

Grubby0624 commented 10 months ago

Hi dcrowell77, "From a software standpoint, PRD will also keep a count of CEs. If that count passes a threshold (e.g. X CEs in 24 hours) then PRD will create a visible log and predictively guard the DIMM". This is exactly the function I am interested in, and I want to trigger it by writing the FIR injection registers:

  1. I added some logs in MemCeTable::addEntry using "PRDF-ERR".
  2. I then wrote an NCE error to the EXPLR RDF EICR (0x08011C0D). Using getscom, I saw the corresponding bit in EXPLR_RDF-FIR (0x08011C00) get set.
  3. However, the log I added does not appear in /var/log/opal-prd.log. I suspect that addEntry was never entered, so there is no possibility of triggering UE when the threshold is reached? Could you please check whether my understanding is wrong? If MemCeTable::addEntry really was not called, I will debug why opal-prd did not call it. Thanks!

Grubby0624 commented 10 months ago

Hi zane131, "detects and reports the CE and UE conditions based on the ECC algorithm." I want to test this feature by injecting an error register. In this case, if a CE is recorded in the FIR register, how does it trigger PRD to record a CE error?

zane131 commented 10 months ago

so there is no possibility of triggering UE when the threshold is reached?

A memory UE happens when the hardware ECC algorithm is unable to correct the error. Depending on the hardware this would require 2 or 3 CEs on the same cache line originating from different DRAMs on that cache line. There is nothing in firmware that would initiate a UE on threshold.

I want to test this feature by injecting an error register.

This should be discussed through your IPS contacts.

dcrowell77 commented 10 months ago

how does it trigger PRD to record a CE error?

The hardware sets a FIR bit. That then flows up the various layers of the FIR tree until it reaches the top (global) level. At that point the chip will assert some kind of interrupt. For Explorer, that signal goes across the OMI bus into the processor's memory controller logic. From that point it will flow up the FIR tree again. When it reaches the top of the P10 tree it will cause a host interrupt to fire. That interrupt is seen by Linux, passed to the Opal driver, then HBRT is called to analyze the attention.
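
The way a FIR bit "flows up" the tree can be pictured with a very simplified model. The sketch below is an assumption-laden illustration, not hardware or PRD code: the `FirNode` struct and its fields are hypothetical, and real FIRs carry more state (action registers, attention types, per-parent masks) than is shown here.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of one node in a FIR tree.
struct FirNode
{
    std::uint64_t fir = 0;  // latched error bits
    std::uint64_t mask = 0; // a 1 bit means that error bit is masked
    std::vector<const FirNode*> children; // lower-level FIRs feeding this one

    // A node reports an attention upward if it has any unmasked bit set
    // or if any node below it reports an attention.
    bool isActive() const
    {
        if (fir & ~mask)
        {
            return true;
        }
        for (const FirNode* child : children)
        {
            if (child->isActive())
            {
                return true;
            }
        }
        return false;
    }
};
```

In this model, a leaf standing in for the Explorer RDF FIR would sit under an Explorer global node, which in turn sits under a P10 memory controller node and the P10 global node. Setting an unmasked bit in the leaf makes the top node active, which corresponds to the point where the host interrupt would fire.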

Grubby0624 commented 10 months ago

"HBRT is called to analyze the attention." I think I can now get to this point, but I still have a question that I hope you can help explain: when ceTableRc is not 0, SetCallOut is executed only when areDramRepairsDisabled() returns true. Does this mean that even when CEs reach the threshold, a predictive callout will not be triggered if areDramRepairsDisabled() returns false?

Grubby0624 commented 10 months ago

This is the location of the relevant code: https://github.com/open-power/hostboot/blob/98f0a2536841224994ff1c1595ae72ccba4d839c/src/usr/diag/prdf/common/plat/mem/prdfMemEccAnalysis.C#L540C26-L540C26

zane131 commented 10 months ago

You're digging into some pretty complicated design. "DRAM Repairs" encompasses a set of procedures to test memory for errors in a target scope (usually on a single rank basis) and possibly apply repairs with redundant hardware. One of these procedures is called TPS, which is called in the else statement. So the check is "are DRAM Repairs disabled?" If so, make an immediate callout. If not, start the TPS procedure.
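
The branch described here can be reduced to a small decision function. This is only a sketch of the control flow as explained in this thread, not the code in prdfMemEccAnalysis.C: the enum `CeAction` and the function `handleCeTableResult` are hypothetical names, and the boolean parameter merely stands in for the result of areDramRepairsDisabled().

```cpp
// Hypothetical outcome of handling a CE table result.
enum class CeAction
{
    None,             // CE table reported nothing to act on
    ImmediateCallout, // DRAM repairs disabled: call out right away
    StartTps          // DRAM repairs enabled: start the TPS procedure
};

// i_ceTableRc:           nonzero means the CE table flagged a condition.
// i_dramRepairsDisabled: stands in for areDramRepairsDisabled().
CeAction handleCeTableResult(unsigned i_ceTableRc, bool i_dramRepairsDisabled)
{
    if (0 == i_ceTableRc)
    {
        return CeAction::None;
    }

    return i_dramRepairsDisabled ? CeAction::ImmediateCallout
                                 : CeAction::StartTps;
}
```

So a nonzero result with DRAM repairs enabled does not skip handling; it hands the rank to TPS instead of making the callout directly.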

Grubby0624 commented 9 months ago

I think I can successfully trigger this function, thank you very much.