open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

Issue regarding MBE error events DIMM number #60

Open anoo1 opened 8 years ago

anoo1 commented 8 years ago

Reference: https://github.com/openbmc/openbmc/issues/299

@causten discussed this issue with the hostboot team, but the originator still asked that an issue be opened against hostboot to request more details.

dcrowell77 commented 8 years ago

I really don't know what the request is here, even after reading the referenced openbmc issue.

williamspatrick commented 8 years ago

A memory UE is not expected to be isolated to a single DIMM, so there are always multiple callouts. If I remember right, Hostboot does two things when there is an error with a guard event:

  1. Send down a SEL/eSEL.
  2. Assert fault sensors for the associated FRUs.

Since a SEL is associated with a sensor, in effect, the SEL also points to a FRU. What Chris was observing is only one FRU from the SEL path but all the FRUs from the fault sensor path. We treat sensors as "inventory" and SELs as "error events", and the two are handled by different pieces of code. The fault sensor also has no correlation back to the SEL, so the BMC is unable to show an association between the SEL and all the callout FRUs; we only get the one for the SEL.

This information is contained deep in the eSEL, which the BMC is not presently able to parse; parsing it would also require a bunch of HUID->FRU-ID mapping on the BMC side. As best as we can tell, there is no way to get this information in a pure IPMI sense.

anoo1 commented 8 years ago

Thanks for the info, that's what they were looking for. Closing.

williamspatrick commented 8 years ago

My comments were an attempt to explain the situation to @dcrowell77. @dcrowell77 with this in mind, do you know of any solution for us?

dcrowell77 commented 8 years ago

I'm still trying to grasp the observation since the data is reported in a format I'm not familiar with. Do you have the SEL logs to go along with the error?

I'm a little confused about the description above. A SEL is nothing but a record that a sensor changed value; it doesn't exist on its own. In an error case there are 2 possible sensors that could be modified. There are fault sensors that indicate an error and provide the externally visible callout. For a given error log we assert the fault sensor that corresponds to the highest priority callout (or all of them if there is a tie). Each of those will generate a SEL. We could also modify the functional sensors for any target that we deconfigure, either directly or by association. Any changes to the functional sensors should also show up as SELs. The eSEL we generate is just for FFDC/FA; there is no expectation for it to be parseable by the BMC code. Do we have a copy of the eSEL that belongs to this situation? It would be helpful to see what the actual callouts are.
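The fault-sensor rule described above (assert the sensor for the highest priority callout, or all of them on a tie) can be sketched roughly as follows. This is only an illustration, not Hostboot code: the `callout` struct, its fields, the `select_fault_sensors` name, and the convention that a larger number means higher priority are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callout record; not a Hostboot type. */
typedef struct {
    uint32_t huid;          /* target identifier (illustrative) */
    int      priority;      /* assumption: larger value = higher priority */
    uint8_t  fault_sensor;  /* fault sensor number for this target (illustrative) */
} callout;

/*
 * Copies the fault sensor numbers of every callout that shares the
 * highest priority into out[], up to cap entries, preserving input order.
 * Returns the number of sensors written.
 */
size_t select_fault_sensors(const callout *c, size_t n,
                            uint8_t *out, size_t cap)
{
    if (c == NULL || out == NULL || n == 0)
        return 0;

    /* First pass: find the highest priority present. */
    int top = c[0].priority;
    for (size_t i = 1; i < n; i++) {
        if (c[i].priority > top)
            top = c[i].priority;
    }

    /* Second pass: collect every callout tied at that priority. */
    size_t k = 0;
    for (size_t i = 0; i < n && k < cap; i++) {
        if (c[i].priority == top)
            out[k++] = c[i].fault_sensor;
    }
    return k;
}
```

Under this rule, an error log with one high priority DIMM callout yields a single fault-sensor SEL, which would be consistent with only one FRU showing up on the SEL path.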

My theory here is that we have an error with a single DIMM callout that then causes a deconfig-by-association of the other 3 DIMMs in the group, so everything is working as expected.

causten commented 8 years ago

First, forget about the specific memory failure; it is clouding up the request. It's quite simple. Even though the code is working as a HB developer expects, it leaves me wanting more. In fact, it left a few others wanting more too, because code was written to tie at least one SEL to the eSEL. Notice how an eSEL is created: the first 16 bytes of an eSEL are a SEL.

The first 16 bytes of this data should be in SEL event format:

  3rd byte: record type, 0xDF
  10th byte: event revision number, either 0x03 or 0x04

Example structure:

    typedef struct
    {
        INT16U ID;
        INT8U  Type;        /* should be 0xDF */
        INT32U TimeStamp;
        INT8U  GenID[2];
        INT8U  EvMRev;      /* should be 0x03 or 0x04 */
        INT8U  SensorType;
        INT8U  SensorNum;
        INT8U  DirType;
        INT8U  Signature;   /* should be 0xAA */
        INT16U offset;
    } PACKED ExtendSELHead_T;

The actual extended SEL data follows these 16 bytes.
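As a rough illustration of the layout above, the sketch below decodes the 16-byte prefix of an eSEL buffer and checks the three fixed values noted in the structure comments (`Type`, `EvMRev`, `Signature`). The names `esel_head` and `esel_parse_head` are made up for this example, and the little-endian decoding of the multi-byte fields is an assumption based on the usual IPMI byte order, not something stated in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ESEL_HEADER_LEN 16u  /* SEL-format prefix of an eSEL */

/* Fields mirror ExtendSELHead_T; names here are illustrative. */
typedef struct {
    uint16_t id;
    uint8_t  type;         /* expected 0xDF */
    uint32_t timestamp;
    uint8_t  gen_id[2];
    uint8_t  evm_rev;      /* expected 0x03 or 0x04 */
    uint8_t  sensor_type;
    uint8_t  sensor_num;
    uint8_t  dir_type;
    uint8_t  signature;    /* expected 0xAA */
    uint16_t offset;
} esel_head;

/*
 * Decodes the first 16 bytes of buf into *out, byte by byte, so the
 * result does not depend on struct packing or host endianness.
 * Assumption: multi-byte fields are stored least significant byte first.
 * Returns true only if the buffer is long enough and the Type, EvMRev
 * and Signature bytes hold the expected values.
 */
bool esel_parse_head(const uint8_t *buf, size_t len, esel_head *out)
{
    if (buf == NULL || out == NULL || len < ESEL_HEADER_LEN)
        return false;

    out->id          = (uint16_t)(buf[0] | (buf[1] << 8));
    out->type        = buf[2];
    out->timestamp   = (uint32_t)buf[3]
                     | ((uint32_t)buf[4] << 8)
                     | ((uint32_t)buf[5] << 16)
                     | ((uint32_t)buf[6] << 24);
    out->gen_id[0]   = buf[7];
    out->gen_id[1]   = buf[8];
    out->evm_rev     = buf[9];
    out->sensor_type = buf[10];
    out->sensor_num  = buf[11];
    out->dir_type    = buf[12];
    out->signature   = buf[13];
    out->offset      = (uint16_t)(buf[14] | (buf[15] << 8));

    return out->type == 0xDF
        && (out->evm_rev == 0x03 || out->evm_rev == 0x04)
        && out->signature == 0xAA;
}
```

A caller would pass the raw eSEL bytes and, on success, read `sensor_num` to see which sensor the embedded SEL points at.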

So since you went to the effort of linking a SEL to an eSEL, why not tie all the SELs that were asserted to the eSEL? That's all the request is asking.

@dcrowell77 your observations about the deconfig by association are correct: the error is injected on 1 DIMM and the other 3 get knocked out. It's just that there is no way to tell from the other 3 SELs that they hit an error because of the same issue. That ends up causing distractions and questions, and costs time explaining to lots of people how our IPMI part works with POWER's deconfiguration policies.