open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

What problem does "HW414700" solve? #213

Closed Grubby0624 closed 1 year ago

Grubby0624 commented 1 year ago

https://github.com/open-power/hostboot/blob/a34899267f2b8aaa94fd58c961a63b772b3b057f/src/import/chips/p9/procedures/hwp/memory/lib/fir/unmask.C#L307

When I tested DIMM RAS on a Nimbus 2.1 CPU, I found that a system checkstop could not be triggered. After changing the ATTR so that it also targets the Nimbus 2.1 CPU, the checkstop and DIMM deconfiguration were triggered successfully. I have two questions:

  1. What is the "HW414700" change for?
  2. Besides setting the "Mainline UE" to checkstop, does setting this ATTR to true have any other effects?

dcrowell77 commented 1 year ago

  1. HW414700 describes an early chip bug in which some SUE reporting could be missed. Prior to Nimbus 2.1 there was a chance of missing errors, so applying this setting forces a checkstop in those cases.
  2. If you search the code you can see that HW414700 appears in a lot of places. It seems to affect more than just regular memory, since I see it in some of the other initfiles too. In general it will cause more failures to checkstop instead of properly failing with an SUE/machinecheck.

Why are you interested in forcing checkstops for these kinds of errors? In general we would want the errors to flow upward into a possibly non-fatal machinecheck/SUE that the OS could handle accordingly. These systems are designed to avoid full system checkstops whenever possible.
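
A minimal sketch of the gating described in this answer, not hostboot's actual code: every name below (`FirAction`, `has_hw414700`, `mainline_ue_action`) and the `0x21` encoding of DD2.1 are hypothetical, chosen only for illustration. It assumes the workaround flag is true for parts before DD2.1 and that the flag alone decides whether a mainline UE escalates to a checkstop or is left to surface as an SUE/machinecheck.

```cpp
#include <cstdint>

// Hypothetical model of the two outcomes discussed in the thread.
enum class FirAction
{
    PropagateSue, // let the error flow upward as an SUE/machinecheck
    Checkstop     // force a full system checkstop
};

// Assumption: the DD level is packed as 0xMm (major nibble, minor nibble),
// so DD2.0 is 0x20 and DD2.1 is 0x21. Parts before DD2.1 are treated as
// having the HW414700 bug.
constexpr bool has_hw414700(std::uint8_t dd_level)
{
    return dd_level < 0x21;
}

// With the workaround flag set, a mainline UE is escalated to a checkstop;
// otherwise it is left to propagate so the OS can handle it.
constexpr FirAction mainline_ue_action(bool workaround_enabled)
{
    return workaround_enabled ? FirAction::Checkstop : FirAction::PropagateSue;
}
```

Under these assumptions, `mainline_ue_action(has_hw414700(0x20))` yields `FirAction::Checkstop`, while the same call for `0x21` or later yields `FirAction::PropagateSue`. Forcing the flag to true for a DD2.1 part corresponds to what the original poster did by changing the ATTR.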

Grubby0624 commented 1 year ago

Thank you for your answer

  1. On Nimbus 2.1, the OS gets stuck during the DIMM RAS test, and the system serial port keeps reporting the error "Memory failure: 0x20000000: reserved kernel page still referenced by 1 users" for several hours. I think this is abnormal and unacceptable from a usability perspective.
  2. Nimbus 2.2/2.3 does not show this behavior. After the DIMM RAS test, the checkstop is triggered and the corresponding DIMM is restarted.

So I guess Nimbus 2.1 also has the bug of missing some SUE reporting. That is why I'm interested in "forcing checkstops for these kinds of errors".

dcrowell77 commented 1 year ago

It seems unlikely that DD2.1 has the bug and that it went unaddressed. However, that level of part is technically only supported as part of the https://github.com/ibm-op-release/op-build branch. It looks like you are trying to use our most current code level. All sorts of other settings could be incorrect for DD2.1 if you are using master. It is possible that you are missing some other tangentially related behavior that the OS relies on to handle the error properly. It does seem like the OS knows the memory is bad, which I think means that the initial chip bug was fixed, since that bug was a case of not reporting the error at all (a silent failure).