Grubby0624 closed this issue 1 year ago
Why are you interested in forcing checkstops for these kinds of errors? In general, we want the errors to flow upward into a possibly non-fatal machine check/SUE that the OS can handle accordingly. These systems are designed to avoid full system checkstops whenever possible.
Thank you for your answer.
It seems unlikely that DD2.1 still has the bug and that it went unaddressed. However, that level of part is technically only supported as part of the https://github.com/ibm-op-release/op-build branch. It looks like you are trying to use our most current code level. If you are using master, all sorts of other settings could be incorrect for DD2.1. You may be missing some other tangentially related behavior that the OS relies on to handle the error properly. It does seem like the OS knows the memory is bad, which I think means the initial chip bug was fixed, since that bug was a case of not reporting the error at all (a silent failure).
https://github.com/open-power/hostboot/blob/a34899267f2b8aaa94fd58c961a63b772b3b057f/src/import/chips/p9/procedures/hwp/memory/lib/fir/unmask.C#L307

When I tested DIMM RAS on the Nimbus 2.1 CPU, I found that a system checkstop could not be triggered. After I changed the ATTR to target the Nimbus 2.1 CPU, both the checkstop and the DIMM deconfiguration were triggered successfully. May I ask why this is?