FSP 2.0 skips MRC cache and forces MRC training on Intel SPS systems

c0d3z3r0 commented 4 years ago

On systems with Intel SPS, the MRC cache is not used and memory retraining is forced on every cold boot. This issue is confirmed on two boards (Supermicro X11SSH and X11SSM). One workaround is flashing ME instead of SPS.

@PatrickRudolph

nate-desimone commented 4 years ago

@c0d3z3r0, @PatrickRudolph:

It appears this is happening during the CPU Replacement Check. The CPU replacement check is a function in the ME firmware specifically for LGA socket (non-soldered down) motherboards that enables the FSP to detect if the user has replaced the CPU in the motherboard with a new one. If a new CPU is found, then the MRC training needs to be redone since there is some part-to-part variation in physical silicon characteristics.

I looked through the KBL FSP source code, and found the following snippet of code:

    if (MeTypeIsSps ()) {
      //
      // SPS firmware does not support CPU replaced detection
      //
      *ForceFullTraining = TRUE;
      return EFI_SUCCESS;
    }

Hence, SPS firmware does not implement the CPU Replacement Check feature. Since it is not possible for the MRC to determine if the CPU is the same, it is forced to run the full training every time. So it appears this behavior is expected when using the SPS firmware.
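To make the control flow easier to follow, here is a minimal standalone sketch of the decision described above. It is not FSP source: `me_fw_type_t`, `cpu_replacement_check_sketch` and `can_use_mrc_cache_sketch` are hypothetical names invented for illustration, and the only behavior taken from the quoted snippet is that SPS always forces full training.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical firmware flavours; the names are illustrative only. */
typedef enum {
    ME_FW_CLIENT, /* regular (client) ME firmware */
    ME_FW_SPS     /* SPS firmware */
} me_fw_type_t;

/*
 * Sketch of the check: writes true to *force_full_training when the
 * cached training data must not be reused. Returns 0 on success.
 */
static int cpu_replacement_check_sketch(me_fw_type_t fw,
                                        bool me_reports_cpu_replaced,
                                        bool *force_full_training)
{
    assert(force_full_training != NULL);

    if (fw == ME_FW_SPS) {
        /* SPS cannot report a CPU swap, so always retrain. */
        *force_full_training = true;
        return 0;
    }

    /* Assumption: the regular ME firmware answers the question directly. */
    *force_full_training = me_reports_cpu_replaced;
    return 0;
}

/* The MRC cache is only usable if it is valid and training is not forced. */
static bool can_use_mrc_cache_sketch(bool cache_valid, me_fw_type_t fw,
                                     bool me_reports_cpu_replaced)
{
    bool force = true;
    (void)cpu_replacement_check_sketch(fw, me_reports_cpu_replaced, &force);
    return cache_valid && !force;
}
```

With SPS, `can_use_mrc_cache_sketch` returns false even when the cache is valid and the CPU is unchanged, which matches the behavior reported in this issue.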

Th3Fanbus commented 4 years ago

Hi Nate, I wonder: what would happen if this were to be replaced by a no-op? If the CPU has actually been replaced, the old timings might not work properly, in which case a full reset and retrain should flush them, I guess? Or maybe it could be made configurable via a UPD.

nate-desimone commented 4 years ago

Hi @Th3Fanbus, the thing we are concerned about is the case of it kinda/sorta working, but not being the best training data. This may result in a performance degradation that would make the Intel processor appear to be slower than it actually is. Performance degradation has the potential to impact Intel's brand perception. For this reason we take the conservative approach of re-running training.

It's actually not a bad idea to re-run training about once a year anyway. As the CPU ages, salt leaches into the silicon and alters its physical characteristics (for the worse). Usually some other component of the computer breaks before the CPU... but like everything, CPUs don't last forever. Re-running training will help mitigate the natural aging process. Since reliability is more important than boot time on server platforms, we made this design decision.

c0d3z3r0 commented 4 years ago

@nate-desimone Couldn't that problem be solved by simply implementing the CPU replacement check in SPS firmware?

nate-desimone commented 4 years ago

@c0d3z3r0 Yes that would absolutely solve it. The SPS firmware team decided against implementing that feature for reasons unknown to me. I don't know who to ask as I don't know anyone from the SPS firmware team. Most of my work thus far has been on client platforms, which have a completely different ME firmware implementation and a different team.

c0d3z3r0 commented 4 years ago

Two suggestions from my side, if there is no way of talking to the SPS team:

document it
add a UPD to force-ignore SPS

c0d3z3r0 commented 4 years ago

oh well, or just finally make FSP open-source, as Intel promised...

Th3Fanbus commented 4 years ago

Hi @nate-desimone, I understand that, given the stringent requirements of server environments, a more conservative approach was chosen for them. In addition, servers seldom need to be rebooted, so longer boot times due to memory retraining are not a problem. However, in a workstation with a server mainboard, longer boot delays significantly degrade user experience. Therefore, it would be reasonable to make this behavior configurable.

Although I am not a lawyer, I believe that the FSP license does not allow modifying an FSP binary so that it does not forcefully retrain on every boot. In any case, it would not be an ideal solution. Moreover, given Skylake's age, I would not expect any feature updates for its SPS firmware: adding a mere CPU replacement check is not worth the costs and risks of rolling out an update of such a highly privileged piece of software.

Considering that, other options to avoid training delays I could think of:

Read the flash chip of another Skylake board with a regular ME firmware. Then hope that ME firmware works, because it likely contains data specific to the donor board that is incompatible with the recipient board.
Extract a "clean slate" ME firmware from a firmware updater for another mainboard. Of course, provided that the license terms of that firmware updater allow doing so, which is usually not the case.
Just add a new FSP-M UPD to control this behavior, which could default to the current behavior for compatibility purposes.

I would say the last proposal is a reasonably simple enhancement to ask for. What do you think?
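As a rough illustration of the UPD idea, the sketch below shows how such a knob could default to the current behavior. Everything here is hypothetical: no such UPD exists in FSP as far as this thread shows, and `fspm_upd_sketch_t`, `sps_cpu_replacement_policy` and the policy values are made-up names, not part of any real FSP-M UPD structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy values for the proposed (non-existent) UPD. */
enum {
    SPS_POLICY_FORCE_TRAINING  = 0, /* default: today's behavior */
    SPS_POLICY_TRUST_MRC_CACHE = 1  /* opt in to reusing cached training data */
};

/* Hypothetical slice of an FSP-M UPD region; not the real layout. */
typedef struct {
    uint8_t sps_cpu_replacement_policy;
} fspm_upd_sketch_t;

/*
 * Returns true when SPS should still force full training.
 * Any value other than the explicit opt-in keeps the conservative
 * behavior, so a zero-filled UPD region behaves exactly as before.
 */
static bool sps_forces_full_training(const fspm_upd_sketch_t *upd)
{
    assert(upd != NULL);
    return upd->sps_cpu_replacement_policy != SPS_POLICY_TRUST_MRC_CACHE;
}
```

Choosing 0 for the existing behavior means bootloaders that never set the field are unaffected, which is the compatibility property asked for above.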

n-huber commented 4 years ago

@nate-desimone, I just had this thought: For boards that are not designed for CPU hotplugging, we can be reasonably sure that the CPU wasn't changed if we boot from S5 (not G3). Wouldn't this be something that could easily be implemented in FSP? i.e. extend the warm-boot behaviour to boots from S5?

nate-desimone commented 4 years ago

Hi All,

@Th3Fanbus - With regard to workstation platforms... we encourage OEMs to use the regular ME firmware on workstations for this exact reason. If you have SPS firmware flashed originally by your OEM, then your PCH is actually fused for SPS and won't run the regular ME firmware even if you were to load the regular ME binary on to the flash.

@n-huber - This is actually a check for a cold swap of the CPU. We run the "fast" memory training flow even on a cold boot. CPU hotplug is not supported on this platform at all.

c0d3z3r0 commented 4 years ago

> If you have SPS firmware flashed originally by your OEM, then your PCH is actually fused for SPS and won't run the regular ME firmware even if you were to load the regular ME binary on to the flash.

Huh? Are those SPS fuses documented anywhere? My machine runs ME fine, even though it shipped with SPS.

nate-desimone commented 4 years ago

You are right, on this chipset that will work. On other chipsets there are issues.

c0d3z3r0 commented 4 years ago

> are those SPS fuses documented anywhere?

JayTalbott commented 4 years ago

I'm using CFL. On the CFL-H CRB, I replaced the original CPU that came with the CRB with one that is the same SKU as the customer board. Would that cause retraining on every boot?

If so, once it's retrained after the change in the CPU, is there a way to inform the ME that the memory has been retrained so that it doesn't cause the retraining every time unnecessarily?

Would building a new IFWI with the latest ME kit (instead of just stitching SBL into the original IFWI extracted from the CRB) so that you get a clean/fresh ME image with no knowledge of the original CPU solve this problem?

Thanks!

Th3Fanbus commented 4 years ago

Hi Jay,

> I'm using CFL. On the CFL-H CRB, I replaced the original CPU that came with the CRB with one that is the same SKU as the customer board. Would that cause retraining on every boot?

As I understand the CpuReplacementCheck in MRC, it should only force full training once.

> If so, once it's retrained after the change in the CPU, is there a way to inform the ME that the memory has been retrained so that it doesn't cause the retraining every time unnecessarily?

The ME should record which CPU is currently installed somewhere within the ME region. Of course, if you flash a firmware image whose ME firmware thinks the currently-installed CPU is the old one, then the CpuReplacementCheck would trigger again.

> Would building a new IFWI with the latest ME kit (instead of just stitching SBL into the original IFWI extracted from the CRB) so that you get a clean/fresh ME image with no knowledge of the original CPU solve this problem?

It should work, yes. Something that would also work for a single board is to extract the IFWI after booting with the new CPU, and use it when stitching.

> Thanks!

Hope this helps!
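The "only force full training once" behavior and the stale-image caveat described in this reply can be pictured with a small sketch. This is a guess at the general shape, not how the ME actually stores anything: `me_record_sketch_t` and the opaque 64-bit CPU identity are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for whatever the ME keeps in its flash region. */
typedef struct {
    uint64_t recorded_cpu; /* opaque identity of the last-seen CPU (assumed) */
} me_record_sketch_t;

/*
 * Returns true if the installed CPU differs from the recorded one, and
 * updates the record so that the next boot with the same CPU returns false.
 */
static bool cpu_replaced_since_last_boot(me_record_sketch_t *rec,
                                         uint64_t installed_cpu)
{
    assert(rec != NULL);

    if (rec->recorded_cpu == installed_cpu) {
        return false; /* same CPU: cached training data can be reused */
    }

    rec->recorded_cpu = installed_cpu; /* remember the new CPU */
    return true;                       /* one-time full training */
}
```

Under these assumptions, flashing an image whose ME region still holds the old CPU's record resets `recorded_cpu`, so the check fires once more on the next boot.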

nate-desimone commented 4 years ago

Hi @JayTalbott, I agree with @Th3Fanbus that it is probably a good idea to re-stitch the ME. Technically, ME can cause the full training to happen under the following circumstances:

The CPU was cold-swapped since the last boot
The ME encountered an error while checking for CPU Replacement (This could happen if the ME is an older version and does not have all the newest CPUIDs)
The ME is in recovery mode or otherwise "out to lunch": the FSP reaches its timeout period and assumes that no response means it should run a full training for safety.
The SPS variant of ME is being used (which does not implement the CpuReplacementCheck)
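The four circumstances listed above can be summarized as a small decision function. This is a sketch for discussion only: the enum and function names are hypothetical, and the real FSP/ME interface is not shown anywhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ME_KIND_CLIENT, ME_KIND_SPS } me_kind_t;

/* Hypothetical outcomes of asking the ME about CPU replacement. */
typedef enum {
    ME_SAYS_CPU_UNCHANGED,
    ME_SAYS_CPU_REPLACED,
    ME_SAYS_ERROR,   /* e.g. ME too old to know the installed CPUID */
    ME_NO_RESPONSE   /* recovery mode or timeout */
} me_reply_t;

typedef enum {
    REASON_NONE,
    REASON_CPU_COLD_SWAPPED,
    REASON_ME_CHECK_ERROR,
    REASON_ME_TIMEOUT,
    REASON_SPS_NO_CHECK
} full_training_reason_t;

/* Returns true if full training is required and reports why in *reason. */
static bool full_training_required(me_kind_t kind, me_reply_t reply,
                                   full_training_reason_t *reason)
{
    assert(reason != NULL);

    if (kind == ME_KIND_SPS) {
        /* SPS does not implement the check, so the reply is irrelevant. */
        *reason = REASON_SPS_NO_CHECK;
        return true;
    }

    switch (reply) {
    case ME_SAYS_CPU_UNCHANGED:
        *reason = REASON_NONE;
        return false;
    case ME_SAYS_CPU_REPLACED:
        *reason = REASON_CPU_COLD_SWAPPED;
        return true;
    case ME_SAYS_ERROR:
        *reason = REASON_ME_CHECK_ERROR;
        return true;
    case ME_NO_RESPONSE:
        *reason = REASON_ME_TIMEOUT;
        return true;
    }

    /* Unknown reply value: stay conservative and retrain. */
    *reason = REASON_ME_CHECK_ERROR;
    return true;
}
```

Reporting a reason alongside the boolean would make cases like an outdated ME image easier to tell apart from an actual CPU swap when debugging repeated retraining.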

JayTalbott commented 4 years ago

I'm currently using the original ME that came on the CFL-H CRB.

I restored the original BIOS image that was on the CRB when I first received it, and it does the same thing - always retrains on every boot - with the different CPU.

I will try rebuilding the IFWI with a more recent ME kit.

nate-desimone commented 4 years ago

My guess is that ME image was built before support for the newer SKUs was added. Re-stitching with a new ME kit will probably help.

JayTalbott commented 4 years ago

New ME version solved the problem.

Thanks everybody!

intel / FSP

FSP 2.0 skips MRC cache and forces MRC training on Intel SPS systems #41