Intel(R) Firmware Support Package (FSP)

[Enhancement] Proposal FSP-M multi-core ram training #21

Closed zaolin closed 5 years ago

zaolin commented 5 years ago

Hey Intel FSP team,

We have a feature request to speed up RAM training by using multiple CPU cores.

nate-desimone commented 5 years ago

Hi @zaolin,

Believe it or not, multi-threading won't actually make a difference here.

The reason is that Intel's memory subsystem design includes an independent finite state machine for each memory channel. We activate those state machines, and they then run independently and in parallel without the processor core doing anything. At a high level, the memory training algorithms pretty much boil down to something like this:

foreach(memory_channel) {
  activate_artificial_memory_traffic_fsm();  //Just programs a bunch of registers, very fast
}

// At this point, a bunch of synthetic memory write + reads are happening
// in the entire system regardless of whatever the x86 processor is doing

foreach(possible_voltage_level_setting) {
   foreach(memory_channel) {
     program_voltage_level();  //Just programs a bunch of registers, very fast
   }
   sleep(dwell_time);  //dwell_time is a relatively large number
                       //(in the range of milliseconds)
                       //It has to be in order to statistically
                       //guarantee our results are accurate...
                       //someone does a bunch of math to
                       //figure out what this number should be

   errors = check_for_errors(); //This reads an error count register
                                //from a hardware comparator to see how many times the
                                //artificially generated data written to DRAM didn't match
                                //the data read back from DRAM. Again, very fast.
}
foreach(memory_channel) {
  deactivate_artificial_memory_traffic_fsm();  //Just programs a bunch of registers, very fast
}
do_some_math_to_figure_out_the_best_voltage_level();
foreach(memory_channel) {
  program_voltage_level();
}
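As an aside, one illustrative (and deliberately simplistic) way the final do_some_math_to_figure_out_the_best_voltage_level() step could be realized is to pick the center of the widest error-free window from the sweep. The actual MRC math is proprietary and more involved; the C sketch below, with made-up error counts, is just a stand-in to show the general idea.

/* Illustrative stand-in for do_some_math_to_figure_out_the_best_voltage_level():
 * pick the setting at the center of the longest run of zero-error measurements.
 * Not the actual proprietary MRC algorithm. */
#include <stdio.h>

static int pick_best_voltage_level(const unsigned errors[], int num_settings)
{
    int best_start = -1, best_len = 0;
    int run_start = -1, run_len = 0;

    for (int i = 0; i < num_settings; i++) {
        if (errors[i] == 0) {
            if (run_len == 0)
                run_start = i;
            run_len++;
            if (run_len > best_len) {
                best_len = run_len;
                best_start = run_start;
            }
        } else {
            run_len = 0;
        }
    }
    /* Center of the widest passing window, or -1 if nothing passed. */
    return (best_len > 0) ? best_start + best_len / 2 : -1;
}

int main(void)
{
    /* Hypothetical per-setting error counts collected by the sweep above. */
    const unsigned errors[] = { 9, 3, 0, 0, 0, 0, 0, 2, 8, 14 };
    printf("best voltage level setting: %d\n",
           pick_best_voltage_level(errors, 10));  /* prints 4 */
    return 0;
}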

I should mention this is an over-simplified picture: we run this same code snippet across hundreds of different parameters and at different memory frequencies. We also have some very proprietary power and performance optimization algorithms that run on top of this.

The key takeaway, though, is that almost all the time it takes to run the MRC is just the x86 processor sitting in that call to sleep(). The CPU isn't the bottleneck; it's just that the laws of nature dictate we have to wait for millions of memory transactions to occur before taking an error measurement, due to the inherent entropy of the universe.
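To put rough numbers on that (all of these figures are made up for illustration; real values are product-specific), the time budget is dominated by the dwell time, not by the register programming a CPU core does:

/* Back-of-envelope: total training time is dominated by dwell time.
 * All numbers are illustrative, not real product values. */
#include <stdio.h>

int main(void)
{
    const double dwell_ms           = 2.0;   /* per-measurement dwell          */
    const double settings_per_param = 30.0;  /* voltage/delay steps swept      */
    const double params_x_freqs     = 300.0; /* parameters times frequencies   */
    const double register_io_ms     = 0.001; /* per-step CPU work, tiny        */

    double dwell_total = dwell_ms * settings_per_param * params_x_freqs;
    double cpu_total   = register_io_ms * settings_per_param * params_x_freqs;

    printf("waiting on dwell:        %.1f s\n", dwell_total / 1000.0); /* 18.0 */
    printf("programming registers:   %.3f s\n", cpu_total / 1000.0);   /* 0.009 */

    /* Splitting the register programming across N cores only shrinks the
     * second number, which is already negligible next to the first. */
    return 0;
}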

As you can see, our memory testing methods in MRC are quite a bit different from something like memtest86 and are much more sophisticated (at the expense of being extremely specific to every memory subsystem design, and therefore changing a lot every year since we release new products annually). Because of the hardware acceleration gained by our FSMs, we already get a fair amount of parallelism without actually having to do any parallel programming.

With all that said, now that we have MP_SERVICES_PPI in coreboot, the option to explore multi-threading MRC does exist. I'll bring this feedback to the MRC team so they can consider multi-threading as a possibility.
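For reference, dispatching per-channel work across application processors through the PI-defined EFI_PEI_MP_SERVICES_PPI could look roughly like the sketch below. This assumes an EDK II PEIM build environment; TrainOneChannel() and CHANNEL_CONTEXT are hypothetical placeholders, not actual MRC code.

/* Sketch only: fan hypothetical per-channel work out to APs using the
 * PI-defined PEI MP Services PPI. Assumes an EDK II build; the training
 * procedure and its context are placeholders. */
#include <PiPei.h>
#include <Ppi/MpServices.h>
#include <Library/PeiServicesLib.h>

typedef struct {
  UINTN  ChannelIndex;
} CHANNEL_CONTEXT;

/* Runs on each AP; would look up which channel this processor owns. */
STATIC
VOID
EFIAPI
TrainOneChannel (
  IN OUT VOID  *Buffer
  )
{
  CHANNEL_CONTEXT  *Ctx = (CHANNEL_CONTEXT *) Buffer;
  /* ... program that channel's registers, run its sweep, etc. ... */
  (VOID) Ctx;
}

EFI_STATUS
DispatchChannelTraining (
  IN CONST EFI_PEI_SERVICES  **PeiServices
  )
{
  EFI_PEI_MP_SERVICES_PPI  *MpServices;
  CHANNEL_CONTEXT          Ctx = { 0 };
  EFI_STATUS               Status;

  Status = PeiServicesLocatePpi (
             &gEfiPeiMpServicesPpiGuid,
             0,
             NULL,
             (VOID **) &MpServices
             );
  if (EFI_ERROR (Status)) {
    return Status;
  }

  /* Run the procedure on all enabled APs concurrently; 0 = no timeout.
   * The argument is shared; a real design would hand each AP its own
   * channel context (e.g. keyed off WhoAmI()). */
  return MpServices->StartupAllAPs (
                       PeiServices,
                       MpServices,
                       TrainOneChannel,
                       FALSE,   /* SingleThread: FALSE = run APs in parallel */
                       0,       /* TimeoutInMicroSeconds */
                       &Ctx
                       );
}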

zaolin commented 5 years ago

Okay thanks for the info!