mupq / pqm4

Post-quantum crypto library for the ARM Cortex-M4

Add an option to use the CCM as main RAM (or at least as stack) #180

Open pornin opened 3 years ago

pornin commented 3 years ago

While doing some benchmarks of some other code on an STM32F407 board, I noticed that I would get a 10% speed-up by placing the program's RAM (data and stack) in the CCM instead of SRAM1; or, more accurately, that with the main RAM in SRAM1 I was getting extra clock cycles that could not be accounted for by simply following the assembly instructions. I was using the 24 MHz clock setup that is also used in pqm4, so Flash access for the code was "0 wait state".

The reference manual promises (page 59) that "the bus matrix provides access from a master to a slave, enabling concurrent access and efficient operation even when several high-speed peripherals work simultaneously", but I fear that this is not always true in an absolute sense. What I observe can be explained if there are resource-usage conflicts in the interconnection matrix between data accesses and instruction fetches. Note that the I-cache and D-cache, when enabled (I tried both), don't help, since they sit between the Flash and the matrix, not between the matrix and the CPU. Since the CCM is accessed directly by the CPU, without going through the matrix, it does not suffer from such conflicts and thus generates no wait states.

(By the way, this explains why the CCM exists at all. The CCM is less convenient for general programming since peripherals cannot DMA into it; but it's faster for CPU-only purposes.)

Unfortunately, on these microcontrollers the CCM is smaller than SRAM (only 64 kB), so some algorithms that can run out of SRAM1 won't necessarily fit in the CCM. But I think it is worth adding an option to put the RAM (or at least the stack) into the CCM and see if that yields a speed-up. A very simple way to do that is to modify the linker script, declaring that the RAM region starts at address 0x10000000 instead of 0x20000000.
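For concreteness, here is a minimal sketch of what such a linker-script change could look like. The MEMORY block below is illustrative only: the region names and sizes follow a generic STM32F407 memory map, not any particular pqm4 linker script.

```
/* Illustrative only: point the RAM region at the 64 kB CCM (0x10000000)
 * instead of the 128 kB SRAM1/SRAM2 block (0x20000000), so that .data,
 * .bss and the stack bypass the bus matrix. */
MEMORY
{
  FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 1024K
  /* RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 128K */   /* SRAM1 + SRAM2 */
  RAM (rwx)  : ORIGIN = 0x10000000, LENGTH = 64K         /* CCM */
}
```

The caveat from above still applies: any buffer that a peripheral needs to DMA into must not end up in the CCM.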

rpls commented 3 years ago

We are in the process of a fairly major rewrite (see #174, nearly complete). In that context, there were also some discussions on how to handle different memory regions. The STM32F407, for example, has an upper 16 kB portion in the contiguous RAM segment at 0x20000000 which appears to exhibit different timings, and there are almost certainly similar situations on the many now-supported chips. We could, in principle, support different linker scripts for different schemes, for example to place certain data in favorable memory sections (see the sketch below).

But the question isn't only what we can do (and how), but also what we want to measure and compare with the benchmarks. Is it "how fast is scheme X at best on platform Y" (i.e., best possible speed), or "how does scheme X compare to all others on platform Y under the same conditions"? For the former we would place everything in the fastest memory possible for each scheme; for the latter we would try to find the best fit for all schemes (and, as you said, the CCM is usually smaller and may not suit all schemes).
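As a rough sketch of what per-scheme placement could look like, the fragment below defines both regions and routes a dedicated input section into the CCM. The section name `.ccmram` and the exact layout are made up for illustration; they are not taken from the actual pqm4 linker scripts.

```
/* Illustrative only: keep the default RAM in SRAM1/SRAM2, but route a
 * dedicated input section into the CCM, so individual schemes can opt
 * selected buffers into the faster memory, e.g. by tagging them with
 * __attribute__((section(".ccmram"))) in C. */
MEMORY
{
  FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 1024K
  RAM (rwx)  : ORIGIN = 0x20000000, LENGTH = 128K   /* SRAM1 + SRAM2 */
  CCM (rw)   : ORIGIN = 0x10000000, LENGTH = 64K    /* core-coupled memory */
}

SECTIONS
{
  /* Zero-initialized data explicitly placed in the CCM. */
  .ccmram (NOLOAD) :
  {
    . = ALIGN(4);
    *(.ccmram)
    *(.ccmram*)
    . = ALIGN(4);
  } > CCM
}
```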

By the way, I had some funny results when using CCM memory for either instructions (not all CCMs support instruction fetches, but some do) or data. Sometimes Flash with the cache enabled produced faster results. There's definitely some research to be done.

Long story short: we're looking into it; any anecdotes about the performance of specific schemes, or about memory-timing funny business, are appreciated and might help.