Code size for resource-constrained target

magnmaeh commented 5 months ago

This issue comes from the discussion: https://github.com/lf-lang/lingua-franca/pull/2262#issuecomment-2090492413

FlexPRET uses scratchpad memories for instruction and data memory (ISPM and DSPM, respectively). This provides predictable timing, but is expensive in hardware resources. When FlexPRET is realized on a Field-Programmable Gate Array (FPGA), ISPM and DSPM are implemented as block RAM (BRAM), which is a limited resource. For FlexPRET running on a Zedboard FPGA, a realistic amount of ISPM and DSPM is ~64 kB each. One can be increased if the other is reduced.

This means code size must be as small as possible. To achieve this, FlexPRET has its own lightweight implementation of printf. However, artifacts from newlib's printf are still present in the final executable. This drastically increases code size, to the point where the program can no longer run on the target. The newlib artifacts appear because a call to, e.g., ctime references many other newlib functions. The following is the chain of newlib functions referenced by a call to ctime:

ctime -> tzset -> _tzset_unlocked_r -> _malloc_r, siscanf
siscanf -> __ssvfiscanf_r

From just a single call to ctime, references to many large functions have been created, and all of them must be linked into the final executable. _malloc_r is ~2 kB, while __ssvfiscanf_r is ~6.7 kB. There are most likely multiple "chains" like this. The result is that a simple test, concurrent/TimeoutZeroTest.lf, produces code too large to run on FlexPRET realized on an FPGA. (In the case below the test was compiled for the emulator, which can have unlimited ISPM/DSPM.)

Memory region         Used Size  Region Size  %age Used
             ROM:      136364 B       256 KB     52.02%
             RAM:       17952 B       256 KB      6.85%

There are probably many other "top-level" functions that have similar chains of references. The best solution is to find all of these and either remove them from the source code or replace them with lightweight versions when compiling for embedded systems. However, that probably takes some effort.

The second-best solution is to filter out some of the massive newlib functions. That can be done using the linker's --wrap flag (https://linux.die.net/man/1/ld). When passing --wrap=function, undefined references to function are resolved to __wrap_function, and undefined references to __real_function are resolved to the original function. In this way, wrappers around a function can be created:

// Resolved by the linker to the original function
void __real_function(int arg);

void __wrap_function(int arg) {
    // Code to execute before function
    __real_function(arg);
    // Code to execute after function
}

However, if the __real_function is not called, there are no longer any references to it. Therefore, a way to entirely remove all references to e.g., __ssvfiscanf_r (and save lots of code space) looks like this:

Pass --wrap=__ssvfiscanf_r to the linker and define an empty stub:

void __wrap___ssvfiscanf_r(void) {}

This works because we are not really using __ssvfiscanf_r. It is just an artifact of newlib.
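To make the stubbing idea concrete, here is a minimal sketch that applies it to ctime itself rather than to __ssvfiscanf_r. This is not what the FlexPRET PR did: the choice of ctime as the wrapped symbol, the placeholder string, and the helper start_time_string are all assumptions made for illustration. It assumes a GNU-style linker and a link step along the lines of `gcc demo.c -Wl,--wrap=ctime`. Note that --wrap only redirects undefined references, so the wrapped function has to live in another object or library (here, libc).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical stub. When linking with -Wl,--wrap=ctime, every undefined
 * reference to ctime is resolved to this function instead of newlib's. */
char *__wrap_ctime(const time_t *t) {
    static char placeholder[] = "(time unavailable)\n";
    (void)t; /* the argument is deliberately ignored */
    return placeholder;
}

/* Application code stays unchanged and still calls ctime by name.
 * Without the --wrap flag this reaches the real ctime; with the flag it
 * reaches the stub above, and the real ctime is no longer referenced. */
const char *start_time_string(void) {
    time_t now = time(NULL);
    char *s = ctime(&now);
    assert(s != NULL);
    return s;
}
```

Because the stub never calls __real_ctime, nothing references the original function any more, so a static link has no reason to pull in it or the functions it depends on. The same file also links without the flag, in which case the stub is simply unused.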

On the FlexPRET support PR, this was done for three functions: _vfprintf_r, __ssvfiscanf_r, _svfiprintf_r. This reduced the code size of the test concurrent/TimeoutZeroTest.lf by more than 50%.

Memory region         Used Size  Region Size  %age Used
             ROM:       67228 B       256 KB     25.65%
             RAM:       14144 B       256 KB      5.40%

The three functions were found by running nm <compiled program> --size-sort, which shows all the symbols in a program sorted by size. The three largest functions related to printing/scanning were selected.

Exactly two files implement this: low_level_platform/api/CMakeLists.txt and low_level_platform/impl/src/lf_flexpret_stubs.c. A goal should be to remove the --wrap solution and instead sanitize the source code for high-footprint functions.

lhstrh commented 5 months ago

This is a great summary. Thank you so much for taking the time to write it down and posting it here!

erlingrj commented 5 months ago

Great, this is an issue for all resource-constrained targets. The call to ctime is there to print out the start time, date, and year when the program starts. This can be omitted for embedded targets. We could put it behind an #ifdef NO_TTY.

magnmaeh commented 5 months ago

Yes, actually, I thought there would be many functions like ctime and therefore the task of fixing this would be quite large. But in fact, there is only one other function like ctime that contributes to massive code size: fprintf. When both of these are filtered out, it has the same effect as doing the --wrap stuff.

Since fprintf is used to print messages to stderr, it makes sense to just replace this by a normal printf when NO_TTY is defined.

So I'll actually just implement these changes in the PR right away and step away from the --wrap stuff.

magnmaeh commented 5 months ago

Haha that actually worked quite well. Now I'm getting this code size, which is even better:

Memory region         Used Size  Region Size  %age Used
             ROM:       61120 B       256 KB     23.32%
             RAM:       13696 B       256 KB      5.22%

Now I feel silly for spending all that time exploring the --wrap stuff... But at least I learned some things :) In my latest commit I removed the wrap stuff and instead added

// FlexPRET has no tty
#define NO_TTY

// Likewise, fprintf is used to print to `stderr`, but FlexPRET has no `stderr`
// We instead redirect its output to normal printf
#define fprintf(stream, fmt, ...) printf(fmt, ##__VA_ARGS__)

to lf_flexpret_support.h. Also I did this for ctime:

#ifdef NO_TTY
  lf_print("---- Start execution ----");
#else
  lf_print("---- Start execution at time %s---- plus %ld nanoseconds", ctime(&physical_time_timespec.tv_sec),
           physical_time_timespec.tv_nsec);
#endif // NO_TTY
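For readers who want to try the two changes in isolation, the following is a minimal, self-contained sketch that runs on a hosted system. The functions report_error and print_start_banner are hypothetical names used only for this illustration, and plain printf stands in for lf_print. Two details are worth noting: the macro has to be defined after <stdio.h> is included (otherwise it would rewrite the header's own declaration of fprintf), and `##__VA_ARGS__` is a GNU extension supported by GCC and Clang.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Assumption for this sketch: the target has no tty. */
#define NO_TTY

/* Must come after <stdio.h>. The stream argument is dropped, so anything
 * written to stderr goes through printf instead. */
#define fprintf(stream, fmt, ...) printf(fmt, ##__VA_ARGS__)

/* Hypothetical error reporter: looks like it writes to stderr, but the
 * macro above expands the call to printf("error %d\n", code). */
int report_error(int code) {
    assert(code != 0); /* 0 is treated as "no error" in this sketch */
    return fprintf(stderr, "error %d\n", code);
}

/* Hypothetical start banner mirroring the #ifdef shown above. With NO_TTY
 * defined, ctime is never referenced, so it is not linked in. */
int print_start_banner(void) {
#ifdef NO_TTY
    return printf("---- Start execution ----\n");
#else
    time_t now = time(NULL);
    return printf("---- Start execution at time %s----\n", ctime(&now));
#endif
}
```

Both functions return the number of characters printed, which makes it easy to check that the output really went through printf.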

lsk567 commented 4 months ago

Thanks for the great insight, @magnmaeh. It is surprising how big an impact print functions can have on code size!

lf-lang / reactor-c

Code size for resource-constrained target #418