raspberrypi / pico-sdk

BSD 3-Clause "New" or "Revised" License
3.8k stars 953 forks source link

LTO support and C/CXX flags override hard to achieve #97

Open profi200 opened 3 years ago

profi200 commented 3 years ago

First of all thanks for making this so painless. It worked basically right out of the box even with a different arm-none-eabi toolchain.

LTO support Linking is failing with these errors:

/tmp/ccbJpNID.s: Assembler messages:
/tmp/ccbJpNID.s:25: Error: selected processor does not support `sev' in ARM mode
/tmp/ccbJpNID.s:53: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:117: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:406: Error: selected processor does not support `bkpt #0' in ARM mode
/tmp/ccbJpNID.s:427: Error: selected processor does not support `bkpt #0' in ARM mode
/tmp/ccbJpNID.s:449: Error: selected processor does not support requested special purpose register -- `mrs lr,PRIMASK'
/tmp/ccbJpNID.s:452: Error: selected processor does not support `cpsid i' in ARM mode
/tmp/ccbJpNID.s:463: Error: selected processor does not support `dmb' in ARM mode
/tmp/ccbJpNID.s:501: Error: selected processor does not support `dmb' in ARM mode
/tmp/ccbJpNID.s:510: Error: selected processor does not support requested special purpose register -- `msr PRIMASK,lr'
/tmp/ccbJpNID.s:583: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:618: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:714: Error: selected processor does not support `blx r2' in ARM mode
/tmp/ccbJpNID.s:729: Error: selected processor does not support `blx r3' in ARM mode
lto-wrapper: fatal error: /opt/devkitpro/devkitARM/bin/arm-none-eabi-g++ returned 1 exit status
compilation terminated.
/opt/devkitpro/devkitARM/lib/gcc/arm-none-eabi/10.1.0/../../../../arm-none-eabi/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/blink.dir/build.make:691: blink.elf] Fehler 1
make[2]: Verzeichnis „/home/blub/pico/blink/build“ wird verlassen
make[1]: *** [CMakeFiles/Makefile2:1494: CMakeFiles/blink.dir/all] Fehler 2
make[1]: Verzeichnis „/home/blub/pico/blink/build“ wird verlassen
make: *** [Makefile:103: all] Fehler 2

These errors are quite strange since it's usually the other way around with something not being supported in thumb mode. Have not digged too deep yet to find out what's causing these.

C/CXX flags I tried to override the C and linker flags to have different optimization than default by doing

cmake .. -DCMAKE_C_FLAGS="-O2 -flto -fno-data-sections" -DCMAKE_EXE_LINKER_FLAGS="-O2 -flto"

The result is this order of flags to the compiler: ...-I/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/include -O2 -flto -fno-data-sections -O3 -DNDEBUG -ffunction-sections -fdata-sections -o CMakeFiles/blink.dir/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c.obj -c /home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c Which means it's basically overriding my flags instead of the other way around. Somewhere it's including the override flags before the usual SDK given flags.

Would be nice if at least the C/CXX flags issue can be fixed. I understand if LTO is not happening for a while since it tends to uncover undefined behavior or bugs not seen without but in my experience it does help shrink down code size more.

Full build log: https://gist.github.com/profi200/718e20b9d3ebec87db60f78d0deb994b

kilograham commented 3 years ago

if you use target_compile_definitions or target_compile_optionson your target/library you should get the right thing.

Note that -O2 is the default (for "Release" or "RelWithDebugInfo" CMAKE_BUILD_TYPE), -Og for "Debug", -Os for "MinSizeRel"

First of all thanks for making this so painless. It worked basically right out of the box even with a different arm-none-eabi toolchain.

Thanks, we have tried hard to make things "easy" for the beginner, but give the power user full control. This is quite hard to do (especially when we have both native and device compilation going on). So there are certainly things that are still a little hidden/hard coded

(Note if you search elsewhere you can also configure per source flags with set_source_file_properties.

-flto is on our list to try (but haven't done so yet for anything other than the bootrom). I suspect it will drop stuff on the floor at the moment. Plus we have been thinking with -gc-sections in mind, but that shouldn't make much odds.

Note things you can try to tweak your build config further (if you should so wish) other than just editing what is there.

  1. Sledgehammer: Make a new build type like cmake/preload/toolchains/pico_arm_gcc.cmake and set PICO_COMPILER
  2. clone pico_standard_link library, and either give it a new name and explicitly direct on your version instead (rather than via pico_stdlib), or perhaps more easily, define set(SKIP_PICO_STANDARD_LINK 1) at the top of your CMakeLists.txt then add your own pico_standard_link version after the project init.

Some of this stuff might be a little off the happy path for now, but we are keen to improve.

profi200 commented 3 years ago

Are you sure? It does use -O3 here for release builds. Not sure if this is due to some cmake system default (i definitely didn't set that). Maybe that should be explicitly set in the SDK. And btw i have always been disabling -fdata-sections because it can have a negative impact. I stumbled upon these issues when i noticed even the blink example is 12 KiB in size which is a bit heavy for something that toggles a GPIO + delays.

I have barely any experience with cmake and rather stayed away from it when possible. I guess i need to dive deeper into what i can and can't set in CMakeLists.txt.

Yes, i know the pain. LTO can do insane optimizations that break non-standard conforming or code with undefined behavior in weird ways. Like moving global variable writes completely elsewhere when other functions depend on it.

edit: Oh, and a little side note: Maybe the "-mcpu=cortex-m0plus" should be replaced by "-mtune=cortex-m0plus". gcc is starting to abandon -mcpu on a few platforms.

kripton commented 3 years ago

C/CXX flags I tried to override the C and linker flags to have different optimization than default by doing

cmake .. -DCMAKE_C_FLAGS="-O2 -flto -fno-data-sections" -DCMAKE_EXE_LINKER_FLAGS="-O2 -flto"

The result is this order of flags to the compiler: ...-I/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/include -O2 -flto -fno-data-sections -O3 -DNDEBUG -ffunction-sections -fdata-sections -o CMakeFiles/blink.dir/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c.obj -c /home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c Which means it's basically overriding my flags instead of the other way around. Somewhere it's including the override flags before the usual SDK given flags.

So if I add add_compile_options(-flto) in my project's CMakeLists.txt-file (between project and add_executable statements), it seems to be in effect. All compiler invocations work fine. However, during linking, I get this:

/tmp/cc1KABxt.s: Assembler messages:
/tmp/cc1KABxt.s:11061: Error: invalid offset, value too big (0x000041F4)

For very small projects (aka hello world) it seems to work fine. But for anything useful, it simply fails with the message above or similar.

gcc is:

kripton@momo ~/git/rp2040-compressiontest/build2 $ arm-none-eabi-gcc -v
Using built-in specs.
COLLECT_GCC=arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/arm-none-eabi/11.1.0/lto-wrapper
Target: arm-none-eabi
Configured with: /var/tmp/portage/cross-arm-none-eabi/gcc-11.1.0-r1/work/gcc-11.1.0/configure --host=x86_64-pc-linux-gnu --target=arm-none-eabi --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/arm-none-eabi/gcc-bin/11.1.0 --includedir=/usr/lib/gcc/arm-none-eabi/11.1.0/include --datadir=/usr/share/gcc-data/arm-none-eabi/11.1.0 --mandir=/usr/share/gcc-data/arm-none-eabi/11.1.0/man --infodir=/usr/share/gcc-data/arm-none-eabi/11.1.0/info --with-gxx-include-dir=/usr/lib/gcc/arm-none-eabi/11.1.0/include/g++-v11 --with-python-dir=/share/gcc-data/arm-none-eabi/11.1.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 11.1.0-r1 p2' --disable-esp --enable-libstdcxx-time --with-build-config=bootstrap-lto --enable-poison-system-directories --disable-libstdcxx-time --with-sysroot=/usr/arm-none-eabi --disable-bootstrap --with-newlib --enable-multilib --disable-fixed-point --disable-libgomp --disable-libssp --disable-libada --disable-systemtap --enable-valgrind-annotations --disable-vtable-verify --disable-libvtv --with-zstd --enable-lto --with-isl --disable-isl-version-check --disable-libsanitizer --disable-default-pie --enable-default-ssp --with-cpu=cortex-m0plus
Thread model: single
Supported LTO compression algorithms: zlib zstd
gcc version 11.1.0 (Gentoo 11.1.0-r1 p2) 

binutils is v2.36.1

kilograham commented 3 years ago

yeah, i think -flto is very free and easy with reordering of code... there are probably some assembly functions that expect to be proximate to each other. Sadly in some cases being the best way for LTO or for GC may not be the same thing, so we may have to special case.

kilograham commented 3 years ago

Note assigning myself, as we do plan to look at LTO in 1.3.0 release

lurch commented 3 years ago

Oh, and a little side note: Maybe the "-mcpu=cortex-m0plus" should be replaced by "-mtune=cortex-m0plus". gcc is starting to abandon -mcpu on a few platforms.

See https://github.com/raspberrypi/pico-sdk/issues/253#issuecomment-798428466 (and #260 )

asumagic commented 3 years ago

If it helps, disabling LTO for pico-sdk/src/rp2_common/hardware_irq/irq.c solves the "invalid offset, value too big" error in my case.

The linker however then spits out a ton of undefined references related to __wrap_ functions, which pointed me to these few bug reports, which may be relevant:

I tried using the gold linker, but it seems to dislike the linker scripts, and spits out some errors about a COPY section type: pico-sdk/src/rp2_common/pico_standard_link/memmap_default.ld:123:23: COPY section type is unsupported. According to the source, only NOLOAD is supported. I tried replacing the (COPY)s with (NOLOAD)s, which I assume is a bad idea, and... ended up triggering an internal error elsewhere anyway.

I tried using lld, but -fuse-ld=lld causes an error because on my system it looks for /usr/bin/arm-none-eabi-ld.lld. Symlinking /usr/bin/lld (which I didn't really expect to work anyway) fails with:

[build] arm-none-eabi-ld.lld: error: no memory region specified for section '.data'
[build] arm-none-eabi-ld.lld: error: no memory region specified for section '.bss'

... so I'm assuming figuring out what to do with the --wrap issue and fixing it would be the next step to get a proof of concept compilation with LTO.

... Good luck with this :)

asumagic commented 3 years ago

Looks like I got a bit further after all. In platform.h, I modified WRAPPER_FUNC to be defined as such:

#define WRAPPER_FUNC(x) __attribute__((used)) __wrap_ ## x

Now it successfully compiles and run.

Compared to having LTO activated for everything except the SDK (assuming I didn't mess up something), the generated .bin file is slightly larger, but I suspect this is may be due by a few things:

I have not tested for performance, however.

(EDIT: as an interesting note, setting PICO_NO_GC_SECTIONS while LTO is active results in a significantly larger executable. I'm assuming that LTO cannot eliminate functions as much as the linker itself can. Does --gc-functions affect functions tagged with __attribute__((used))?)

kilograham commented 3 years ago

Thanks; ugh i wasn't aware of the --wrap/LTO issue

asumagic commented 3 years ago

While everything seems to run fine with my workarounds applied and I have not encountered any breakage, I have realized something (which in retrospect is rather explicit in the reports I've linked to): gcc+bfd with LTO could inline some of the original, non-wrapped functions.

I have probably not encountered issues because function wraps that may have mattered were not inlined and the call site still referred to the symbol that got wrapped later on.
That may rather likely be because most wrapped functions are compiled in the newlib static archive (or whatever it is), which is not built with LTO on my platform... but it could be on another, from what I can tell.

This might be worth investigating further, because I'm not quite sure if that's happening for me here.

Interestingly, using x86-64 desktop linux compilers and a simple test case:

$ cat main.c 
#include "orig_decl.h"

int main()
{
        foo();
}
$ cat orig_decl.c
#include <stdio.h>

void foo()
{
        puts("foo()");
}
$ cat wrap.c
#include "orig_decl.h"

#include <stdio.h>

void __wrap_foo()
{
        puts("__wrap_foo()");
}

... only clang+lld properly emitted a call to __wrap_foo! gcc+bfd, gcc+gold, gcc+lld (doesn't seem to be a supported combination for LTO at all), clang+bfd, clang+gold all emitted a call directly to foo...

That being said, with clang thin LTO (-flto=thin), clang+lld AND clang+gold work. I assume this is a difference in the implementation of the LTO plugins for each of the linkers. I feel like that would be in a platform-independent manner, but this may be worth looking if the exact same behavior occurs when targeting armv6.

Rather disappointingly, in the above tests, __wrap_foo() never gets inlined into main (probably because wrapping happens after LTO is performed). foo() gets eliminated only with --gc-sections.

Unfortunately, I'm quite not sure what the best approach would be. Since clang support is in the same milestone as this, it might be worth trying to support clang LTO with lld only first if it turns out to be simpler to support - or use an approach different from --wrap, which here seems very much non-trivial, if possible at all.

(FWIW, LTO with any linker but bfd would be a significantly better developer experience due to that usecase being really slow due to lack of parallelization from what I can tell)

niansa commented 10 months ago

This is such a simple fix, why after almost 3 years hasn't it been implemented properly yet? I guess I could give it a go...

Edit: Though the retain attribute may be a better fit: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-retain-function-attribute

asumagic commented 10 months ago

I am not up to date, but unless pico-sdk stopped using wrap functions, I do not think the problem I described is gone. In particular, gcc+bfd LTO is insidiously broken.

niansa commented 10 months ago

Adding that attribute to all wrap functions as a proof of concept fixed it for me.

asumagic commented 10 months ago

Do make sure that this solution does not cause the issue that I mentioned above (i.e. that inlining may inline the original function rather than the wrapper). Depending on the wrapped function, there may not be a noticeable side-effect for your usecase (it has been fine for me), but it would likely not be upstreamable.

kripton commented 7 months ago

Adding that attribute to all wrap functions as a proof of concept fixed it for me.

do you have a fork of the pico-sdk where we could try that?

asumagic commented 7 months ago

Adding that attribute to all wrap functions as a proof of concept fixed it for me.

do you have a fork of the pico-sdk where we could try that?

This change would be a one-liner afaict, but beware the caveats I mentioned in my replies above.

https://github.com/raspberrypi/pico-sdk/commit/7f494d0a194ddbf73713dff482c5e01e5e878a7a

asumagic commented 3 months ago

There was some activity in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88643 and it appears that the wrapping issue should be resolved with ld as of binutils 2.43. My distro doesn't package it yet for arm-none-eabi, though, so I'll see if I can try this easily myself shortly.

EDIT: no dice, still getting undefined references to __wrap_printf and such. Although the __attribute__((used)) still works (but might still be broken), as long as you set PICO_STDIO_SHORT_CIRCUIT_CLIB_FUNCS=0 (unless patching the pico-sdk anyway).