Open tapir2342 opened 7 months ago
What pushed me back in January (103b45db) was ld erroneously introducing _pei386_runtime_relocator references in certain (admittedly unusual) circumstances, and I had to use "-fno-lto" to make it behave itself. I don't personally feel LTO is all that valuable, and now it was actively getting in my way, so I finally disabled it. I was pleasantly surprised to see this also knocked ~9MiB off the distribution, which created even more incentive to keep the change.
I wanted to provide an example, but I couldn't reproduce my issue from January in w64devkit 1.21.0. I think now that I was experiencing a bug specifically in Binutils 2.41 (99268966). If you re-enable LTO:
$ sed -i /--disable-lto/d Dockerfile
Build a toolchain, then try to extract, say, ___chkstk_ms from libgcc.a using partial linking:
$ cc -r -u ___chkstk_ms -o chkstk.o -lgcc
$ nm chkstk.o
You'll see "U _pei386_runtime_relocator" in the nm listing. That shouldn't be there. Add "-fno-lto" and it goes away, or disable LTO in the toolchain like I did.
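If you want to script that check rather than eyeball the listing, here is a minimal sketch. The helper name `has_undefined` is my own invention, not part of any toolchain; it just reads `nm` output on stdin and looks for an undefined ("U") entry with the given name.

```shell
# Hypothetical helper (name is made up for illustration): reads `nm` output
# on stdin and succeeds only if the symbol in $1 is listed as undefined ("U").
has_undefined() {
    awk -v sym="$1" '$1 == "U" && $2 == sym { found = 1 } END { exit !found }'
}

# Intended use with the commands from the comment above:
#   cc -r -u ___chkstk_ms -o chkstk.o -lgcc
#   nm chkstk.o | has_undefined _pei386_runtime_relocator && echo "bogus reference present"
```

Defined symbols have an address in the first column, so only genuinely undefined entries match.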
The Windows ports of the GNU ecosystem tend to be second class, and the less-trodden features, such as LTO, receive less testing. It's certainly had its sharp edges in the Mingw-w64 ecosystem:
https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/20230401111018.qn3vrsbgn7mfv3ge@pali/ (EDIT: Had to fix this link. GitHub's email support is so aggressively mediocre.)
Along these lines, if you build the latest u-config (including the one included in w64devkit 1.22) using the w64devkit 1.21 toolchain with LTO enabled, you'll get a broken binary without warning:
$ cc -nostartfiles -flto win32_main.c
$ ./a
SEGV
You'd also need to add a non-obvious "-u mainCRTStartup" to the command. This was fixed in Binutils 2.41, so the above command will work correctly in an LTO-restored build of w64devkit, but my overall confidence hasn't risen much.
Perhaps LTO is valuable in huge C++ projects to deduplicate piles of template instantiations, or projects that break hot paths across multiple translation units (preventing inlining). The typical case appears to be zero or little (a few percentage points) performance improvement from LTO.
@skeeto LTO's biggest benefit seems to be not so much speed as code size. It makes a bigger difference than you'd expect in that regard. There are some benchmarks here: https://youtu.be/GufwdypTfrE
It's unfortunately true that LTO exposes you to more compiler bugs, though, and even more so on *-w64-mingw32. I wanted to build cppcheck with LTO as an example and encountered this: https://gcc.gnu.org/PR106103
It's supposedly fixed, but I'd still like to revisit when gcc 14.1 is released.
I still think enabling it in the toolchain is overall a good idea. And potentially it's a good avenue to reduce some of w64devkit's size.
Thanks, @Peter0x44, that was an interesting talk. My main takeaways from the talk:
(Unfortunately, as I'm sure you already realize, PGO isn't going to be practical when building w64devkit itself due to cross-compilation.)
If I re-enable LTO per the sed command above, then build Cppcheck with LTO using that toolchain, I get a 10% size reduction (~300K). I don't find this particularly impressive, especially for how much it costs (125% build time increase, an extra 9M of toolchain distributed, 22M installed). I tried again with my GCC 14 snapshot branch, same results. Also, that LTO bug is not fixed as of the April 5th GCC 14 snapshot, so I still needed the declone option.
If you'd like to reproduce this yourself, I used w64devkit's cppcheck.mak with these changes, on Cppcheck 2.10:
--- a/cppcheck.mak
+++ b/cppcheck.mak
@@ -3,5 +3,5 @@
 obj := $(src:.cpp=.o)
-CXXFLAGS := -w -Os -Ilib $(addprefix -I,$(ext))
+CXXFLAGS := -w -Os -Ilib -flto -fno-declone-ctor-dtor $(addprefix -I,$(ext))
 cppcheck.exe: $(obj)
-	$(CXX) -s -o $@ $(obj) -lshlwapi
+	$(CXX) -s -o $@ -Os -flto=auto $(obj) -lshlwapi
 cppcheck: $(obj)
Disabling LTO is a bit of an experiment. It sat on the master branch for over two months without objections, so I felt comfortable trying it in a release. I could be persuaded it's worth reverting to the default, especially as LTO-related bugs are fixed, but I'm not there yet.
Perhaps bigger gains can be achieved when building gcc or potentially busybox-w32 with it also. It's on my to-do list to investigate the potential benefits there.
For some time now (six years or so?) my release builds of busybox-w32 have used LTO. It makes the binaries smaller and doesn't seem to have resulted in any issues.
The more recent clang/aarch64 build doesn't use LTO as it made the binary slightly larger. By 0.2%. Oh no!
I've decided for now to continue with LTO still disabled in the 1.23.0 release today.
@rmyorston aarch64-w64-mingw32 support was recently merged for gcc 15. Perhaps it's worth investigating whether it would reduce the executable size.
Maybe it's worth considering for w64devkit also, but there are few WoA devices that can be purchased, and gcc won't support it officially until next year, so I wouldn't make it a high priority. Just something to be aware of.
Please bring LTO back as soon as possible. I develop a project for DOS, and code size is very important to me. I've googled this problem and it seems there is some work in progress to fix it. For now, the recommendation is to mark _pei386_runtime_relocator as used.
Another thought for LTO, but I don't know the practical implications of it yet. The system compiler of arch has:
Supported LTO compression algorithms: zlib zstd
w64devkit only has zlib. Perhaps it's worth compiling zstd and letting gcc use it when lto is deemed worth reintroducing, but that has its own binary size concerns, and also the question of whether it's even useful.
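For anyone wanting to compare toolchains on this point, here is a small sketch. It assumes the "Supported LTO compression algorithms" line is the one `gcc -v` prints on stderr alongside its version and configuration details. The helper name `lto_compression` is made up for illustration.

```shell
# Hypothetical helper: reads `gcc -v` output on stdin and prints only the
# list of LTO compression algorithms that the compiler reports.
lto_compression() {
    sed -n 's/^Supported LTO compression algorithms: //p'
}

# Intended use (gcc -v writes to stderr, hence the redirect):
#   gcc -v 2>&1 | lto_compression
```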
Supported LTO compression algorithms
That's just for the intermediate object files, isn't it? I don't think it should have any effect on the final binary (where LTO information gets stripped).
Could this be re-enabled? This breaks existing build configurations for performance-optimized libraries.
This broke my build of MiniVM. It was a simple enough fix, but slowed down the VM's GC significantly.
I wanted to experiment with this again, so I went to build an lto-enabled w64devkit for my own usage, but it had some minor problems. Simply doing:
$ sed -i /--disable-lto/d Dockerfile
resulted in ar being broken for LTO use:
$ cat square.c
int square(int x) { return x*x; }
$ cat test.c
#include <stdio.h>
int square(int);
int main(void)
{
    printf("%d squared is %d", 4, square(4));
}
$ gcc -flto -c square.c
$ ar rcs libsquare.a square.o
ar: square.o: plugin needed to handle lto object
libsquare.a was still created, but linking against it did not work:
$ gcc test.c libsquare.a
C:/programming/toolchains/w64devkit/bin/ld.exe: C:\Users\peter\AppData\Local\Temp\ccUerGLN.o:test.c:(.text+0x67): undefined reference to `square'
collect2.exe: error: ld returned 1 exit status
gcc-ar, however, did work correctly.
What fixed it was copying w64devkit\libexec\gcc\x86_64-w64-mingw32\14.2.0\liblto_plugin.dll to w64devkit\lib\bfd-plugins\liblto_plugin.dll.
I suspect this has to do with the unusual way gcc is configured by w64devkit (https://github.com/skeeto/w64devkit/blob/8ea0fee87c88b27316ffcf9e0d1ef0caa4cc8a91/Dockerfile#L256-L257), with the sysroot having $ARCH at the end of it.
I did not have time to test any theories relating to this, but perhaps @skeeto might have some ideas. I haven't found this to be necessary in my own experiments building mingw-w64 toolchains outside of w64devkit.
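For reference, the copy workaround described above could be scripted roughly like this. It's only a sketch: the function name `install_lto_plugin` and its arguments are hypothetical, and the directory layout is taken from the two paths quoted in the comment, nothing more.

```shell
# Hypothetical helper: copy GCC's LTO plugin into lib/bfd-plugins so that
# plain `ar` can find it. Layout assumed from the paths quoted above.
#   $1 = w64devkit root, $2 = target triple, $3 = GCC version
install_lto_plugin() {
    src="$1/libexec/gcc/$2/$3/liblto_plugin.dll"
    dst="$1/lib/bfd-plugins"
    # Refuse to continue if the plugin is not where we expect it.
    [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }
    mkdir -p "$dst" && cp "$src" "$dst/"
}

# Example call (root path is a placeholder):
#   install_lto_plugin /path/to/w64devkit x86_64-w64-mingw32 14.2.0
```

As far as I understand it, gcc-ar sidesteps the problem because it is a wrapper that runs ar with the plugin passed explicitly, which would explain why it worked without the copy.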
Hi @skeeto, thanks for making this. Can you explain (or point me to resources) why LTO got disabled in version 1.22.0? Thank you.