skeeto / w64devkit

Portable C and C++ Development Kit for x64 (and x86) Windows
The Unlicense
3.21k stars 226 forks source link

Question: Why was LTO disabled in version 1.22.0? #124

Open tapir2342 opened 7 months ago

tapir2342 commented 7 months ago

Hi @skeeto , thanks for making this. Can you explain (or point me to resources) why LTO got disabled in version 1.22.0? Thank you.

skeeto commented 7 months ago

What pushed me back in January (103b45db) was ld erroneously introducing _pei386_runtime_relocator references in certain (admittedly unusual) circumstances, and I had to use "-fno-lto" to make it behave itself. I don't personally feel LTO is all that valuable, and now it was actively getting in my way, so I finally disabled it. I was pleasantly surprised to see this also knocked ~9MiB off the distribution, which created even more incentive to keep the change.

I wanted to provide an example, but I couldn't reproduce my issue from January in w64devkit 1.21.0. I think now that I was experiencing a bug specifically in Binutils 2.41 (99268966). If you re-enable LTO:

$ sed -i /--disable-lto/d Dockerfile

Build a toolchain, then try to extract, say, ___chkstk_ms from libgcc.a using partial linking:

$ cc -r -u ___chkstk_ms -o chkstk.o -lgcc $ nm chkstk.o

You'll see "U _pei386_runtime_relocator" in the nm listing. That shouldn't be there. Add "-fno-lto" and it goes away, or disable LTO in the toolchain like I did.

The Windows ports of the GNU ecosystem tend to be second class, and the less-trodden features, such as LTO, receive less testing. It's certainly had its sharp edges in the Mingw-w64 ecosystem:

https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/20230401111018.qn3vrsbgn7mfv3ge@pali/ (EDIT: Had to fix this link. GitHub's email support is so aggressively mediocre.)

Along these lines, if you build the latest u-config (including the one included in w64devkit 1.22) using the w64devkit 1.21 toolchain with LTO enabled, you'll get a broken binary without warning:

$ cc -nostartfiles -flto win32_main.c $ ./a SEGV

You'd need to also add a not-obvious "-u mainCRTStartup" to the command. This was fixed in Binutils 2.41, so the above command will work correctly in an LTO-restored build of w64devkit, but my overall confidence hasn't raised much.

Perhaps LTO is valuable in huge C++ projects to deduplicate piles of template instantiations, or projects that break hot paths across multiple translation units (preventing inlining). The typical case appears to be zero or little (a few percentage points) performance improvement from LTO.

Peter0x44 commented 7 months ago

@skeeto LTO's biggest benefit seems to be not so much speed, but code size. It makes a bigger difference than you'd expect in that regard, there are some benchmarks here: https://youtu.be/GufwdypTfrE

Exposing you to more compiler bugs is unfortunately true, and even more so for *-w64-mingw32, though. I wanted to build cppcheck with LTO as an example and encountered this: https://gcc.gnu.org/PR106103

It's supposedly fixed, but I'd still like to revisit when gcc 14.1 is released.

I still think enabling it in the toolchain is overall a good idea. And potentially it's a good avenue to reduce some of w64devkit's size.

skeeto commented 7 months ago

Thanks, @Peter0x44, that was an interesting talk. My main takeaways from the talk:

(Unfortunately, as I'm sure you already realize, PGO isn't going to be practical when building w64devkit itself due to cross-compilation.)

If I re-enable LTO per the sed command above, then build Cppcheck with LTO using that toolchain, I get a 10% size reduction (~300K). I don't find this particularly impressive, especially for how much it costs (125% build time increase, an extra 9M of toolchain distributed, 22M installed). I tried again with my GCC 14 snapshot branch, same results. Also, that LTO bug is not fixed as of the April 5th GCC 14 snapshot, so I still needed the declone option.

If you'd like to reproduce this yourself, I used w64devkit's cppcheck.mak with these changes, on Cppcheck 2.10:

--- a/cppcheck.mak
+++ b/cppcheck.mak
@@ -3,5 +3,5 @@
 obj      := $(src:.cpp=.o)
-CXXFLAGS := -w -Os -Ilib $(addprefix -I,$(ext))
+CXXFLAGS := -w -Os -Ilib -flto -fno-declone-ctor-dtor $(addprefix -I,$(ext))
 cppcheck.exe: $(obj)
-   $(CXX) -s -o $@ $(obj) -lshlwapi
+   $(CXX) -s -o $@ -Os -flto=auto $(obj) -lshlwapi
 cppcheck: $(obj)

Disabling LTO is a bit of experiment. It sat on the master branch for over two months without objections, so I felt comfortable trying it in a release. I could be persuaded it's worth reverting back to the default, especially as LTO-related bugs are fixed. but I'm not there yet.

Peter0x44 commented 7 months ago

Perhaps bigger gains can be achieved when building gcc or potentially busybox-w32 with it also. It's on my to-do list to investigate the potential benefits there.

rmyorston commented 7 months ago

For some time now (six years or so?) my release builds of busybox-w32 have used LTO. It makes the binaries smaller and doesn't seem to have resulted in any issues.

The more recent clang/aarch64 build doesn't use LTO as it made the binary slightly larger. By 0.2%. Oh no!

skeeto commented 7 months ago

I've decided for now to continue with LTO still disabled in the 1.23.0 release today.

Peter0x44 commented 7 months ago

@rmyorston aarch64-w64-mingw32 support was recently merged for gcc 15, perhaps it's worth investigating if it would reduce the executable size.

Maybe it's worth considering for w64devkit also, but there are few WoA devices that can be purchased, and gcc won't support it officially until next year, I wouldn't put it on a high priority. Just something to be aware of.

MrMadguy64 commented 6 months ago

Please return LTO as soon, as possible. I develop project for DOS and code size is very important for me. I've googled this problem and it seems like there is some work in progress to fix it. For now it's recommended to mark _pei386_runtime_relocator as used.

Peter0x44 commented 6 months ago

Another thought for LTO, but I don't know the practical implications of it yet. The system compiler of arch has: Supported LTO compression algorithms: zlib zstd w64devkit only has zlib. Perhaps it's worth compiling zstd and letting gcc use it when lto is deemed worth reintroducing, but that has its own binary size concerns, and also the question of whether it's even useful.

N-R-K commented 5 months ago

Supported LTO compression algorithms

That's just for the intermediate object files isn't it? I don't think it should have any effect on the final binary (where lto information gets stripped).

R-Goc commented 2 months ago

Could this be reenabled? This breaks existing build configurations from performance optimized libraries.

ShawSumma commented 2 months ago

This broke my build of MiniVM. It was a simple enough fix, but slowed down the VM's GC significantly.

Peter0x44 commented 2 weeks ago

I wanted to experiment with this again, so I went to build an lto-enabled w64devkit for my own usage, but it had some minor problems. Simply doing:

$ sed -i /--disable-lto/d Dockerfile

Resulted in ar being broken for LTO usages.

$ cat square.c
int square(int x) { return x*x; }
$ cat test.c
#include <stdio.h>

int square(int);

int main(void)
{
        printf("%d squared is %d", 4, square(4));
}
$ gcc -flto -c square.c
$ ar rcs  libsquare.a square.o
ar: square.o: plugin needed to handle lto object

libsquare.a was still created, but linking it did not work

$ gcc test.c libsquare.a
C:/programming/toolchains/w64devkit/bin/ld.exe: C:\Users\peter\AppData\Local\Temp\ccUerGLN.o:test.c:(.text+0x67): undefined reference to `square'
collect2.exe: error: ld returned 1 exit status

gcc-ar however, did work correctly.

What fixed it was copying w64devkit\libexec\gcc\x86_64-w64-mingw32\14.2.0\liblto_plugin.dll to w64devkit\lib\bfd-plugins\liblto_plugin.dll

I suspect this has to do with the strange way gcc is configured by w64devkit https://github.com/skeeto/w64devkit/blob/8ea0fee87c88b27316ffcf9e0d1ef0caa4cc8a91/Dockerfile#L256-L257 with the sysroot having $ARCH at the end of it

I did not have time to test any theories relating to this, but perhaps @skeeto might have some ideas. I haven't found this to be necessary in my own experiments building mingw-w64 toolchains outside of w64devkit.