Evaluation of LTO configuration for all targets, and its impact on build time, build size, and performance

akien-mga commented 2 months ago

For years we've operated under the assumption that LTO (Link Time Optimization) is a net positive for production builds as it would:

Increase performance (notably up to 20% in the GDScript VM with GCC LTO)
Reduce build size

The drawback is much longer build times, hence why it's only used for production builds/official releases.

Now findings in #96785 suggest that the reduction in build size only holds for GCC's LTO, and not for LLVM LTO (whether "full" LTO -flto or ThinLTO -flto=thin). With LLVM LTO, build size increases by up to 15% on the platforms we've tested so far (Web, Android, Linux). For the Web (currently using LTO for official builds) and Android (not using it for now), an increase of that size matters.

So it's time we do a thorough review of build flags for all targets and compilers and make sure we're actually using the best configuration possible for official builds.

I'll post successive replies for each Godot target platform so we can use these posts (maintainers are welcome to edit my posts) to keep track of metrics and findings for each platform individually. If that turns out to be too unwieldy we can split this issue into one issue per platform, but I expect we'll find closely related behavior across platforms that share a compiler toolchain (GCC, LLVM, MSVC).

@godotengine/buildsystem @godotengine/android @godotengine/ios @godotengine/linux-bsd @godotengine/macos @godotengine/web @godotengine/windows

akien-mga commented 2 months ago

Android

Toolchains:

Android NDK (LLVM)

akien-mga commented 2 months ago

iOS

Toolchains:

Xcode (LLVM)

akien-mga commented 2 months ago

Linux

Toolchains:

GCC (official builds toolchain)
LLVM

akien-mga commented 2 months ago

macOS

Toolchains:

Xcode (LLVM)

Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.6.0

Release template

scons target=template_release arch=arm64 platform=macos production=yes lto=*

LTO	Build Time	Peak memory usage	Executable size
none	7:10	sub 1G	68,082,840
thin	9:45	~ 2.5G	74,674,240
full	19:26	~ 12G [^1]	66,936,680

[^1]: Mostly around 6G with a spike at the end of linking.
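
To put the release-template sizes above in relative terms, here is a minimal Python sketch that computes the change against the no-LTO baseline. The figures are copied from the table and read as byte counts; the helper name `size_change_percent` is made up for illustration.

```python
def size_change_percent(baseline: int, candidate: int) -> float:
    """Return the size change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Executable sizes in bytes, copied from the macOS release-template table.
MACOS_RELEASE_SIZES = {
    "none": 68_082_840,
    "thin": 74_674_240,
    "full": 66_936_680,
}

baseline_size = MACOS_RELEASE_SIZES["none"]
for mode, size in MACOS_RELEASE_SIZES.items():
    change = size_change_percent(baseline_size, size)
    print(f"{mode:>4}: {size:>11,d} bytes ({change:+.2f}%)")
```

With these inputs, ThinLTO comes out roughly 9.7% larger than the no-LTO build and full LTO roughly 1.7% smaller.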

Debug template

scons target=template_debug arch=arm64 platform=macos production=yes lto=*

LTO	Build Time	Peak memory usage	Executable size
none	9:51	sub 1G	71,861,672
thin	13:28	~ 2.5G	84,408,856
full	42:52 [^2]	~ 18G	74,752,334

[^2]: A lot of swap usage, so time is not directly comparable.

akien-mga commented 2 months ago

Web

Toolchains:

Emscripten (LLVM)

akien-mga commented 2 months ago

Windows

Toolchains:

MSVC cl.exe (MSVC)

MSVC clang-cl.exe (LLVM)

Release template

scons target=template_release production=yes use_llvm=yes lto=*

LTO	Build Time	Executable size
none	04:09.81	58,270,720
thin	04:46.66	68,056,064
full	N/A[^1]	N/A

[^1]: Attempted to build for ~20 minutes before erroring out.

Debug template

scons target=template_debug production=yes use_llvm=yes lto=*

LTO	Build Time	Executable size
none	04:12.88	73,480,704
thin	05:03.34	86,334,976
full	N/A	N/A

mingw-gcc (GCC) (official builds toolchain for x86_64 / x86_32)

llvm-mingw (LLVM) (official builds toolchain for arm64)

Release template

scons target=template_release production=yes use_llvm=yes use_mingw=yes lto=*

LTO	Build Time	Executable size
none	04:36.49	63,627,776
thin	04:55.96	73,898,496
full	14:38.60	70,736,896

Debug template

scons target=template_debug production=yes use_llvm=yes use_mingw=yes lto=*

LTO	Build Time	Executable size
none	04:37.85	68,219,392
thin	05:14.51	79,650,304
full	15:46.82	76,373,504
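
For comparing build times, the mm:ss.ff values in the llvm-mingw release table above can be converted to seconds and expressed as a slowdown factor relative to the no-LTO build. This is only a sketch: `parse_build_time` is a hypothetical helper, and it assumes every entry uses the minutes:seconds format shown in the tables.

```python
def parse_build_time(text: str) -> float:
    """Convert a 'mm:ss.ff' build time string into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


# Build times copied from the llvm-mingw release-template table.
MINGW_RELEASE_TIMES = {
    "none": "04:36.49",
    "thin": "04:55.96",
    "full": "14:38.60",
}

baseline_seconds = parse_build_time(MINGW_RELEASE_TIMES["none"])
slowdowns = {
    mode: parse_build_time(value) / baseline_seconds
    for mode, value in MINGW_RELEASE_TIMES.items()
}

for mode, factor in slowdowns.items():
    print(f"{mode:>4}: {factor:.2f}x the no-LTO build time")
```

On these numbers ThinLTO takes about 1.07x the no-LTO build time and full LTO about 3.2x.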

lawnjelly commented 2 months ago

Something to bear in mind with LTO:

SCU builds will likely get the lion's share of the benefit, without needing LTO. This is because they push a bunch of files into the same translation unit, which means that the compiler can optimize across .cpp files (which afaik is what LTO offers, the more convoluted way around).

We so far haven't used them in production, but it's worth mentioning as an alternative (no idea how their size or performance compares in release).
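
As a rough illustration of the idea (not Godot's actual SCU generator), a unity build can be sketched as grouping source files into a few generated files that #include them, so the compiler sees each group as one translation unit. The function name, file names, and chunk size below are all hypothetical.

```python
from typing import List


def make_scu_chunks(sources: List[str], files_per_chunk: int) -> List[str]:
    """Return the text of generated unity files, each including a group of sources."""
    if files_per_chunk < 1:
        raise ValueError("files_per_chunk must be at least 1")
    chunks = []
    for start in range(0, len(sources), files_per_chunk):
        group = sources[start:start + files_per_chunk]
        # Each generated file is just a list of #include lines, so the
        # compiler processes the whole group as a single translation unit.
        chunks.append("".join(f'#include "{path}"\n' for path in group))
    return chunks


# Hypothetical source list, for illustration only.
demo_sources = ["a.cpp", "b.cpp", "c.cpp"]
for index, text in enumerate(make_scu_chunks(demo_sources, 2)):
    print(f"// chunk {index}")
    print(text)
```

Because the output is still several chunks rather than one file, cross-file optimization only happens within each group.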

Calinou commented 2 months ago

> We so far haven't used them in production, but it's worth mentioning as an alternative (no idea how their size or performance compares in release).

Using SCU builds for fully optimized release builds can need a lot of RAM (I've measured 22 GB for the build process alone on Linux x86_64), so this is something to keep in mind. That said, the release build server has plenty of RAM to spare.

dustdfg commented 3 weeks ago

> SCU builds will likely get the lion's share of the benefit, without needing LTO. This is because they push a bunch of files into the same translation unit, which means that the compiler can optimize across .cpp files (which afaik is what LTO offers, the more convoluted way around).

I've just run two builds, one with SCU and one without, both with LLVM and without LTO:

scons target="editor" use_llvm="yes" lto="none"

SCU build: 121,314,392 bytes
Non-SCU build: 120,724,664 bytes

Performance impact: unknown (I didn't test). Size difference: the SCU build is ~590 KB (about 0.5%) larger.

Godot's SCU build does not create one big file from all the sources. It glues files together into larger ones, but still produces many translation units rather than a single one. On top of that, many files are still built as usual even in an SCU build.

LTO, on the other hand, is performed on the final executable (on "all" the files at once), so in general SCU can't compete with LTO. SCU may have some performance impact, but I think it is negligible, though I didn't test performance.

dustdfg commented 3 weeks ago

> > We so far haven't used them in production, but it's worth mentioning as an alternative (no idea how their size or performance compares in release).
>
> Using SCU builds for fully optimized release builds can need a lot of RAM (I've measured 22 GB for the build process alone on Linux x86_64), so this is something to keep in mind. That said, the release build server has plenty of RAM to spare.

I have a low-end device with only 4 threads. RAM usage depends greatly on the number of parallel threads. I saw peaks of 6 GB with SCU (about 1.4 GB of that was Firefox).

If SCU is only going to be used for release builds made by users (I mean end users compiling a custom template for their game), I think it's bearable enough to use fewer threads in order to use less RAM. So if SCU really has an impact, it'd be reasonable to mention it as an optimization tool.
