godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License

Evaluation of LTO configuration for all targets, and its impact on build time, build size, and performance #96851

Open akien-mga opened 2 months ago

akien-mga commented 2 months ago

For years we've operated under the assumption that LTO (Link Time Optimization) is a net positive for production builds as it would:

The drawback is much longer build times, which is why it's only used for production builds/official releases.

Now findings in #96785 suggest that the reduction in build size is only true for GCC's LTO, and not for LLVM LTO (whether "full" LTO -flto or ThinLTO -flto=thin). With LLVM LTO there's a significant size increase for platforms we tested so far (Web, Android, Linux) of up to +15%. For the Web (currently using LTO for official builds) and Android (not using it for now) this is significant.

So it's time we do a thorough review of build flags for all targets and compilers and make sure we're actually using the best configuration possible for official builds.

I'll post successive replies for each Godot target platform so we can use these posts (maintainers are welcome to edit my posts) to keep track of metrics and findings for each platform individually. If that turns out to be too unwieldy we can split this issue into one issue per platform, but I expect we'll find closely related behavior across platforms that share a compiler toolchain (GCC, LLVM, MSVC).

@godotengine/buildsystem @godotengine/android @godotengine/ios @godotengine/linux-bsd @godotengine/macos @godotengine/web @godotengine/windows

akien-mga commented 2 months ago

Android

Toolchains:

akien-mga commented 2 months ago

iOS

Toolchains:

akien-mga commented 2 months ago

Linux

Toolchains:

akien-mga commented 2 months ago

macOS

Toolchains:

Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.6.0

Release template

scons target=template_release arch=arm64 platform=macos production=yes lto=*

| LTO | Build Time | Peak memory usage | Executable size |
| --- | --- | --- | --- |
| none | 7:10 | sub 1G | 68.082.840 |
| thin | 9:45 | ~ 2.5G | 74.674.240 |
| full | 19:26 | ~ 12G [^1] | 66.936.680 |

[^1]: Mostly around 6G with a spike at the end of linking.
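
To make the size impact easier to compare, the relative change against the non-LTO build can be computed from the release-template figures above. This is only a minimal sketch: the numbers are copied from the table (reading the dots as thousands separators), and the helper name `relative_change` is made up for illustration.

```python
# Executable sizes copied from the macOS release-template table above.
# Assumption: the dots in the table are thousands separators.
sizes = {"none": 68_082_840, "thin": 74_674_240, "full": 66_936_680}


def relative_change(size, baseline):
    """Return the size change relative to the baseline, in percent."""
    return (size - baseline) / baseline * 100.0


baseline = sizes["none"]
for mode, size in sizes.items():
    # Print each LTO mode with its signed percentage change versus lto=none.
    print(f"{mode:>4}: {size:>11,}  {relative_change(size, baseline):+.2f}%")
```

With these figures, ThinLTO comes out roughly 10% larger than the non-LTO build, while full LTO is slightly smaller.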

Debug template

scons target=template_debug arch=arm64 platform=macos production=yes lto=*

| LTO | Build Time | Peak memory usage | Executable size |
| --- | --- | --- | --- |
| none | 9:51 | sub 1G | 71.861.672 |
| thin | 13:28 | ~ 2.5G | 84.408.856 |
| full | 42:52 [^2] | ~ 18G | 74.752.334 |

[^2]: A lot of swap usage, so time is not directly comparable.
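
For anyone reproducing these measurements, the `lto=*` in the commands above presumably stands for the three values listed in the tables (`none`, `thin`, `full`). The following sketch expands such a command into one concrete invocation per mode; the helper `expand_lto` is hypothetical and only builds the argument lists, it does not run SCons.

```python
import shlex

# Assumption: these are the three rows of the tables above.
LTO_MODES = ("none", "thin", "full")


def expand_lto(command_template, modes=LTO_MODES):
    """Replace the 'lto=*' placeholder with one concrete value per mode."""
    argv = shlex.split(command_template)
    if "lto=*" not in argv:
        raise ValueError("template has no 'lto=*' placeholder")
    return [[f"lto={m}" if arg == "lto=*" else arg for arg in argv] for m in modes]


commands = expand_lto(
    "scons target=template_debug arch=arm64 platform=macos production=yes lto=*"
)
for argv in commands:
    # Each argv could be passed to subprocess.run() and timed separately.
    print(shlex.join(argv))
```

Each resulting argument list can then be timed on its own, ideally starting from a clean build directory so the runs stay comparable.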

akien-mga commented 2 months ago

Web

Toolchains:

akien-mga commented 2 months ago

Windows

Toolchains:

| LTO | Build Time | Executable size |
| --- | --- | --- |
| none | 04:09.81 | 58,270,720 |
| thin | 04:46.66 | 68,056,064 |
| full | N/A[^1] | N/A |

[^1]: Attempted to build for ~20 minutes before erroring out.

Debug template

scons target=template_debug production=yes use_llvm=yes lto=*

| LTO | Build Time | Executable size |
| --- | --- | --- |
| none | 04:12.88 | 73,480,704 |
| thin | 05:03.34 | 86,334,976 |
| full | N/A | N/A |

| LTO | Build Time | Executable size |
| --- | --- | --- |
| none | 04:36.49 | 63,627,776 |
| thin | 04:55.96 | 73,898,496 |
| full | 14:38.60 | 70,736,896 |

Debug template

scons target=template_debug production=yes use_llvm=yes use_mingw=yes lto=*

| LTO | Build Time | Executable size |
| --- | --- | --- |
| none | 04:37.85 | 68,219,392 |
| thin | 05:14.51 | 79,650,304 |
| full | 15:46.82 | 76,373,504 |

lawnjelly commented 2 months ago

Something to bear in mind with LTO:

SCU builds will likely get the lion's share of the benefit, without needing LTO. This is because they push a bunch of files into the same translation unit, which means that the compiler can optimize across .cpp files (which, afaik, is what LTO offers in a more convoluted way).

We so far haven't used them in production, but it's worth mentioning as an alternative (no idea how their size or performance compares in release).
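
To illustrate the idea of pushing several files into one translation unit, here is a minimal sketch of how a unity/SCU file could be generated. It is not Godot's actual SCU generator: the function names, the chunk size, and the file names are all hypothetical.

```python
from pathlib import Path


def make_scu_chunks(sources, chunk_size):
    """Group source paths into chunks; each chunk becomes one translation unit."""
    ordered = sorted(sources)
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


def render_scu_file(chunk):
    """Return the text of a unity file that #includes every .cpp in the chunk."""
    lines = ["// Generated unity file (illustrative only)."]
    lines += [f'#include "{Path(p).as_posix()}"' for p in chunk]
    return "\n".join(lines) + "\n"


# Hypothetical source files, not taken from the Godot tree.
sources = ["a.cpp", "b.cpp", "c.cpp", "d.cpp", "e.cpp"]
chunks = make_scu_chunks(sources, chunk_size=2)
unity_files = [render_scu_file(chunk) for chunk in chunks]
print(unity_files[0])
```

The compiler then sees each generated file as a single translation unit, so it can inline and optimize across the included sources, but only within a chunk and not across the whole program.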

Calinou commented 2 months ago

> We so far haven't used them in production, but it's worth mentioning as an alternative (no idea how their size or performance compares in release).

Using SCU builds for fully optimized release builds can need a lot of RAM (I've measured 22 GB for the build process alone on Linux x86_64), so this is something to keep in mind. That said, the release build server has plenty of RAM to spare.

dustdfg commented 3 weeks ago

> SCU builds will likely get the lion's share of the benefit, without needing LTO. This is because they push a bunch of files into the same translation unit, which means that the compiler can optimize across .cpp files (which, afaik, is what LTO offers in a more convoluted way).

I've just run two builds, one with SCU and one without, both with LLVM and without LTO:

scons target="editor" use_llvm="yes" lto="none"

Performance impact: ??? (didn't test)

Size difference: ~6KB

Godot's SCU does not create one big file from all the sources. It glues files together into bigger files, but still produces many files rather than a single one. Not to mention that lots of files are still built as usual even in an SCU build.

At the same time, LTO is performed on the final executable (on "all" the files at once), so in general SCU can't compete with LTO. While SCU possibly has some performance impact, I think it is negligible, though I didn't test performance.

dustdfg commented 3 weeks ago

> > We so far haven't used them in production, but it's worth mentioning as an alternative (no idea how their size or performance compares in release).
>
> Using SCU builds for fully optimized release builds can need a lot of RAM (I've measured 22 GB for the build process alone on Linux x86_64), so this is something to keep in mind. That said, the release build server has plenty of RAM to spare.

I have a low-end device with only 4 threads. RAM usage depends greatly on the number of parallel threads. I saw peaks of 6 GB with SCU (about 1.4 GB of that was Firefox).

If SCU is only going to be used for release builds made by users (I mean end users who compile a custom template for their game), I think it is bearable enough to use fewer threads in order to use less RAM. So if SCU really has an impact, it would be reasonable to mention SCU as an optimization tool.