godotengine / godot-proposals

Godot Improvement Proposals (GIPs)

Enable SSE 4.2 in Embree to improve occlusion culling (and CPU lightmapper) performance #3932

Open Calinou opened 2 years ago

Calinou commented 2 years ago

Describe the project you are working on

The Godot editor 🙂

Describe the problem or limitation you are having in your project

Godot makes use of Embree for raster occlusion culling and baking lightmaps on the CPU. Embree supports a vast array of CPU feature sets which can be enabled to improve performance (at the cost of compatibility).

Right now, Godot uses the lowest baseline, which is SSE2. Pretty much any CPU released in the last 15 years supports it, which means that even Intel Core 2 Duo-based systems can run Godot's master branch (assuming they are paired with a recent enough GPU).

Note: This proposal only targets Godot 4.x, not Godot 3.x. This proposal also doesn't affect ARM architecture builds or HTML5.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Use SSE 4.2 as a baseline in Embree on x86_64.

This baseline would also allow for enabling SSE 4.2 optimizations in the C++ compiler used to build official Godot binaries.

On 32-bit x86, only SSE and SSE2 should be required (as is done currently).

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

This change can be carried out here by replacing EMBREE_TARGET_SSE2 with EMBREE_TARGET_SSE42: https://github.com/godotengine/godot/blob/d64b27e510d3bda72730717f1bce0f5a3470a689/modules/raycast/SCsub#L66

GCC flags in SCons files for Windows (MinGW) and macOS should be changed from -msse2 to -msse4.2. On Linux, the flag should be added as it's not currently present (it uses the architecture default).
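As a rough sketch of the change described above, the per-architecture choice could be written as a small Python helper of the kind a SCons script might call. The function name `simd_baseline` and its return shape are hypothetical and not part of Godot's build system; only the define names and compiler flags are taken from this proposal.

```python
def simd_baseline(arch):
    """Return the Embree target define and GCC-style flags for an architecture.

    Hypothetical helper for illustration only; Godot's real SCsub and
    platform detect files are organized differently.
    """
    if arch == "x86_64":
        # Proposed change: raise the 64-bit baseline from SSE2 to SSE4.2.
        return {"define": "EMBREE_TARGET_SSE42", "ccflags": ["-msse4.2"]}
    if arch == "x86_32":
        # 32-bit x86 keeps the current SSE2 baseline.
        return {"define": "EMBREE_TARGET_SSE2", "ccflags": ["-msse2"]}
    # Other architectures (ARM, HTML5) are not affected by this proposal.
    return {"define": None, "ccflags": []}


config = simd_baseline("x86_64")
print(config["define"], config["ccflags"])
```

In an SCons script, the result could then be applied with `env.Append(CPPDEFINES=[config["define"]])` and `env.Append(CCFLAGS=config["ccflags"])`, skipping the define when it is `None`.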

After making this change and recompiling with target=release, elfx86exts can be used on Linux to confirm that binaries make use of the new instruction sets as expected.


On the Intel side, the first generation to support this is Nehalem, with the first CPU released in Q4 2008. Unlike AVX and AVX2, there is no market segmentation for SSE4.2 that I know of – even the lowest-end Celeron from Sandy Bridge still has SSE4.2 support.

On the AMD side, the first generation to support this is Bulldozer (only on the FX series), with the first CPU released in Q4 2011. APUs from the Bulldozer generation do not support SSE 4.2 though.

Alternatively, we could compile both SSE2 and SSE4.2 code paths into a single binary, but this requires additional research. It would also make binaries larger (likely +1 MB, if not more). On the bright side, this would allow for supporting additional CPU feature sets, such as AVX and AVX2 that are supported on most modern CPUs (but are too recent to work as a baseline).
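To illustrate what selecting a code path at run time involves, here is a minimal Python sketch that reads the feature flags Linux exposes in `/proc/cpuinfo` and picks the most capable path available. It is only a model of the idea: a real implementation would live in C++ and query CPUID directly, and the function names and path labels below are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Extract the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def pick_code_path(flags):
    """Choose the most capable code path the CPU supports (labels are hypothetical)."""
    # Checked from most to least demanding; SSE2 is the fallback baseline.
    for flag, path in (("avx2", "avx2"), ("avx", "avx"), ("sse4_2", "sse42")):
        if flag in flags:
            return path
    return "sse2"


# Example with a shortened flags line from an SSE4.2-capable CPU without AVX.
sample = "model name\t: Example CPU\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"
print(pick_code_path(cpu_flags(sample)))
```

On a Linux machine, `pick_code_path(cpu_flags(open("/proc/cpuinfo").read()))` would report the path for the local CPU.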

Many games and applications now require SSE 4.2 as a baseline. While this is usually fine, there are still CPUs in use that don't support SSE 4.2 – mainly AMD APUs sold between 2011 and 2013. From a web search, there are still people running into trouble because of this.

However, it should be kept in mind that Godot 4.0 is unlikely to run on those machines as it is, since you'll need a GPU that supports Vulkan. It's very likely that such a GPU will be paired with a CPU that supports SSE4.2. Only Godot 4.1 will feature a production-ready OpenGL renderer, but its release will likely have to wait until H2 2023. Until then, users on old machines will keep using Godot 3.x, which will keep its SSE2 baseline as mentioned above.

I think that by H2 2023 (when most users on old machines will upgrade to Godot 4.1), such old (and generally low-end) APUs are most likely not going to be used anymore. Therefore, SSE 4.2 will be present on pretty much any x86 machine still in use for playing and developing games.

If this enhancement will not be used often, can it be worked around with a few lines of script?

No, as this is about changing build-time options for official editor and export template binaries.

Is there a reason why this should be core and not an add-on in the asset library?

This is about changing build-time options for official editor and export template binaries.

SoyoTamo commented 2 years ago

> Alternatively, we could compile both SSE2 and SSE4.2 code paths into a single binary, but this requires additional research. It would also make binaries larger (likely +1 MB, if not more). On the bright side, this would allow for supporting additional CPU feature sets, such as AVX and AVX2 that are supported on most modern CPUs.

I personally support progress, but I think this option would be the fairest for everyone. I don't think I could get a new PC this year or next, and I think many others are in the same situation. 1 or 2 MB is a reasonable price to pay to give many people the opportunity to try 4.1.

Calinou commented 2 years ago

I started working on an implementation of this: https://github.com/Calinou/godot/tree/scons-use-sse4.2

It currently doesn't leave 32-bit x86 alone – this should be changed before a PR can be opened. Edit: Now affects x86_64 only.

elfx86exts reports for old and new release export templates:

Instructions in the binary

Current

❯ elfx86exts godot.linuxbsd.opt.64
MODE64 (call)
CMOV (cmovle)
SSE1 (movss)
SSE2 (pxor)
BMI (tzcnt)
MMX (movq)
AES (aesenc)
PCLMUL (pclmulqdq)
BMI2 (shlx)
CPU Generation: Haswell

With the above branch

❯ elfx86exts godot.linuxbsd.opt.64.sse4.2 
MODE64 (call)
CMOV (cmovle)
SSE1 (movss)
SSE2 (pxor)
SSE41 (pmaxsd)
SSSE3 (pshufb)
SSE3 (movddup)
BMI (tzcnt)
MMX (movq)
SSE42 (pcmpgtq)
AES (aesenc)
PCLMUL (pclmulqdq)
BMI2 (shlx)
CPU Generation: Unknown
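The two reports above can also be compared mechanically. The following Python sketch parses elfx86exts-style output and lists the instruction sets that appear only in the new binary. The parsing rule (a name followed by a parenthesized example instruction) is inferred from the output shown here rather than from elfx86exts documentation, and the function name is hypothetical.

```python
def parse_elfx86exts(report):
    """Return the set of instruction-set names listed in an elfx86exts report."""
    names = set()
    for line in report.splitlines():
        parts = line.split()
        # Feature lines look like "SSE42 (pcmpgtq)"; other lines are skipped.
        if len(parts) == 2 and parts[1].startswith("(") and parts[1].endswith(")"):
            names.add(parts[0])
    return names


CURRENT_REPORT = """MODE64 (call)
CMOV (cmovle)
SSE1 (movss)
SSE2 (pxor)
BMI (tzcnt)
MMX (movq)
AES (aesenc)
PCLMUL (pclmulqdq)
BMI2 (shlx)
CPU Generation: Haswell"""

NEW_REPORT = """MODE64 (call)
CMOV (cmovle)
SSE1 (movss)
SSE2 (pxor)
SSE41 (pmaxsd)
SSSE3 (pshufb)
SSE3 (movddup)
BMI (tzcnt)
MMX (movq)
SSE42 (pcmpgtq)
AES (aesenc)
PCLMUL (pclmulqdq)
BMI2 (shlx)
CPU Generation: Unknown"""

current = parse_elfx86exts(CURRENT_REPORT)
new = parse_elfx86exts(NEW_REPORT)
added = sorted(new - current)
removed = sorted(current - new)
print("added:", added)      # added: ['SSE3', 'SSE41', 'SSE42', 'SSSE3']
print("removed:", removed)  # removed: []
```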

Binary sizes are almost identical: once both binaries are stripped, the SSE4.2-enabled export template is 4 KB smaller.

Benchmark

The test project instances 500 RigidDynamicBody3D nodes, then quits as soon as possible: test_sse4.2.zip

❯ hyperfine -iw1 "bin/godot.linuxbsd.opt.64.stripped --path ~/Documents/Godot/test_sse4.2 --quit" "bin/godot.linuxbsd.opt.64.sse4.2.stripped --path ~/Documents/Godot/test_sse4.2 --quit"
Benchmark #1: bin/godot.linuxbsd.opt.64.stripped --path ~/Documents/Godot/test_sse4.2 --quit
  Time (mean ± σ):      2.394 s ±  0.282 s    [User: 1.508 s, System: 0.165 s]
  Range (min … max):    1.605 s …  2.546 s    10 runs

Benchmark #2: bin/godot.linuxbsd.opt.64.sse4.2.stripped --path ~/Documents/Godot/test_sse4.2 --quit
  Time (mean ± σ):      2.199 s ±  0.429 s    [User: 1.499 s, System: 0.169 s]
  Range (min … max):    1.578 s …  2.544 s    10 runs

Summary
  'bin/godot.linuxbsd.opt.64.sse4.2.stripped --path ~/Documents/Godot/test_sse4.2 --quit' ran
    1.09 ± 0.25 times faster than 'bin/godot.linuxbsd.opt.64.stripped --path ~/Documents/Godot/test_sse4.2 --quit'
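For reference, the ratio and uncertainty in the summary can be reproduced from the two means and standard deviations. The Python sketch below assumes first-order error propagation for a ratio of two independent measurements. That assumption matches the reported figures, but it is not taken from hyperfine's source, and the function name is hypothetical.

```python
import math


def speedup(mean_slow, sd_slow, mean_fast, sd_fast):
    """Return (ratio, uncertainty) for mean_slow / mean_fast.

    Uses first-order propagation of independent relative errors.
    """
    ratio = mean_slow / mean_fast
    relative_error = math.sqrt((sd_slow / mean_slow) ** 2 + (sd_fast / mean_fast) ** 2)
    return ratio, ratio * relative_error


# Means and standard deviations from the two hyperfine runs above.
ratio, err = speedup(2.394, 0.282, 2.199, 0.429)
print(f"{ratio:.2f} ± {err:.2f}")  # 1.09 ± 0.25
```

Since the uncertainty (0.25) is larger than the measured gain (0.09), this particular run does not by itself show a clear difference between the two binaries.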