Implement a multiple runtime approach in the engine so we can enable all CPU features with auto vectorization

RevoluPowered commented 1 year ago

Describe the project you are working on

The Mirror Game Development Platform

Describe the problem or limitation you are having in your project

We would like to use AVX/AVX512/AVX2/AVX256 in a compatible way as it gives us about 25% increased performance/FPS under load.

Why not use `attribute` or define runtime AVX types?

It misses out on automated vectorization which can be done by the compiler, it makes literally everything manual vs. most things being automated.
It's also not compatible everywhere.
Underlying structures have to be changed to use _m128d etc.

Describe the feature/enhancement and how it helps to overcome the problem or limitation

We implement a scons command use_runtimes=yes and a python file that stores the enabled runtimes. (AVX, NO AVX, MMX, ETC)

We create a small (100kb executable) that checks the CPU functionality of the machine and picks the correct "template" in the runtime folder. This would check what CPU extensions are supported and pick the template present with the highest feature set that is compatible with the hardware.

The user compiles the engine with the standard options: they do not get the runtimes folder. The user compiles with the argument use_runtimes=yes to enable the runtime folder mode NOTE: everything stays the same unless the user enables this feature.

Scons then checks a file runtimes.py for the AVX/SIMD options to build for. (I assume some projects might want to be picky about their features to support)

Example a binary is generated with AVX512:

On Windows the compiler option is: /arch:AVX512

On Clang and GCC (the best option to use is -march targeting a specific CPU, on apple this is easy as there is basically only two (intel with AVX512 and M1 with NEON)) -march=xeon_phi

So at the end you will have the following:

- MyAwesomeGodotApp.binary
runtimes/
- legacy.bin (no AVX, default Godot Template)
- intel-pentium.bin # our existing template build with march=pentium (MMX)
- intel-high_end.bin # template with AVX256, AVX2, AVX  enabled (even some more specific optional CPU instructions)
- amd-high-end.bin # same as intel (AVX512 + all of above)

This is obviously a scenario where it would use much more space, but I'll put this in perspective, a larger game doesn't mind using an extra 1-2 GB, so it can use the client's CPU properly.

Why use -march? Simply: NEON support comes with it too.

Benchmark tests to support this proposal:

https://github.com/godotengine/godot-proposals/issues/4563#issuecomment-1720452253

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

as above

If this enhancement will not be used often, can it be worked around with a few lines of script?

no not really, it's annoying to write without doing it upstream

Is there a reason why this should be core and not an add-on in the asset library?

it's impossible to be an addon it requires godot is directly compiled with it

Why I think this is a good approach

Older CPU's get some care taken to them MMX etc all works.
Newer CPUs can take advantage of newer instruction sets.
Games can add or remove support with their own control over which "features" you want to pre-compile for the users.
Performance is theoretically higher than using avx options
thirdparty libraries can just work with AVX correctly. (tested and working)
any libraries that are missing AVX kernels like embree can just add it for increased performance (if embree is being kept)
you can use any custom cpu instructions this way, not just AVX.

Why I dislike this approach

More builds (but this can be done when you're shipping rather than during development)
Bigger "app folder/binary" NOTE: We would not need to run these builds on the godotengine/godot repository.

Estimated time to do this:

(2-3 days) to write a small target for checking host cpu capabilities
(2-3 days) to write a target list for scons so it can create a list of the platforms in advance and compile them all.
(2-3 days) a new bug will appear likely mac will need to put the binaries all under MyApp.app/Contents/<> so that PCK loading works.

It could probably be crunched out in a weekend if done correctly.

yosoyfreeman commented 1 year ago

It could probably be crunched out in a weekend if done correctly.

Things can be done correctly or crunched, but not both.

Calinou commented 1 year ago

- legacy.bin
- intel-pentium.bin
- intel-high_end.bin
- amd-high-end.bin

I wouldn't name these according to CPU brands or models (also because 11th-gen Intel supports AVX512). Just name them according to the highest instruction set they use (sse2, sse4.2, avx2_fma3 and avx512 respectively).

Also, as I mentioned in https://github.com/godotengine/godot-proposals/issues/4563#issuecomment-1720452253, I think providing sse2, avx2_fma3 and avx512 would cover the existing spectrum of CPUs very well. SSE2 covers really old CPUs, AVX2 + FMA3 covers the majority of CPUs out there and AVX512 provides additional benefits to some modern CPUs. That said, I would do benchmarks first to see if compiling with AVX512 on a supported CPU is really worth the effort.

RevoluPowered commented 1 year ago

From our discussion in Godot Engine chat, we discussed:

using dynamic dispatch

I think we could possibly try to get the best of both worlds.

Also I will provide a log here with the parts of the godot code base which have benefits from auto-vectorization perhaps they could be targeted by dynamic dispatch.

There is a clang flag to dump all successful SIMD autovectorizations: clang -Rpass=loop-vectorize

And failed vectorizations: clang -Rpass-analysis=loop-vectorize

lawnjelly commented 1 year ago

You can e.g. make a build with SSE2 as it is mandated on 64 bit x86, and one with e.g. AVX and earlier (but beware of missing in between features!). Just to mention the alternative (which most use) is to detect CPU at startup and use dynamic dispatch for hot paths.

See info in https://github.com/godotengine/godot-proposals/issues/290 .

You'll get some "free" gain for just compiling with e.g. AVX and letting the compiler do its best (although I wonder if SSE2 gets a lot of low hanging fruit already), but you'll likely get a lot more gains by profiling for hotspots and writing intrinsics for these (or even using auto-vectorization with dynamic dispatch if that works). And don't forget restructuring the data to be cache / SIMD friendly as this often is the limiting factor.

As ever, profiling is key because in a lot of cases the CPU is waiting on loading stuff into cache rather than being overworked.

peastman commented 1 year ago

Just allowing the compiler to use AVX intructions should be beneficial. For example, it introduced a new encoding scheme that allows for three argument operations like a=b+c. Prior to AVX, it only had two argument math operations of the form a+=b. That eliminates a lot of extra instructions for shuffling data between registers.

To get the full benefit, though, you need to take advantage of vectorization. Auto-vectorization may do a little, but @lawnjelly is right that there's no substitute for restructuring data and writing intrinsics. But that's a big project, and a simple change to the compiler flags can still help a lot.

Jm15itch commented 1 year ago

If we were to go with this, I believe it would be best to name these as x86-64-v1.bin, x86-64-v2.bin, etc to be consistent with the x86-64 abi. Info on that can be found here: https://gitlab.com/x86-psABIs/x86-64-ABI

Facundo15 commented 1 year ago

I am following this proposal quite closely due to some interest in intrinsics, but I have a few comments to add with the use of AVX intrinsics, specifically those of 256 and 512.

SSE would already meet the vast majority of use cases for the engine, those are the addition, subtraction and multiplication of vectors to take advantage of them in physics, or anywhere else. There is no need to use larger instructions for other cases unless you can create a function that adds several vectors at the same time.

This is a code snippet I'm working on to write a bullet hell module and use the different types of intrinsics for each case.

__m128 xmm_position = _mm_set_ps(
p1->position.y, p1->position.x,
p2->position.y, p2->position.x);

__m128 xmm_direction = _mm_set_ps(
p1->direction.y, p1->direction.x,
p2->direction.y, p2->direction.x);

__m128 xmm_speed = _mm_set_ps(p1->speed, p1->speed, p2->speed, p2->speed);
__m128 xmm_delta = _mm_set_ps1(delta);
__m128 xmm_sp_d_result = _mm_mul_ps(xmm_speed, xmm_delta);

__m128 result = _mm_add_ps(xmm_position, _mm_mul_ps(xmm_direction, xmm_sp_d_result));
const float *f_result = reinterpret_cast<float *>(&result);
p1->position.x = f_result[3];
p1->position.y = f_result[2];
p2->position.x = f_result[1];
p2->position.y = f_result[0];

Where AVX256 and AVX512 can be greatly used is in the PackedVector3Array where it is more likely that you will want to apply a basic addition, operation and subtraction operation to all the vectors and you can have much fewer iterations and instructions to operate on multiple data. And with large sets of elements, there is a clear improvement.

Where can you take advantage of this SIMD improvement with PackedVector3Array?

Right now I can say that the best beneficiaries would be CPU Particles and animations that require moving multiple vertices of a model to the same direction

peastman commented 1 year ago

SSE would already meet the vast majority of use cases for the engine, those are the addition, subtraction and multiplication of vectors

That's true in single precision. If you want to compile in double precision, SSE can only process two elements at a time but AVX can process the whole vector.

godotengine / godot-proposals