godotengine / godot-proposals

Godot Improvement Proposals (GIPs)
MIT License

Use SIMD vectorization intrinsics in the core #4563

Open peastman opened 2 years ago

peastman commented 2 years ago

Describe the project you are working on

Make the engine faster.

Describe the problem or limitation you are having in your project

In modern CPUs, most of the compute resources are found in the vector units. In fact, all floating point math is done with vector units. If you add two floating point numbers together, you are actually adding two vectors together and then discarding all but one element of the result. If you don't take advantage of vectorization, you are missing out on most of the potential performance.

Compilers try to automatically vectorize code, but they aren't very good at it. For all but the simplest cases, it is essential to write explicitly vectorized code if you want to get good performance.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

This proposal has two main parts.

  1. Introduce an API for writing vectorized code.
  2. Make use of it to optimize important operations throughout the engine.

The first one is a small, self-contained project. The second one is open-ended and will happen gradually as we introduce vectorization to important calculations. It also requires a third part as a side project:

  3. Add more benchmarks to https://github.com/godotengine/godot-benchmarks.

That will serve three important functions:

  1. Help us identify the bottleneck code that will benefit most from vectorization.
  2. Let us verify that the changes really make things faster.
  3. Make sure we don't inadvertently make something else slower in the process.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

This proposal is based on the discussion at #4544. The following considerations emerged from that discussion:

One possible approach would be to vectorize very low level functions like Vector3.operator+(). This is unlikely to have much benefit, however. Those functions expect their inputs and outputs to be stored in memory, not SIMD registers. The memory access will usually take much longer than the arithmetic. To get good performance, you need to put data into SIMD registers and then do as much computation as possible before writing it back to memory. Copying between memory and registers for every operation will not give good performance.

Potentially this could be addressed by reimplementing Vector3 to internally store its data in a __m128 (SSE) or float32x4_t (NEON) instead of a float[3]. That would allow much better performance, but also would break compatibility. Vector3 is a struct that directly exposes its internal representation. We can't change the representation without changing the API.

So I don't think we should do that at this time. We should consider a smaller change, though: adding a fourth element as padding to the internal storage, so sizeof(Vector3) would be 16 instead of 12. This would make vectorization much more efficient. Loading or storing a SIMD register from or to a span of memory of the same size takes only a single machine instruction. Loading or storing 3 of the 4 elements to a 12 byte memory field requires a more complex, slower sequence of operations.

This does have a risk of making some code slightly slower, due to increased memory traffic and decreased cache efficiency. I expect that to be minor and easily outweighed by the benefits to vectorization efficiency, but we need to test it to be sure.

In any case, vectorization will mostly involve adapting larger pieces of code where we can load inputs into registers and do lots of calculation before writing the results back to memory. A good collection of benchmarks will be essential for identifying those pieces of code.

There are a number of possible vectorization APIs we could choose. See the discussion in #4544 for links to some of them. I am proposing to adapt one that I created for another project I work on, OpenMM. I believe it is the best choice for our needs (and not just because I wrote it!). It involves a small amount of very simple, self-contained code. It provides a clean, easy-to-use interface that keeps vectorized code readable. It has been in continuous use since 2014 and has served our needs very well.

The API involves classes to represent SIMD vectors. For example fvec4 is a four component vector of floats. It gets compiled down to a raw vector (e.g. a __m128 for SSE) while implementing all standard math operators and functions. This lets you write simple code like

fvec4 x(1, 2, 3, 4);
fvec4 y = sqrt(x/2+1);
float z = y[3];

To use it, you include a single header file. It uses #ifdefs to include the appropriate implementation for the current architecture. For example, here is the implementation for SSE.

I propose to include the following classes: fvec4 (four float components), dvec4 (four double components), and ivec4 (four int components). (Other suggestions for names are welcome.) Because we don't want to use AVX, dvec4 will have to be implemented with two 128 bit registers, so it will be slower than the others. In addition, it will typedef rvec4 (four real components) to either fvec4 or dvec4, depending on whether you are compiling in single or double precision mode.

I will include three implementations of these classes: SSE (for x86), NEON (for ARM), and clang/gcc portable vectors. The portable version will work on WebAssembly and future architectures like RISC-V, although the performance may not be as good as using native intrinsics. Adding native implementations for those architectures in the future would not be difficult, since it just means creating a single file.

If this enhancement will not be used often, can it be worked around with a few lines of script?

No.

Is there a reason why this should be core and not an add-on in the asset library?

The goal is to make the core faster.

logzero commented 2 years ago

adding a fourth element as padding to the internal storage

That is the typical naive approach to SIMD. If you are looking for max performance you'd want to go with SIMD sized tuples of vectors fvec3simd { float x[SIMD_WIDTH]; float y[SIMD_WIDTH]; float z[SIMD_WIDTH]; } and adjust code to work on tuples.

Generally I'd also suggest to identify compute intensive code parts first and then convert them to SIMD. Naively converting all vector calculations to SIMD might have little to no benefit.

lawnjelly commented 2 years ago

Generally I'd also suggest to identify compute intensive code parts first and then convert them to SIMD. Naively converting all vector calculations to SIMD might have little to no benefit.

I would tend to agree with this. Coming up with a sensible strategy should probably start with profiling and identifying areas which might benefit. There are two factors here: 1) Optimizing bottlenecks 2) Death by a thousand cuts .. lots of small operations that don't appear on the profile might be adding up

(2) might be used as an argument for the naive changing Vector3 to have a dummy fourth component (personally I'm not sure this is worth doing). I'm not sure that (2) is a major factor currently, versus the cost of just reading the data from memory.

Usually when profiling, the major performance issues I see are mostly: 1) Higher level methods which could be improved 2) Doing things in a flat out inefficient way that could be done in more efficient way (e.g. using wrong data structure) 3) Data all over the place .. cache misses

Things that would benefit from SIMD (and make a big difference) may currently be rather limited, as the idea is to push expensive stuff to the GPU. Things like audio, software skinning (in 3.x), image processing (although this is more likely to be limited to startup). The occlusion culling already uses Embree for the expensive stuff which presumably uses SIMD (and likewise for the software lightmap generation).

For instance, sure the scene tree has to update transforms, but the nodes are all over the place in memory (versus say DoD with ECS), so while it is a good idea that e.g. concatenating xforms is using SIMD (which it may even be already), the actual change in performance of doing this may be less than stellar.

That's not to say it isn't useful - that's why I opened #290 , as I suspect exposing this to users is at least as important (if not more) than using it in core.

Anyway yes some profiles of running games would be useful here with some bottleneck points identified. :slightly_smiling_face:

peastman commented 2 years ago

Generally I'd also suggest to identify compute intensive code parts first and then convert them to SIMD. Naively converting all vector calculations to SIMD might have little to no benefit.

I 100% agree with this. Hence the proposal to expand the set of benchmarks. I wouldn't attempt to vectorize any piece of code until we have a benchmark showing it's a significant piece of some important, real world operation.

Whatever code you end up vectorizing though, you ultimately need to start by loading your input data into registers, and you need to end by writing results back to memory. That's the reason I suggested the padding element. It will make those reads and writes more efficient. This will reduce the overhead on every vectorized routine. It will make them faster, and also increase the number of routines that are worth vectorizing.

peastman commented 2 years ago

I wanted to put some real numbers on this, and also to see what's the smallest function that's worth vectorizing. Could a tiny method like Vector3::operator+() benefit? So I wrote the following program to test it. Please forgive the microbenchmark! All such things should be taken with a grain of salt. This is just meant as a proof of concept.

I wrote three versions of a function to add a pair of three component vectors. add1() is written with ordinary scalar code. add2() is vectorized with SSE. It assumes the inputs and outputs only contain three elements, so it has to do the loads and stores in a less efficient way. add3() uses the more efficient intrinsics for loading and storing all four elements. I tried to prevent the compiler from doing unrealistic optimizations. It builds a large table of input vectors in memory (so it can't optimize away the reads and writes), and it prints out the sum of all outputs at the end (so there's no dead code).

Here is the full program.

#include <smmintrin.h>
#include <cmath>
#include <cstdio>
#include <sys/time.h> 

// Scalar reference version.
void add1(float* a, float* b, float* out) {
    out[0] = a[0]+b[0];
    out[1] = a[1]+b[1];
    out[2] = a[2]+b[2];
}

// SIMD version that assumes only three valid elements: loads and stores
// must be assembled element by element.
void add2(float* a, float* b, float* out) {
    __m128 avec = _mm_set_ps(0, a[2], a[1], a[0]);
    __m128 bvec = _mm_set_ps(0, b[2], b[1], b[0]);
    __m128 result = _mm_add_ps(avec, bvec);
    float temp[4];
    _mm_storeu_ps(temp, result);
    out[0] = temp[0];
    out[1] = temp[1];
    out[2] = temp[2];
}

// SIMD version that assumes 16 accessible bytes: one load/store each.
void add3(float* a, float* b, float* out) {
    __m128 avec = _mm_loadu_ps(a);
    __m128 bvec = _mm_loadu_ps(b);
    __m128 result = _mm_add_ps(avec, bvec);
    _mm_storeu_ps(out, result);
}

double getCurrentTime() {
    struct timeval tod;
    gettimeofday(&tod, 0);
    return tod.tv_sec+1e-6*tod.tv_usec;
}

int main() {
    float a[1000][4], b[1000][4];
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 4; j++) {
            a[i][j] = sin(i+0.1*j);
            b[i][j] = cos(i+0.1*j);
        }
    float sum = 0.0;
    float out[4];
    double t1 = getCurrentTime();
    for (int i = 0; i < 100000000; i++) {
        add1(a[i%1000], b[i%1000], out);
        sum += out[0]+out[1]+out[2];
    }
    double t2 = getCurrentTime();
    for (int i = 0; i < 100000000; i++) {
        add2(a[i%1000], b[i%1000], out);
        sum += out[0]+out[1]+out[2];
    }
    double t3 = getCurrentTime();
    for (int i = 0; i < 100000000; i++) {
        add3(a[i%1000], b[i%1000], out);
        sum += out[0]+out[1]+out[2];
    }
    double t4 = getCurrentTime();
    printf("add1: %g\n", t2-t1);
    printf("add2: %g\n", t3-t2);
    printf("add3: %g\n", t4-t3);
    printf("ignore this: %g\n", sum);
    return 0;
}

Here is the result when compiled with -msse2 -O3.

add1: 0.232947
add2: 0.254139
add3: 0.17004

If we're able to use the efficient method for reads and writes, then even a tiny function like this gets a significant speedup. If we have to do reads and writes in the inefficient way, that negates the benefit. Larger functions should have much larger speedups in both cases.

peastman commented 2 years ago

Would it help to consider the pieces of this individually? I assume everyone agrees with the goal of making the core faster. 😀 Here are the specific things I'm proposing.

  1. Add more benchmarks. I assume this also is not controversial!
  2. Adopt an interface that lets us write SIMD code in a clean, portable way. I suggested a particular choice. There are other options we could also consider.
  3. Vectorize pieces of code when and only when the benchmarks show that it helps.
  4. Add a fourth padding element to Vector3. This might be the most controversial piece. It would be done if and only if the benchmarks show it helps.

Any comments/agreement/disagreement about each one of these?

RevoluPowered commented 1 year ago

I think this is a great idea for core, here is my 2 cents though:

Something like this might be good for core/ https://gist.github.com/hi2p-perim/7855506

I might try implementing this for Vector3 as part of this proposal soon, I want this for something related to physics, something to infer at runtime the processor support would be great.

There are also some useful instructions we could use too as most of them have fallbacks on older CPUs : FMADD LZCNT BZCNT

I think for Vector3 we should try a very bare bones implementation and swap out the implementation at runtime

class Vector3 {
  Vector3 add( Vector3 a, Vector3 b ) { return implementation->add(a, b); }
  Vector3 add_many( const Vector3 *vecs, int count ) { return implementation->add_many(vecs, count); }
  static Implementation<Vector3> *implementation; // = detect_platform<Vector3>(), returns the implementation to use
};

class AVXVector3 : Implementation<Vector3> {
#ifdef USE_DOUBLE
  __m256d vector_data; // four doubles need a 256 bit register
#else
  __m128 vector_data;  // four floats fit in a 128 bit register
#endif
  Vector3 add( Vector3 a, Vector3 b ) {
    // TODO: example
  }
};

class NoExtensionVector3 : Implementation<Vector3> {
  float x, y, z;

  // all existing operations from Godot like dot etc. go here
};

Once we have core done we could have a single header library built for abstracting this option at runtime, then we can ask thirdparty/ libs to try to use it perhaps.

RevoluPowered commented 1 year ago

I've done some benchmarks using a patch I made.

First, the patch to enable auto-vectorization: apply this to modules/raycast/SCsub so that the Embree module is not built with -march=native (it selects its own AVX options):

if '-march=native' in env_raycast['CCFLAGS']:
    env_raycast['CCFLAGS'].remove('-march=native')

if '-march=native' in env_raycast['CXXFLAGS']:
    env_raycast['CXXFLAGS'].remove('-march=native')

Then compile the engine with CCFLAGS=-march=native CXXFLAGS=-march=native appended to the scons command.

The results are promising:

Import times (under load, higher performance):

- fast binary (-march=native): 94.09s user 47.32s system 282% cpu 50.054 total
- slow binary (tested twice, variance was ~10 FPS, 125s user time): 133.03s user 40.84s system 345% cpu 50.382 total

A 36 second improvement on 256 cores.

Game performance (under load, higher performance):

- 115 FPS - march=native binary (although both had variances as we are mostly GPU bound)
- 85 FPS - slow binary

Menu performance: (degraded performance)

- 825 FPS - slow binary
- 802 FPS - march=native binary

Calinou commented 1 year ago

Menu performance: (degraded performance)

At such high framerates, frametime differences are minimal and may be due to external factors (such as clocks reducing due to the power budget being exceeded). AVX512 is notorious for forcing downclocking of CPUs in some situations.

RevoluPowered commented 1 year ago

Menu performance: (degraded performance)

At such high framerates, frametime differences are minimal and may be due to external factors (such as clocks reducing due to the power budget being exceeded). AVX512 is notorious for forcing downclocking of CPUs in some situations.

This is great to know, I think we should possibly upstream an option in scons for this so that people can test the waters with AVX but not commit to using it (yet)

I could open a PR and add a flag like native_arch_experimental=yes/no/

It could be a good option for people that use the dev= flag too.

RevoluPowered commented 1 year ago

I have another avenue to try with -ftree-vectorize; I will also try this out.

Calinou commented 1 year ago

Remember that -march=native scopes the compiler optimizations to your current CPU model, and therefore makes the resulting binary pretty useless to distribute to others (it usually won't run). It's meant to be used in HPC scenarios when every cycle matters and you compile your own binaries.

RevoluPowered commented 1 year ago

Remember that -march=native scopes the compiler optimizations to your current CPU model, and therefore makes the resulting binary pretty useless to distribute to others (it usually won't run). It's meant to be used in HPC scenarios when every cycle matters and you compile your own binaries.

Yeah I enable this purely to test the auto-vectorization to see what the differences are before sinking time into writing code for manual SIMD. I think this option might be useful for people using source builds locally on their machine for import times etc.

I think we still need to establish how to even give a user a binary with AVX if we did enable it, especially since compat is important.

Having a portable runtime might be an idea with the different optimisations but requires thought/approval too.

Something like this:

- ./GodotBinary.exe
- runtimes/
  - editor-runtime-compat.bin (no AVX, nothing but -O3/-O2)
  - editor-runtime-amd64-SIMD.bin (specify a denominator for medium-end Intel CPU support with -march)
  - editor-runtime-intel-SIMD.bin (specify a denominator for the medium-end AMD CPU)

Calinou commented 1 year ago

If we go for the route of distributing several binaries, I would just distribute SSE2 (or SSE4.2) and AVX2 binaries (perhaps AVX2 + FMA3 as mentioned below). The number of CPUs that support AVX but not AVX2 is quite low[^1], and you won't be running very demanding games/applications on those given their limited processing power.

[^1]: This is essentially Intel Sandy Bridge, Ivy Bridge and AMD Bulldozer/Piledriver. These CPUs were all released over 10 years ago.

peastman commented 1 year ago

Rather than specifying native it might be better to specify a specific architecture. See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html. You can give a few options, maybe one very old architecture for compatibility, a slightly more recent one for games targeting higher performance, and a very recent one for high end ones targeting the very latest hardware.

It's more than just AVX. Newer architectures have lots of newer instructions. For example, FMA is a really important one. It can make a big difference in speed. It was added in Haswell, introduced in 2013.

RevoluPowered commented 1 year ago

Rather than specifying native it might be better to specify a specific architecture. See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html. You can give a few options, maybe one very old architecture for compatibility, a slightly more recent one for games targeting higher performance, and a very recent one for high end ones targeting the very latest hardware.

It's more than just AVX. Newer architectures have lots of newer instructions. For example, FMA is a really important one. It can make a big difference in speed. It was added in Haswell, introduced in 2013.

Yep, definitely this is what I would do for the "runtime" approach too, as we have to keep compatibility with CPU's that support almost none of these options :)

peastman commented 1 year ago

I think it's the other way around: Zen only supports FMA3, same as Intel. According to that article, FMA3 is in all AMD processors since Piledriver (2012) and all Intel processors since Haswell (2013).

RevoluPowered commented 1 year ago

I think it's the other way around: Zen only supports FMA3, same as Intel. According to that article, FMA3 is in all AMD processors since Piledriver (2012) and all Intel processors since Haswell (2013).

TODO: add compatibility matrix from my own setups!

Have we got an OS.get_system_cpu_extension list somewhere? I might add some kind of printout so we can collect data for those interested in SIMD: SSE/AVX versions, whether instruction X is supported, and so on.

I can check: Intel Mac, EPYC (new and old gen), Ryzen 4100, Ryzen 5600/X.

I'd possibly also need an i3 and a low-end device to check the laptop side. Maybe I'll pick up an old ThinkPad for ultra-cheap.

I'm guessing it's a no so I will make a PR!

Calinou commented 1 year ago

Have we got a OS.get_system_cpu_extension list somewhere? I might make some kind of print so we can collect some data for those interested in SIMD. Just like SSE/AVX/ what versions, is X instruction supported. you get me.

Not that I know of. If you need testing, I have access to a PC running one of those Pentiums that doesn't support AVX/AVX2.

RevoluPowered commented 12 months ago

We should also look at MMX, as this is the first type of extension that we will want for the Intel machines from around the 2017 release date. (source: @Calinou's PC)

peastman commented 12 months ago

MMX was the predecessor to SSE. It's long since obsolete.

What do you think about repeating your benchmarks with a few different values for -march? I'd expect a significant speedup from sandybridge (introduced AVX), and a further speedup from haswell (introduced AVX2 and FMA). I don't know if anything beyond that will make much difference. Newer architectures added AVX512, but that likely won't help much for code that isn't explicitly vectorized.

peastman commented 12 months ago

Newer architectures added AVX512, but that likely won't help much for code that isn't explicitly vectorized.

I take it back. After doing a bit of research, I found that AVX512 doubled the number of registers to 32. That could have a big benefit, even if your code isn't vectorized.

Unfortunately, support for it is really inconsistent. AVX512 was first introduced in server processors, then added to consumer processors, and then removed from the consumer processors again! And now Intel is working on something called AVX10 which is basically all the features of AVX512, but with only 256 bit registers.

Intel's history of vector extensions is such a mess...

Calinou commented 12 months ago

And now Intel is working on something called AVX10 which is basically all the features of AVX512, but with only 256 bit registers.

I believe Zen 4's AVX512 support is similar to that too.

rossbridger commented 2 months ago

Is there an update for this proposal? I do think we can assume at least SSE2 or NEON will always be available.

Calinou commented 2 months ago

Is there an update for this proposal? I do think we can assume at least SSE2 or NEON will always be available.

There was an attempt in godotengine/godot#86340, I suggest reading this comment: https://github.com/godotengine/godot/pull/86340#issuecomment-1866203477

rossbridger commented 2 months ago

I just read the comments. So basically we need a C++ wrapper for SIMD intrinsics, like xsimd.