godotengine / godot-proposals

Godot Improvement Proposals (GIPs)

Add single instruction multiple data functionality for GDScript / C# / C++ #290

Open lawnjelly opened 4 years ago

lawnjelly commented 4 years ago

Describe the project you are working on: High computation in gdscript.

Describe the problem or limitation you are having in your project: While gdscript is great for many purposes, it sacrifices performance for portability and ease of use. I have run into this limitation in computationally intensive tasks before.

I came upon this thread, where people discuss the issue: https://www.reddit.com/r/godot/comments/e71g7x/high_performance_code_in_gdscript/

Describe how this feature / enhancement will help you overcome this problem or limitation: One of the fundamental changes that can increase the speed of code (in any language) is to move from a paradigm of functions that operate on a single item of data, to those that operate on multiple data. As well as encouraging a more optimal layout of the data, this often allows compilers to better optimize the procedure. It also offers more opportunities for autovectorization (generation of SSE and Neon SIMD code), which can offer significant speedup.

To distinguish between actual SIMD cpu instructions and the multiple data from one instruction paradigm, I will call the latter 'ranged' functions.
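To make the distinction concrete, here is a minimal C++ sketch (illustrative only, not code from the proposal) of a classic single-item function versus its ranged equivalent:

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// Classic "one call per item" style.
Vec3 add_scalar(Vec3 a, Vec3 b) {
    return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// Ranged style: one call operates on a whole span [from, to).
// The tight, simple loop over contiguous data is what gives the
// compiler the chance to autovectorize into SSE / Neon code.
void add_ranged(Vec3 *dst, const Vec3 *src, Vec3 v, size_t from, size_t to) {
    for (size_t i = from; i < to; ++i) {
        dst[i].x = src[i].x + v.x;
        dst[i].y = src[i].y + v.y;
        dst[i].z = src[i].z + v.z;
    }
}
```

Besides amortizing per-call overhead, the ranged form encourages laying data out contiguously, which helps the cache as well as the vectorizer.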

Show a mock up screenshots/video or a flow diagram explaining how your proposal will work: I therefore made a quick mockup module yesterday in C++, testing an equivalent function against 'normal' gdscript. The functions take the form:

function_name(arguments, range_from, range_to)

const size = 5000000

func TestSIMD():
    var fast_arr = FastArray_4f32.new()
    fast_arr.reserve(size)

    var norm_arr = []

    for i in range(size):
        fast_arr.write(i, Quat(1, 1, 1, 1))
        norm_arr.push_back(Quat(1, 1, 1, 1))

    var before = OS.get_ticks_msec()
    fast_arr.value_add(Quat(1, 1, 1, 1), 0, size)
    var after = OS.get_ticks_msec()

    var q_add = Quat(1, 1, 1, 1)
    var before2 = OS.get_ticks_msec()
    for i in range(size):
        norm_arr[i] += q_add
    var after2 = OS.get_ticks_msec()

    var q = fast_arr.read(0)
    print("result : " + str(q))

    print("timing SIMD " + str(after - before))
    print("timing array " + str(after2 - before2))

The timings on my PC give:

timing ranged 7ms
timing array 1953ms

This is a 279x speed increase. (This is not exactly like for like, as the module is built with -O3 and Godot is built with release_debug, but you get the idea.)

Describe implementation detail for your proposal (in code), if possible: I have made a proof of concept module: https://github.com/lawnjelly/godot-simd

Some of the most commonly used data structures in Godot are the float Vector2, Vector3, and Vector4 (the latter only available as a Quat / Plane etc. currently, rather than as an explicit vec4). These map reasonably onto the 4-wide 32-bit float SIMD registers available on nearly all modern computers. As a result I built the proof of concept to work with this unit, which corresponds to __m128 in SSE.

With a Vector3 this does represent a potential 'waste' of the 4th float; however, this is probably outweighed by the speed gains from alignment. The 4th float is also useful for returning a result; I have used it to store lengths, squared lengths and dot products, for example.

Although some of the specific functions I have added rely on this arrangement, you can also use a SoA (structure of arrays) with the functionality if desired.
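As a rough sketch of the idea (the type name here is hypothetical; the function mirrors the shape of the module's value_add, though the actual implementation may differ), a 16-byte-aligned 4-float element and a ranged add over it could look like:

```cpp
#include <cstddef>

// Hypothetical element mirroring the proof of concept's 4f32 unit:
// a Vector3/Quat-sized value padded to 16 bytes so each element maps
// cleanly onto one SSE register. Not the module's actual API.
struct alignas(16) F32x4 { float v[4]; };

// Ranged add of a constant value over [from, to). Built with -O3,
// a compiler can turn this inner loop into a single SSE add per element.
void value_add(F32x4 *data, F32x4 val, size_t from, size_t to) {
    for (size_t i = from; i < to; ++i)
        for (int j = 0; j < 4; ++j)
            data[i].v[j] += val.v[j];
}
```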

Some notes

I am not proposing that this is a finished product; I am open to ideas on how best to add this type of functionality, down to naming conventions etc. willnationsdev pointed out the difficulty of adding new types to Variant, so I've added a new object type derived from Reference, which seems to work. There may be better ways of passing some of the arguments.

One important consideration may be how to best get data in / out of the system to actually do something with the result. This is a common theme with SIMD. As such I have added proof of concept functions to fill the fast array from PoolVector3 and return the result to a PoolVector3.

If this enhancement will not be used often, can it be worked around with a few lines of script?: Cannot be worked around with script.

Is there a reason why this should be core and not an add-on in the asset library?: I'll probably round this out as a module even if it is decided not to put something like this in core. However, for something so simple, I think it offers a lot of potential, both to people making games, tools, and also potentially for use from other parts of the core.

Functions already included

More timings (for those interested to compare intrinsics)

This time with a PoolVector3Array for gdscript, to make things fairer, comparing sqrt functions over 20,000,000 sqrts using normal gdscript, ranged non-SSE code, and ranged SSE1 code:

timing gdscript 8228ms
timing ranged 86ms
timing SSE 8ms

This is over a 1000x speed increase with SSE. I have looked at the potential for SSE at the same time as doing the ranged functions. Up to SSE2 can be used with no need for CPU detection on 64-bit x86, as it is mandated. With CPU detection we could use AVX512, which would theoretically be up to 4000x faster (although memory speed might become more of a bottleneck).

Things like sqrt and reciprocal sqrt, length calculations and normalization get a lot faster, because SIMD offers slightly less accurate but much faster versions than standard sqrt. Afaik these can only be accessed via intrinsics; I don't think you can get the autovectorizer to generate them for you, as there is a small loss of accuracy.
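For illustration, a sketch of the approximate reciprocal square root via the SSE intrinsic, with a scalar fallback so it compiles on non-x86 targets (this is not the module's code):

```cpp
#include <cmath>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

// Approximate reciprocal square roots for four floats at once.
// _mm_rsqrt_ps trades accuracy (roughly 12 bits of precision) for
// speed; compilers won't emit it automatically because its result
// differs from 1.0f / sqrtf(x).
void rsqrt4(float out[4], const float in[4]) {
#if defined(__SSE__) || defined(_M_X64)
    _mm_storeu_ps(out, _mm_rsqrt_ps(_mm_loadu_ps(in)));
#else
    for (int i = 0; i < 4; ++i)      // scalar fallback for non-x86 targets
        out[i] = 1.0f / std::sqrt(in[i]);
#endif
}
```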

aaronfranke commented 4 years ago

which corresponds to __m128 in SSE.

It's important to also make this work with __m256d for the purposes of https://github.com/godotengine/godot/issues/288 (Vector2 can use __m128d instead, which will work for CPUs without __m256d / AVX2)

It's also worth looking into making this work with Bullet, currently there are some TODOs about SIMD being disabled because it's actually slower for some reason.

lawnjelly commented 4 years ago

which corresponds to __m128 in SSE.

It's important to also make this work with __m256d for the purposes of godotengine/godot#288 (Vector2 can use __m128d instead, which will work for CPUs without __m256d / AVX2)

It's also worth looking into making this work with Bullet, currently there are some TODOs about SIMD being disabled because it's actually slower for some reason.

Sure, this is easy :smile:. I just started with vec4f_32 as it is the most common so far in Godot. Doubles are no problem, nor are the integers. Integers are even more fun; you can get really awesome speedups with vectorization. For instance, with AVX512 (if it has the byte version, not sure offhand) you can do 64 calculations in a single instruction.

I was asking Calinou about the minimum SSE / Neon we support. It seems SSE2 is mandated for x86_64, so compilers will autovectorize provided you tell them to (e.g. -O3). For x86_32 it may be worth specifying the SSE2 flag (I think this may be done in the odd place already).

For any more than this though, we'd need to detect support at runtime. This is fairly easy to put in, and choose the codepath according to the CPU support, rather than rely on autovectorization. However that's getting ahead of ourselves. Last time we mentioned it reduz might need convincing on the use of intrinsics. Having said that, if the basic support for ranged functions is put in, it is fairly trivial to add intrinsics.
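A minimal sketch of that kind of runtime codepath selection, assuming a GCC/Clang x86 target (the function names are hypothetical; a real AVX path would be compiled in a separate translation unit built with -mavx):

```cpp
#include <cstddef>

void add_scalar_path(float *d, const float *s, size_t n) {
    for (size_t i = 0; i < n; ++i) d[i] += s[i];
}

// Stand-in for a separately-compiled AVX implementation; shown here
// as the same loop purely so the sketch is self-contained.
void add_avx_path(float *d, const float *s, size_t n) {
    for (size_t i = 0; i < n; ++i) d[i] += s[i];
}

using AddFn = void (*)(float *, const float *, size_t);

// Pick the best codepath once at startup, based on actual CPU support.
// __builtin_cpu_supports is a GCC/Clang builtin available on x86.
AddFn select_add() {
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx"))
        return add_avx_path;
#endif
    return add_scalar_path;
}
```

The engine would call `select_add()` once and cache the function pointer, so per-call dispatch cost is a single indirect call.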

PetePete1984 commented 4 years ago

You reserved space in the fast array but left the normal array empty, that immediately skews the result. I already experienced significant speedups with a .resize() for any moderate array initialization. push_back would just resize by 1 every time, basically reallocating the entire thing. Not saying it'll be as fast as native code, but faster than this example - most likely.

lawnjelly commented 4 years ago

You reserved space in the fast array but left the normal array empty, that immediately skews the result. I already experienced significant speedups with a .resize() for any moderate array initialization. push_back would just resize by 1 every time, basically reallocating the entire thing. Not saying it'll be as fast as native code, but faster than this example - most likely.

Sorry, I could have made this clearer. The reservation method isn't under test; it isn't timed (or indeed a factor in this proposal). I could equally have used a resize and set rather than push_back, as you say.

The performance timing is wrapped around the mathematical functions under test, in this case addition, with the:

var before = OS.get_ticks_msec()

and

var after = OS.get_ticks_msec()

calls.

In fairness, the basic gdscript array in the example I gave is not contiguous in memory, so I made the timings in the 'more timings' section against a PoolVector, which is contiguous, giving a more level playing field.

I would welcome others trying some timings themselves; you just need to download the module and compile the engine. That reminds me, I may need to tweak the compiler switches in the module SCsub to work under Visual Studio (I only have Linux here).

vnen commented 4 years ago

High computation in gdscript.

This sounds like an oxymoron to me. If you want to have high-performance code you should not be using GDScript in the first place. That's what GDNative/NativeScript is for (using fast C code as a script).

If the engine internal API can be optimized using vectorization and similar tricks, which in turn could be used by GDScript, then it should be fine. But having high-performance constructs for GDScript itself is not worth it IMO.

lawnjelly commented 4 years ago

High computation in gdscript.

This sounds like an oxymoron to me. If you want to have high-performance code you should not be using GDScript in the first place. That's what GDNative/NativeScript is for (using fast C code as a script).

If the engine internal API can be optimized using vectorization and similar tricks, which in turn could be used by GDScript, then it should be fine.

Yes, sorry for not being clearer: these types of functions could be available throughout, for use by C++ modules, gdnative, the main engine, C#, gdscript and any other language bindings via the API. I believe they already are in my demo module; afaik you don't have to do anything special apart from deriving the classes from something like Reference / Node and doing the proper binding. I'm not advocating adding anything specific in gdscript; no changes needed.

I guess my emphasis on gdscript is that, although performance isn't the primary consideration in scripting languages, the use of functions that act on multiple sets of data at once allows them to be as performant as (or in fact more performant than) regular non-SIMD C++ code. Essentially you are moving the bottlenecks into the ranged C++ functions.

Mike Acton has famously done a lot on this subject: https://www.youtube.com/watch?v=rX0ItVEVjHc https://macton.smugmug.com/Other/2008-07-15-by-Eye-Fi/n-xmKDH/i-BrHWXdJ

Of course the advantage of writing high performance stuff in gdscript / C# is that it is cross-platform and doesn't require compiling the engine / gdnative, and is thus easier to deploy and use in games / apps.

realvictorprm commented 3 years ago

If C# is included, this probably won't be limited to C# and will extend to all .NET languages. In that case I would prefer that we speak of .NET support rather than C# support for such features.

nobuyukinyuu commented 2 years ago

It would be nice to have access to intrinsics for my audio project since I'm working mainly in c# and the version supported by the mono runtime (7.2 or netstandard2.1) doesn't seem to have these. I'm surprised how long this proposal has remained open!

lawnjelly commented 2 years ago

I'm surprised how long this proposal has remained open!

It looks like there is demand for this functionality, and I believe there has been a longstanding intention to add some degree of SIMD support (other than compiler autovectorization) in Godot 4. (Often if we get the go ahead for something in Godot 4 we manage to sneak it into Godot 3.x :wink: )

lawnjelly commented 2 years ago

A few more ideas I came up with on ways to implement this and be generally available as usable from the engine core as well as bound:

User Friendliness

There may be a bit of a balance to be struck between making something accessible for beginners to SIMD, and catering to more advanced users and engine contributors. On the whole I suspect this area is going to be the realm of power users.

Mixed Data

One aspect that I left for later in my original mockup was how to handle mixed data. In terms of user friendliness, having a bunch of preset fixed arrays serves as a gentle introduction to SIMD. However, this does limit the usefulness for more advanced stuff, particularly in the engine core, where we might have e.g. mixed vertex formats.

Essentially, the machine doesn't care what a set of data represents; the format is defined simply by what you write into the data, what you do with it, and what you read out. To that end I'm wondering whether, as a more flexible alternative to multiple fixed types, we could have a single FastData type (just using that as a placeholder name).

External Data

It has occurred to me that as well as creating data within FastData, it would be useful if this object could operate on external data as well as internal. This could be as low level as providing a pointer to the source / destination data, but could also potentially be a higher level method - externally referencing another object or resource which holds the data to be operated on.

Thus the FastData could be linked with some (preset?) resource types, such as an Image, and when performing a SIMD operation it could optionally perform housekeeping like locking the image (to make it thread safe), or bounds checking all functions before they are run, checks for alignment etc.

Or perhaps Image etc. could be derived from a special type of resource that had the necessary functionality for linking; that way users could more easily create their own resource types that would also work natively.

Internal Data

For internal data, instead of creating an array of e.g. 4f32, we could have functions to create an array of a number of real_ts (32- or 64-bit depending on how the engine is compiled), allowing it to work seamlessly with 32- or 64-bit real_t builds, or of fixed-width types like f32, i32 etc. This would make common cases easy, while allowing all possible variations as long as we had e.g. a u8 type with which the user could specify a number of bytes.

Operating on Data

Once the data in the FastData is created or linked, it would be totally up to the user how to interact with it, by specifying start offsets in e.g. bytes (or reals) and strides etc to allow flexible mixed formats.
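A sketch of what such an offset-and-stride ranged operation over an untyped blob could look like (all names here are placeholders, not a proposed API):

```cpp
#include <cstddef>
#include <cstdint>

// Freeform FastData-style operation: the blob has no intrinsic type.
// The caller supplies a byte offset and stride describing where the
// floats live inside a mixed (e.g. interleaved vertex) format.
void ranged_scale(uint8_t *blob, size_t byte_offset, size_t stride_bytes,
                  float factor, size_t from, size_t to) {
    for (size_t i = from; i < to; ++i) {
        float *f = reinterpret_cast<float *>(
            blob + byte_offset + i * stride_bytes);
        *f *= factor;   // scale one float per element of the range
    }
}
```

Bounds and alignment checking would be done once per ranged call rather than per element, which is part of where the housekeeping savings come from.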

Extracting Results

This would ideally be coupled with nice ways of getting the final data result out of internal FastData. For instance maybe writing a result to multimesh transforms, or vertex data for uploading, or audio or a tilemap etc.

Introducing SIMD into the engine core

A lot of core contributors have expressed an interest in having SIMD available in some form in the engine core. There will already be autovectorization in some cases, and it would also be possible to add single-use instructions for e.g. Vector3 that directly utilize intrinsics. (There are also options like making Vector3 four elements wide etc., which are discussed elsewhere.)

This can be useful, but it is unlikely to lead to as many performance benefits as directly using ranged instructions imo (for a number of reasons, but essentially outside of tight ranged loops other effects start to become more important like cache misses and housekeeping).

We could possibly end up using both approaches, but hopefully there are a number of areas where changing bottleneck areas to use such ranged functions will lead to significant benefit.

RevoluPowered commented 1 year ago

Linking this here (apologies, the topic is somewhat duplicated): https://github.com/godotengine/godot-proposals/issues/4563#issuecomment-1719835048

Anyway, I really like that this is being thought about. I do think AVX detection should be a runtime check too, which helps this proposal work.

In my comment I propose decoupling the data from the underlying math type and using a runtime static variable to detect which CPU features are enabled and consuming them that way.

During construction it would create the implementation, with the CPU features supported and types for those feature sets. Most libraries do compile-time checks, and as mentioned in this thread these actually aren't good for the user.

A template for swapping out the data of a math type based on the available instruction set definitely seems to solve the usability issue, but we'd need to know if it actually yields gains, as I am sure that some of the AVX instructions rely on passing arrays of the data instead of the high level types.

We'd need to consider an approach for this in the implementation and possibly try making a PR with some demo code for one of the underlying types just to take a small step in the direction of SIMD instead of doing everything.
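One possible shape for such a demo, with construction-time implementation selection behind an abstract interface, might look like this (illustrative names; a real PR would add SIMD paths compiled in their own translation units):

```cpp
#include <cstddef>
#include <memory>

// Abstract math backend: user code never branches on instruction sets.
struct MathImpl {
    virtual ~MathImpl() = default;
    virtual void add(float *d, const float *s, size_t n) const = 0;
};

// Reference implementation, always available.
struct ScalarImpl : MathImpl {
    void add(float *d, const float *s, size_t n) const override {
        for (size_t i = 0; i < n; ++i) d[i] += s[i];
    }
};

// Chosen once, at construction, using whatever CPU-feature query the
// platform provides. A real build would return an SSE/AVX implementation
// here when the features are detected.
std::unique_ptr<MathImpl> make_impl() {
    return std::make_unique<ScalarImpl>();
}
```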

We could support using SIMD from gdnative, but I think using it from gdscript could be possible too provided we can ensure the user checks what cpu features are available with a match and also do this runtime swap in their code of the feature set.

GDScript: for example, the template I proposed could just be a get_supported_instructions() call to OS followed by a match statement. You would then require the types implemented by #4563, which ensures that a feature like AVX is available but still allows you to write the code without the gdscript parser breaking; the user could use _mm256 vs _mm256d or float x,y,z without it caring.

If we decoupled the data from the math types as in my suggestion, we could even make use_double runtime-capable by adding a parent type for real in gdscript (at the expense of declaring a static pointer to the implementation and structure, and declaring a "Real" type instead of a raw float). If the user uses mm256 (float) or mm256d from gdscript and it is unsupported, we can just report that the CPU doesn't support it and assert safely, with the error saying "you didn't check get_supported_instructions()".

The hard part is that static typing from gdscript would require them to use a Variant on their data, which potentially stores mm256 vs mm256d etc.

This mostly means that #4563 and #290 could be part of the same feature set, and would only require us to decouple the core types correctly, expose the vector-specific instructions to gdscript all of the time, and implement an OS.get_supported_instructions() function returning a dictionary.

We could possibly have defines in gdscript but checking a dict value is dead easy.

Example of using avx variable in gdscript:

class_name ExamplePseudo
var data = [0, 0, 0]

func _ready():
    if OS.get_supported_instructions()["avx"]:
        data = mm256(0, 0, 0, 0)

func do_some_operation(input1, input2):
    match OS.get_current_instruction_enum():
        AVX:
            var inputs = concat(input1, input2)
            return _mm256_add_ps(inputs)
        NO_SUPPORT:
            return input1 + input2

Of course, my template is not 100% accurate for the instructions being used, so I will look at this and make it more accurate.

RevoluPowered commented 1 year ago

@fire suggested looking at portable-simd from WASM as this might be a good approach for handling it too. https://github.com/stoklund/portable-simd/blob/master/portable-simd.md

lawnjelly commented 1 year ago

Note: changing the title back to the original. There is a distinction between single instruction multiple data and the acronym "SIMD" which is used to refer to CPU features. Single instruction multiple data is not dependent on CPU features, and still gets a lot of gains over single instruction single data (classic) function calls.

A timely example: the current discussion of a ray casting API that casts a single ray. More efficient is an API that can cast multiple rays in the same call, and return multiple results. This amortizes the cost of the function call etc.
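A sketch of that batched shape (hypothetical names; the ray test here is a trivial ground-plane intersection against y = 0, purely to keep the example self-contained):

```cpp
#include <cstddef>

// Only the y components matter for a ground-plane test.
struct Ray    { float oy, dy; };   // origin.y and direction.y
struct RayHit { bool hit; float t; };

// Batched query: one call casts many rays and fills many results,
// amortizing per-call overhead (binding, locking, validation, etc.).
void cast_rays(const Ray *rays, RayHit *out, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        float dy = rays[i].dy;
        if (dy >= 0.0f) {                 // pointing away from the plane
            out[i] = RayHit{false, 0.0f};
            continue;
        }
        out[i] = RayHit{true, -rays[i].oy / dy};  // distance to y = 0
    }
}
```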

lawnjelly commented 1 year ago

Anyway, I really like that this is being thought about. I do think AVX detection should be a runtime check too, which helps this proposal work. In my comment I propose decoupling the data from the underlying math type and using a runtime static variable to detect which CPU features are enabled and consuming them that way. During construction it would create the implementation, with the CPU features supported and types for those feature sets. Most libraries do compile-time checks, and as mentioned in this thread these actually aren't good for the user.

The example library does this, see: https://github.com/lawnjelly/godot-lsimd/blob/master/lsimd/simd_cpu.cpp

A template for swapping out the data of a math type based on the available instruction set definitely seems to solve the usability issue, but we'd need to know if it actually yields gains, as I am sure that some of the AVX instructions rely on passing arrays of the data instead of the high level types.

My refined suggestion was to have it operate freeform on a data blob: https://github.com/godotengine/godot-proposals/issues/290#issuecomment-1129697731

For SIMD to work most efficiently you need to structure the data in a SIMD friendly manner yes, so you can make best use of cache / linear reads / avoid shuffles etc.

We could support using SIMD from gdnative, but I think using it from gdscript could be possible too provided we can ensure the user checks what cpu features are available with a match and also do this runtime swap in their code of the feature set.

Again, unnecessary if all this can be handled transparently to the user.

If we decoupled the data from the math types as in my suggestion, we could even make use_double runtime-capable by adding a parent type for real in gdscript (at the expense of declaring a static pointer to the implementation and structure, and declaring a "Real" type instead of a raw float). If the user uses mm256 (float) or mm256d from gdscript and it is unsupported, we can just report that the CPU doesn't support it and assert safely, with the error saying "you didn't check get_supported_instructions()".

I'm not sure what this means. SIMD instructions can work on raw data, with either an aligned or unaligned load. There's no requirement for it to be stored in a struct such as __m128; that type only corresponds to the registers the data is loaded into.
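For instance, an unaligned SSE load can operate directly on a plain float array, with no __m128-typed storage involved (sketch with a portable fallback for non-x86 targets):

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

// Adds four floats from plain arrays. The data layout stays ordinary
// float storage; __m128 only exists transiently in registers.
// _mm_loadu_ps tolerates any alignment (aligned data can use the
// faster _mm_load_ps instead).
void add4(float *dst, const float *a, const float *b) {
#if defined(__SSE__) || defined(_M_X64)
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    for (int i = 0; i < 4; ++i) dst[i] = a[i] + b[i];  // portable fallback
#endif
}
```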

In your example @RevoluPowered , why should the gdscript user care whether this uses AVX or SSE or reference under the hood? Is there any need to expose this complexity to the user? Surely a wrapper can do all of this, as this proposal suggests?

func do_some_operation(input1, input2):
    match OS.get_current_instruction_enum():
        AVX:
            var inputs = concat(input1, input2)
            return _mm256_add_ps(inputs)
        NO_SUPPORT:
            return input1 + input2

rossbridger commented 2 months ago

I think xsimd is a good fit to be incorporated into Godot. It's a template library that abstracts SIMD instruction sets and allows runtime detection of the best available SIMD instruction set.

Mantissa-23 commented 2 months ago

Sorry if this is too off-topic, but I feel like a reasonable "workaround" for those who need this now might be using the SIMD-accelerated .NET types. I have a feeling the overhead of converting between Godot and .NET types might make it not worth it, though.

Ivorforce commented 2 weeks ago

Hi, there's some impressive stuff in your proof of concept! I've just made one of my own - a project that basically attempts to be NumPy for Godot. You can find it here: https://github.com/Ivorforce/NumDot/tree/main

My proof of concept uses xtensor, which itself is accelerated by xsimd (the same xsimd as linked by @rossbridger). I have currently achieved up to a 30x speedup compared to gdscript itself; some performance is definitely lost by supporting variable data types, but I'm not sure the code is even fully SIMD-accelerated yet. It might be possible to get more out of it once I get around to checking that everything is actually set up correctly for xsimd.

It wasn't super complicated to bridge the interfaces into a GDExtension, so the approach might be an option going forward. I definitely support Godot adopting some SIMD capabilities where they most benefit speeds in the engine itself, though whether ndarrays should be supported in the core project is a harder sell.