raygon-renderer / thermite

Thermite SIMD: Melt your CPU
https://raygon-renderer.github.io/thermite/
Apache License 2.0

NOTE: This crate is not yet on crates.io, but I do own the name and will publish it there when ready

Thermite is a WIP SIMD library focused on providing portable SIMD acceleration of SoA (Structure of Arrays) algorithms, using consistent-length1 SIMD vectors for lockstep iteration and computation.

Thermite provides highly optimized feature-rich backends for SSE2, SSE4.2, AVX and AVX2, with planned support for AVX512, ARM/Aarch64 NEON, and WASM SIMD extensions.

In addition to that, Thermite includes a highly optimized vectorized math library with many special math functions and algorithms, specialized for both single and double precision.

1 Within a given instruction set, all vector types are the same width, regardless of element size.

Current Status

Refer to issue #1

Motivation and Goals

Thermite was conceived while working on the Raygon renderer, when it was decided we needed a state-of-the-art, high-performance SIMD vector library focused on facilitating SoA algorithms. Using SIMD for AoS values was a nightmare, constantly shuffling vectors and performing unnecessary horizontal operations. We were also unable to take full advantage of AVX2, as 3D vectors use only 3 or 4 lanes of a regular 128-bit register.

Using SIMDeez, faster, or redesigning packed_simd were all considered, but each has its flaws. SIMDeez is rather limited in functionality, and its handling of target_feature leaves much to be desired. faster fits well into the SoA paradigm, but its iterator-based API is rather unwieldy, and it lacks many features. packed_simd isn't bad, but it is also missing many features and relies on Nightly-only platform intrinsics, which can produce suboptimal code in some cases.

Therefore, the only solution was to write my own, and thus Thermite was born.

The primary goal of Thermite is to provide optimal codegen for every backend instruction set, and to provide a consistent set of features on top of all of them, in such a way as to encourage using chunked SoA or AoSoA algorithms regardless of what data types you need. Furthermore, with the #[dispatch] macro, multiple instruction sets can be easily targeted within a single binary.
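As a rough illustration of the layouts Thermite is built around, the following plain-Rust sketch (no Thermite calls; the type and function names are hypothetical) contrasts AoS with SoA and shows the kind of lockstep per-component loop a SIMD backend can vectorize, processing W lanes per iteration:

```rust
// Array of Structures: one struct per point. SIMD loads from this layout
// require shuffles and waste lanes on 3-component data.
struct PointAoS {
    x: f32,
    y: f32,
    z: f32,
}

// Structure of Arrays: one contiguous array per component. A backend can
// load xs[i..i + W] directly into a W-lane vector with no shuffling.
struct PointsSoA {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}

impl PointsSoA {
    fn from_aos(points: &[PointAoS]) -> Self {
        PointsSoA {
            xs: points.iter().map(|p| p.x).collect(),
            ys: points.iter().map(|p| p.y).collect(),
            zs: points.iter().map(|p| p.z).collect(),
        }
    }

    // Lockstep computation over all components: squared length of every
    // point. A SIMD backend would perform this W elements at a time.
    fn length_squared(&self) -> Vec<f32> {
        self.xs
            .iter()
            .zip(&self.ys)
            .zip(&self.zs)
            .map(|((x, y), z)| x * x + y * y + z * z)
            .collect()
    }
}
```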

Features

Optimized Project Setup

For optimal performance, ensure your Cargo.toml profiles look something like this:

[profile.dev]
opt-level = 2       # Required to inline SIMD intrinsics internally

[profile.release]
opt-level = 3       # Should be at least 2; level 1 will not use SIMD intrinsics
lto = 'thin'        # 'fat' LTO may also improve things, but will increase compile time
codegen-units = 1   # Required for optimal inlining and optimizations

# optional release options depending on your project and preference
incremental = false # Release builds will take longer to compile, but inter-crate optimizations may work better
panic = 'abort'     # Very few functions in Thermite panic, but aborting will avoid the unwind mechanism overhead

Misc. Usage Notes

Cargo --features

alloc (enabled by default)

The alloc feature enables aligned allocation of buffers suitable for reading and writing with SIMD.
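For context, the same effect can be achieved manually with std::alloc; a minimal sketch (plain std, not Thermite's actual API) of allocating a zeroed f32 buffer aligned for 256-bit (AVX) loads and stores:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// Sketch: allocate `len` zeroed f32s aligned to `align` bytes, run `f`
// over the buffer as a slice, then free it. 32-byte alignment permits
// aligned 256-bit (AVX) loads/stores over the buffer.
fn with_aligned_f32s<R>(len: usize, align: usize, f: impl FnOnce(&mut [f32]) -> R) -> R {
    let layout = Layout::from_size_align(len * std::mem::size_of::<f32>(), align)
        .expect("invalid layout");
    unsafe {
        let ptr = alloc_zeroed(layout) as *mut f32;
        assert!(!ptr.is_null(), "allocation failed");
        let result = f(std::slice::from_raw_parts_mut(ptr, len));
        dealloc(ptr as *mut u8, layout);
        result
    }
}
```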

nightly

The nightly feature enables nightly-only optimizations such as accelerated half-precision encoding/decoding.

math (enabled by default)

Enables the vectorized math modules.

rng

Enables the vectorized random number modules.

emulate_fma

Real fused multiply-add instructions are only enabled for AVX2 platforms. However, as FMA is used not only for performance but for its extended precision, falling back to a split multiply and addition will incur two rounding errors, and may be unacceptable for some applications. Therefore, the emulate_fma Cargo feature will enable a slower but more accurate implementation on older platforms.

For single-precision floats, this is most easily done by casting to double precision, performing a separate multiply and add, then casting back. For double precision, it uses an infinite-precision implementation based on libm.
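The single-precision path described above can be sketched in a few lines of plain Rust (a sketch of the idea, not Thermite's implementation):

```rust
// Emulated single-precision FMA via double precision. The f32 * f32
// product is exact in f64 (two 24-bit significands multiply into at most
// 48 bits, well under f64's 53), so only the addition and the final cast
// back to f32 round. In rare edge cases this double rounding can still
// differ from a hardware fused result.
fn fma_f32_emulated(a: f32, b: f32, c: f32) -> f32 {
    ((a as f64) * (b as f64) + (c as f64)) as f32
}
```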

On SSE2 platforms, double-precision emulated FMA may fall back to scalar ops, since the effort needed to make it branchless would likely cost more than it saves. As of this writing it has not been implemented, so benchmarks will reveal what is needed later.