rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust

Introduce "dynamic swizzling" into LLVMIR and Rust intrinsics #242

Open workingjubilee opened 2 years ago

workingjubilee commented 2 years ago

There is a common instruction that performs what we refer to as a "swizzle": a lookup-table indexing into another vector where the indices are variable, runtime-determined values (also known as a "shuffle"). It is available on almost all the architectures we support, yet there is no way to express it portably in LLVMIR.
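For concreteness, here is a minimal scalar model of the operation (the name `dyn_swizzle` is hypothetical, and the zero-fill for out-of-range indices follows the wasm convention; other targets differ):

```rust
// Hypothetical scalar model of a "dynamic swizzle": each output lane is a
// runtime-selected lane of `table`. Out-of-range indices produce zero here,
// matching wasm's i8x16.swizzle; pshufb and vtbl have their own conventions.
fn dyn_swizzle<const N: usize>(table: [u8; N], idx: [u8; N]) -> [u8; N] {
    let mut out = [0u8; N];
    for (o, &i) in out.iter_mut().zip(&idx) {
        *o = table.get(i as usize).copied().unwrap_or(0);
    }
    out
}
```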

Nonetheless, the logic for lowering this to target-specific instructions should already be upstream in LLVM, in the form of the lowering for the wasm "dynamic swizzle" instruction. Since we would like to use it in our API directly, that lowering should be generalized and made available for all platforms, as functionally all platforms have a reasonable equivalent (including x86, once you consider SSSE3 and pshufb; e.g. x86 Macs have it inherent in the target, as would an x86-64-v3 target). Unfortunately, working in C++ is challenging to begin with, and LLVM's dialect of it is even more arcane.

But we could also introduce this on our own side, before any movement is seen in LLVM, by choosing our own lowerings to LLVMIR using target-specific intrinsics or a generic scalar lookup-table pattern. That is the worst answer for x86 compilation, however, and ideally we would just use an LLVMIR intrinsic. But at least Cranelift should find adding this logic easy, as it is tilted towards serving wasm JIT compilation and this IS a wasm instruction.

There was a relevant Zulip conversation here.

LLVM-side

Rust-side

workingjubilee commented 2 years ago

It should be noted that this can also be seen as a weakening of the shufflevector instruction to accept a non-constant ("register") argument. However, an instruction is more deeply embedded in LLVM's logic, and altering one may require a change to the LLVM "bitcode" format, so alterations to an instruction are less likely to be accepted.

Thus, it is more likely to be accepted if defined as an LLVM intrinsic function, but the distinction isn't terribly important from our perspective.
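For contrast, a sketch of today's constant-index swizzle, using portable-simd's `simd_swizzle!` macro (assuming the current nightly `std::simd` API):

```rust
#![feature(portable_simd)]
use std::simd::{simd_swizzle, Simd};

// The index list is a const array baked into the operation, exactly like
// shufflevector's constant mask; it cannot come from a runtime value.
fn reverse4(v: Simd<u32, 4>) -> Simd<u32, 4> {
    simd_swizzle!(v, [3, 2, 1, 0])
}
```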

Arguably, it is also an instance of llvm.masked.gather.* but loading from a register instead of memory. However, using that would involve storing, gather-loading, and then hoping mem2reg magically has an opt that cleans up after us and produces a pshufb or vtbl. That's... quite a bit more magical than I would like.
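To make the criticism concrete, a sketch of that store-then-gather shape using portable-simd's gather API (nightly; `Simd::gather_or_default` assumed to behave as documented):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

// Spill the vector to the stack, then gather-load it back with runtime
// indices. Recovering a pshufb/vtbl from this requires the optimizer to
// see through the memory round-trip, which is the "magic" objected to above.
fn swizzle_via_memory(v: Simd<u32, 8>, idx: Simd<usize, 8>) -> Simd<u32, 8> {
    let spilled: [u32; 8] = v.to_array();
    Simd::gather_or_default(&spilled, idx)
}
```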

workingjubilee commented 2 years ago

It seems the GCC backend can already do this essentially "as-is", so we might as well aim to implement the intrinsic on the Rust side first so that cg_gccjit can implement it as well. We also ought to start drawing intrinsics into cg_ssa.

programmerjake commented 2 years ago

one important subset of dynamic swizzling that we should probably have separate operations for is compress/expand: because of their requirements of not duplicating elements and not reordering them, they are generally quite hard for a compiler to detect, afaict. They can use more efficient instructions on some architectures (risc-v has a reg-to-reg vcompress instruction; for SimpleV, compress/expand can be done as part of most unary instructions), and they take their element-selection input as a mask rather than a vector of indexes.
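To pin down the semantics being asked for, hedged scalar models of both operations (the names and the zero-fill choice for unselected lanes are illustrative):

```rust
// Hypothetical scalar models: selection is driven by a mask, relative order
// is preserved, and no element is duplicated. Unset lanes are zero-filled.
fn compress<const N: usize>(v: [u32; N], mask: [bool; N]) -> [u32; N] {
    let mut out = [0u32; N];
    let mut j = 0;
    for i in 0..N {
        if mask[i] {
            out[j] = v[i]; // selected elements pack down toward lane 0
            j += 1;
        }
    }
    out
}

fn expand<const N: usize>(v: [u32; N], mask: [bool; N]) -> [u32; N] {
    let mut out = [0u32; N];
    let mut j = 0;
    for i in 0..N {
        if mask[i] {
            out[i] = v[j]; // consecutive source elements scatter to set lanes
            j += 1;
        }
    }
    out
}
```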

programmerjake commented 2 years ago

llvm has intrinsics for the combined load/store forms (llvm.masked.expandload and llvm.masked.compressstore), but doesn't yet have compress/expand as separate reg-to-reg ops.

jhorstmann commented 2 years ago

> one important subset of dynamic swizzling that we should probably have separate operations for is compress/expand: because of their requirements of not duplicating elements and not reordering them, they are generally quite hard for a compiler to detect, afaict

I was working on a prefix sum algorithm yesterday and was surprised that llvm was actually turning some of my permutes into expand instructions.

The pattern that was optimized looked like this:

```rust
// Shifts the input up by two lanes: result lane i takes input lane i - 2,
// and the writemask zeroes lanes 0 and 1.
_mm512_maskz_permutexvar_epi32(
    0b1111_1111_1111_1100,
    _mm512_set_epi32(13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0),
    input,
)
```

The optimization is probably x86-specific, and it didn't happen for all permutes following that pattern. That might be because expand has a bit higher latency on x86.

Regarding the "portable" branding, I'm wondering how portable swizzles with vectors larger than 128 bits actually are. AVX2 AFAIK only has a 256-bit swizzle for i32/f32 lanes (which can also be used to emulate i64/f64 swizzles), and even with AVX512 you need the Cannonlake/Icelake generation for byte and word swizzles wider than 128 bits. ARM Neon is AFAIK also limited to 128-bit swizzles; I don't know the support status of SVE.

With that in mind, would it be reasonable to only "portably" support swizzles on 128-bit vectors?

workingjubilee commented 2 years ago

> With that in mind, would it be reasonable to only "portably" support swizzles on 128-bit vectors?

That's not necessarily what is best, actually. If an "LLVM vector" is larger than what is effective as a "machine vector", LLVM is allowed to use that information to improve its scheduling as it interlaces the multiple machine instructions needed to satisfy the request. The limit only makes sense if you see LLVM instructions as mapping 1:1 to machine instructions, but that was never the case.

And from the Rust perspective, this just adds another painful predicate that needs to be guaranteed in the source, with not much benefit if the programmer was just going to repeat the operation over multiple 128-bit segments anyway.

The size limits we currently have in place on vectors are more an artifact of LLVM emitting compilation errors at larger sizes, and of rustc not having the full generics capability we would like for expressing a more fluent boundary.

programmerjake commented 2 years ago

Imho the object of portable-simd isn't to support just what's widely available as a single instruction, but closer to what's available on at least a few CPUs (or that we otherwise deem important enough) and that llvm can produce correct code for basically everywhere, even if it isn't a single SIMD instruction.

FallingSnow commented 1 year ago

Is there a workaround for getting a dynamic shuffle, or is the best option to use runtime detection and _mm_shuffle_epi8, _mm256_shuffle_epi8, _mm512_shuffle_epi8, vqtbl1q_u8, vec_perm, or __builtin_shuffle?
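For instance, something along these lines for the SSSE3 path (a hedged sketch; the function names are illustrative, not library API):

```rust
// Hypothetical multiversioning workaround: dispatch at runtime to
// _mm_shuffle_epi8 when SSSE3 is available, otherwise fall back to a
// scalar lookup that mirrors pshufb semantics. x86-64 only.
#[cfg(target_arch = "x86_64")]
fn dyn_swizzle_16(bytes: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    if is_x86_feature_detected!("ssse3") {
        unsafe { dyn_swizzle_16_ssse3(bytes, idx) }
    } else {
        let mut out = [0u8; 16];
        for (o, &i) in out.iter_mut().zip(&idx) {
            // pshufb semantics: a set high bit zeroes the lane;
            // otherwise the low 4 bits select a byte of the table.
            *o = if i & 0x80 != 0 { 0 } else { bytes[(i & 0x0f) as usize] };
        }
        out
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn dyn_swizzle_16_ssse3(bytes: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    use core::arch::x86_64::{
        __m128i, _mm_loadu_si128, _mm_shuffle_epi8, _mm_storeu_si128,
    };
    let a = _mm_loadu_si128(bytes.as_ptr() as *const __m128i);
    let b = _mm_loadu_si128(idx.as_ptr() as *const __m128i);
    let mut out = [0u8; 16];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_shuffle_epi8(a, b));
    out
}
```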

workingjubilee commented 1 year ago

Wow, uh, after opening this issue... things became very busy in my life. But I'm back to the vector mines! And I decided to start things off in a slightly more roundabout way: in https://github.com/rust-lang/portable-simd/pull/334 I have introduced a demo of byte-level dynamic swizzling for "one vector of bytes, one vector of index bytes" on wasm, AArch64, and x86 (covering the SSSE3, AVX2, and AVX512VBMI feature levels), using "library code" (a pile of intrinsics).

The way I implemented the AVX2 version illuminates a path forward for more "arbitrary" implementations. It isn't the best codegen to be quite honest, but I looked at the scalar version and... woof. Still winning. In fact, the performance could probably get better if I went behind LLVM's back entirely, whipped out asm!, and hand-picked the instructions, but I want to have benches for that before I start in on it.

My intention is to introduce the intrinsic in Rust and have a desugaring step in our backend that does essentially what my library version does, hitting LLVM's "target intrinsics". Then, having written the code into our codegen, I'll try to port that from Rust to "LLVM C++".

So, @FallingSnow, the answer is that soon enough it'll be available as a function in our library. You'll still want to multiversion it, though.