Please support non-2^N SIMD lane counts

rust-lang / rustc_codegen_cranelift

Cranelift based backend for rustc

Apache License 2.0

Please support non-2^N SIMD lane counts #1136

Open workingjubilee opened 3 years ago

workingjubilee commented 3 years ago

The silicon that supports these more or less directly: GPUs already handle Vec3s (typically f32x3) all the time. Arm SVE supports 384-bit vector registers and is available Soon™. The RISC-V vector extension will eventually exist and support arbitrary-width vectors, somewhere, over the rainbow 🌈 someday 🎵...

As far as I could tell, and as described by the author of the vek crate, LLVM's approach when only fixed-width vector registers are available to compile to is similar to the one GPUs use: use the 128-bit registers as usual, but politely ignore the unspecified lanes when the "Vec3" types are loaded and stored.
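To make the "ignore the unspecified lanes" idea concrete, here is a minimal sketch in plain Rust, using arrays rather than real vector registers. The names `F32x3`, `load_padded`, `add_wide`, `store_truncated` and `add_f32x3` are all hypothetical and chosen for illustration; this is not how LLVM or Cranelift actually implement it, only the general shape of the technique: widen to four lanes on load, operate on the full width, and drop the padding lane on store.

```rust
/// Hypothetical 3-lane vector type (illustrative name, not a real API).
#[derive(Copy, Clone, Debug, PartialEq)]
pub struct F32x3(pub [f32; 3]);

/// "Load": widen to four lanes. The fourth lane is padding; its value is
/// never observed, so zero is an arbitrary choice here.
fn load_padded(v: F32x3) -> [f32; 4] {
    [v.0[0], v.0[1], v.0[2], 0.0]
}

/// Lane-wise add over all four lanes, standing in for a single
/// 128-bit vector add. The padding lane is computed and then ignored.
fn add_wide(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}

/// "Store": keep only the three meaningful lanes.
fn store_truncated(wide: [f32; 4]) -> F32x3 {
    F32x3([wide[0], wide[1], wide[2]])
}

/// Add two 3-lane vectors by going through the 4-lane representation.
pub fn add_f32x3(a: F32x3, b: F32x3) -> F32x3 {
    store_truncated(add_wide(load_padded(a), load_padded(b)))
}
```

Calling `add_f32x3` on two values adds the three real lanes and never exposes the fourth. The cost in a real backend is in the load and store steps, which have to avoid reading or writing the memory past the third element.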

Also, the https://github.com/WebAssembly/flexible-vectors/ proposal exists, though it is currently in a fairly nascent state. Still, it is another point in favor of this being desirable long-term, even if it's not immediately needed.

bjorn3 commented 3 years ago

The loads and stores generated by LLVM are pretty inefficient:

#![feature(platform_intrinsics)]
#![feature(repr_simd)]

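// A 3-lane SIMD vector of u8: a non-power-of-two lane count.
#[derive(Copy, Clone)]
#[repr(simd)]
pub struct Foo(u8, u8, u8);

extern "platform-intrinsic" {
    // Lane-wise addition on any repr(simd) type.
    fn simd_add<T>(a: T, b: T) -> T;
}

pub fn add_foo(a: Foo, b: Foo) -> Foo {
    unsafe { simd_add(a, b) }
}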

playground::add_foo: # @playground::add_foo
# %bb.0:
    movq    %rdi, %rax
    movd    (%rsi), %xmm0                   # xmm0 = mem[0],zero,zero,zero
    movd    (%rdx), %xmm1                   # xmm1 = mem[0],zero,zero,zero
    paddb   %xmm0, %xmm1
    movdqa  %xmm1, -24(%rsp)
    movb    -22(%rsp), %cl
    movb    %cl, 2(%rdi)
    movd    %xmm1, %ecx
    movw    %cx, (%rdi)
    retq
                                        # -- End function

If Cranelift won't support them, cg_clif will need to load and store around every SIMD operation, even with maximal inlining, because cg_clif can only keep a type in registers if it is representable as one or two Cranelift values. Everything else is forced to the stack.

bjorn3 commented 3 years ago

https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/non.20power-of-two.20vector.20sizes/near/225603988

workingjubilee commented 3 years ago

Yeah, honestly, I was trying to read LLVM's generated SIMD assembly and went pretty cross-eyed a few times trying to follow all the extra work being done. So while I filed this request, I also very much think it's important that whatever is done does not unduly pessimize the "ordinary" NEON/SSE cases that use e.g. f32x4s, and I appreciate the engineering challenge this makes for.

programmerjake commented 3 years ago

Libre-SOC is planning on adding Cranelift support for SimpleV, our vector extension for OpenPower. It supports all vector lengths from 1 to 64, including non-power-of-2 lengths, as well as dynamically variable vector lengths.