rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust
Apache License 2.0

Support non-power-of-two vector lengths. #63

Closed calebzulawski closed 2 months ago

calebzulawski commented 3 years ago

In https://github.com/rust-lang/rust/pull/80652 I disabled non-power-of-two vector lengths to deconflict stdsimd and cg_clif development. Power-of-two vectors are typically sufficient, but there was at least one crate using length-3 vectors (https://github.com/yoanlcq/vek/issues/66). It would be nice to reenable these at some point, which would require support from cranelift. Perhaps a feature request should be filed with cranelift?

cc @bjorn3

thomcc commented 3 years ago

TBH I worry that supporting this will cause way more pain by complicating the semantics compared to the value in supporting the use cases.

  1. Concretely, with this we'd have explicit always-padding/uninitialized/undefined fields in some of the simd types.

    This feels dangerous, like it will lead to (likely latent) UB in people's code, the same way structs with padding bytes are a headache in low-level code, which I'd generally expect a lot of SIMD code to be.

  2. The underlying intrinsics will still read these padding/uninit fields and mix them into the result in ways that might be problematic.

    Consider comparing two vec3s and producing an integer bitmask.

    The ideal result would probably be the least significant 3 bits set appropriately and the rest zero. But the comparison would produce a mask with one uninitialized field, and conversion to an integer would produce 0b0000_UZYX, where U is uninitialized.

    That is to say, there's an undef bit, and since (at least) LLVM doesn't track this stuff at the bit level, the whole byte is undef... Maybe simd_bitmask could be implemented in a way that fixes this, but I'd be very concerned: because the real operations have to read the bit in question, it seems hard to guarantee robustly.

  3. I suspect these would be semantically second-class still. That is:

    There are several behaviors that are currently consistent across all vectors but would not be consistent for these. Examples: size being easily computable from the element count, alignment, and repr(simd) types being soundly transmutable to/from bytes...

    Several questions have the answer "it behaves like the next power of two up", which suggests the actual behavior should probably be defined that way too. For example: everything about layout again, simd_ffi, asm, ...

  4. What if there's an arch in the future that ends up modeling non-POT vectors and then doesn't map well to the "uses next power of two up"? I think this is unlikely, but I think we should shy away from defining behavior if we can avoid it.

  5. While it could make programming easier for this user, I strongly suspect that it will cause many future bugs for other users, as people make assumptions about the size being determined by the element count, assume there are no padding bytes, ... (basically all the weird questions in 3)

  6. It seems to me like this wouldn't actually help vek's code migrate to core::simd very much anyway, would it? repr(simd) isn't stabilizing with core::simd?

    If it isn't: to use core_simd, vek would almost certainly wrap our types to define its own functionality, at which point it could hide the 4th field of the f32x4 from its public vec3.

    vek could even keep field access by doing the fields-only-exist-via-{Deref,DerefMut} trick: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=6b1799bb4796514d2a6823c519e71bb1

    If it is, I think a lot of design work has to be done on it that hasn't been begun.
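The bitmask concern in point 2 can be illustrated with a scalar sketch (hypothetical names; a plain 4-element array stands in for a padded vec3, so this only models the semantics, not any intrinsic):

```rust
// Hypothetical sketch: a vec3 stored in a padded 4-lane array. The
// comparison produces one mask bit per lane, including the padding lane,
// so a robust bitmask conversion must clear that bit explicitly.
fn vec3_eq_bitmask(a: [f32; 4], b: [f32; 4]) -> u8 {
    let mut bits = 0u8;
    for i in 0..4 {
        if a[i] == b[i] {
            bits |= 1 << i;
        }
    }
    // Without this mask, bit 3 reflects whatever the padding lanes held.
    bits & 0b0111
}

fn main() {
    // The padding lanes (index 3) deliberately disagree.
    let a = [1.0, 2.0, 3.0, 0.0];
    let b = [1.0, 2.0, 9.0, 7.0];
    assert_eq!(vec3_eq_bitmask(a, b), 0b0011);
}
```

The explicit `& 0b0111` is exactly the kind of cleanup step that would have to be guaranteed somewhere for a real `simd_bitmask` lowering.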

That's all to say: I'm kind of... pretty strongly opposed. If we do this it should be a decision made carefully and weighing all these things (and perhaps more) against the benefits.


One note is: if #[repr(simd)] is staying unstable forever, I think it would be fine to support this if it's unobtrusive to a compiler. My objections center on:

  1. it being bad and weird in the core::simd API, and
  2. it being semantically bizarre and the kind of thing that seems better avoided than specified.

Neither of these applies to unstable features, about which I don't have strong feelings.
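The "fields only exist via Deref" trick linked in point 6 can be sketched roughly like this (the names `Xyz` and `Vec3` are hypothetical, and the pointer cast is only sound under the stated layout assumptions):

```rust
use core::ops::{Deref, DerefMut};

// A repr(C) view type whose leading layout matches the first three
// elements of [f32; 4] (same alignment as f32, size 12 <= 16).
#[repr(C)]
pub struct Xyz {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

// The wrapper stores 4 lanes, but only exposes 3 fields; the 4th lane
// is private padding hidden from the public API.
pub struct Vec3([f32; 4]);

impl Deref for Vec3 {
    type Target = Xyz;
    fn deref(&self) -> &Xyz {
        // Sound only because Xyz is repr(C) with f32 alignment and its
        // fields overlay the first three lanes exactly.
        unsafe { &*(self.0.as_ptr() as *const Xyz) }
    }
}

impl DerefMut for Vec3 {
    fn deref_mut(&mut self) -> &mut Xyz {
        unsafe { &mut *(self.0.as_mut_ptr() as *mut Xyz) }
    }
}

fn main() {
    let mut v = Vec3([1.0, 2.0, 3.0, 0.0]);
    v.x += 10.0; // field access through DerefMut
    assert_eq!((v.x, v.y, v.z), (11.0, 2.0, 3.0));
}
```

This keeps `.x`/`.y`/`.z` field syntax while the padding lane never appears in the public type.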

calebzulawski commented 3 years ago

In my opinion, if you are doing things that depend on the layout of a type, you need to ensure that you are properly handling that layout. The vector types already impose architecture-specific alignment implications; I don't think padding implications are particularly more onerous. If you want to determine the size of a type, you should use core::mem::size_of and not make assumptions.
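This point can be demonstrated on stable Rust with a hypothetical stand-in for a padded length-3 vector (the `PaddedVec3` name and its `repr` are illustrative, not the actual portable-simd layout):

```rust
use core::mem::{align_of, size_of};

// Hypothetical stand-in for a padded length-3 vector: three f32 lanes
// carried in a 16-byte-aligned container.
#[repr(C, align(16))]
struct PaddedVec3([f32; 3]);

fn main() {
    // The naive "3 lanes * 4 bytes = 12 bytes" assumption does not hold:
    // alignment rounds the size up to 16, so size_of is the only reliable
    // way to ask.
    assert_eq!(size_of::<PaddedVec3>(), 16);
    assert_eq!(align_of::<PaddedVec3>(), 16);
}
```

Any code computing sizes from the element count alone would be off by the padding lane here.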

I also think it should be pointed out that LLVM entirely supports non-power-of-two vectors (ignoring floating-point exceptions). The limitation here really just comes from cranelift.

Overall I don't think this is about vek; vek only illustrates that there is a desire for non-power-of-two vector lengths. I'm a little confused why it would be weird in the std::simd API: as far as the API is concerned, I believe it should be completely length-agnostic.

I do think this is a non-critical issue though, and can definitely be ignored for the time being.

programmerjake commented 3 years ago

@thomcc

  1. What if there's an arch in the future that ends up modeling non-POT vectors and then doesn't map well to the "uses next power of two up"? I think this is unlikely, but I think we should shy away from defining behavior if we can avoid it.

both of RISC-V V and SimpleV natively support non-power-of-2 vector lengths; for SimpleV, the in-memory layout is the same as an array with the same length.

thomcc commented 3 years ago

both of RISC-V V and SimpleV natively support non-power-of-2 vector lengths; for SimpleV, the in-memory layout is the same as an array with the same length.

This is a far more compelling reason to support them IMO.

workingjubilee commented 3 years ago

Graphics code usually defers to the GPU, which has subtle but important differences from a typical SIMD register, but it is still useful to consider for our purposes, as a comparison point if nothing else. Observationally, GPUs handle Vec3 inputs by simply using a Vec4 in execution with a dummy value set in the 4th lane, but only using the first 3 lanes otherwise. The savings are basically in space on disk and in RAM, at the cost of complicating marshalling the data into the stream processors.

That a rather large number of programmers are basically okay with this quirk, and manage just fine without anything going particularly wrong (though there's some debate over whether they should just shove something into the spare element slot and get fully aligned accesses when the data is on the CPU), seems to suggest we'll be okay.

Arm SVE is even more imminent: variable-lane silicon that will be available on servers sometime Soon™ (it's probably already available in HPC land, but I mean for mere mortals who "just" have a big server), and it can also land on non-power-of-2 vector bitwidths, though admittedly always a multiple of 128 bits.

workingjubilee commented 3 years ago

Filed https://github.com/bjorn3/rustc_codegen_cranelift/issues/1136 for now.

Also filing that reminded me that https://github.com/WebAssembly/flexible-vectors/ exists.

bjorn3 commented 3 years ago

Glam needs non-power-of-two vectors on SPIR-V.

XAMPPRocky commented 3 years ago

Hey, just wanted to note that the restriction introduced in https://github.com/rust-lang/rust/pull/80652 breaks a lot of code for the rust-gpu spirv-unknown-unknown target, and prevents us from upgrading our compiler to the latest nightly, as SPIR-V has length-3 vectors (x, y, z), which we use through glam. In Vulkan these are defined as having a size of 12 and an alignment of 16.

The stated reason for the restriction is the lack of support in Cranelift, however this isn't a concern for us, as our backend does support non-power-of-two SIMD types. Would it be possible to re-enable it just for the rustc_codegen_spirv backend or when compiling for the spirv-unknown-unknown target? This would allow us to upgrade our code on the latest nightly, while keeping the restriction for cranelift.

calebzulawski commented 3 years ago

Perhaps we should revert the power-of-two restriction (not the entire patch as it fixes an ICE) and just allow cg_clif to error if it encounters one? There doesn't seem to be an obvious solution other than keeping #[repr(simd)] perma-unstable (and a little hacky?) pending cranelift support. @workingjubilee suggested GPU targets as a potential user of non-power-of-two vectors from the start and it didn't seem to take long to run into that...

workingjubilee commented 3 years ago

I agree. Is cranelift only used for wasm targeting right now?

bjorn3 commented 3 years ago

Cranelift doesn't have a wasm backend. It does have a wasm frontend, which is used by, for example, Wasmtime. cg_clif, another Cranelift frontend, targets x86_64 and, in the future, AArch64.

thomcc commented 3 years ago

In Vulkan these are defined as having a size of 12 and an alignment of 16

This is going to be tricky, since rust requires size >= alignment...

There's some discussion about it here https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Data.20layout.20.2F.20ABI.20expert.20needed, but it's possible another thread should be created if we need to support this.

workingjubilee commented 3 years ago

@thomcc Checking up on that thread, but where does Rust require size >= alignment?

thomcc commented 3 years ago

It's required by array layout. Each element is required to be size_of::<T>() apart, but if size < alignment, this would result in unaligned items.
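The stride argument can be checked directly on stable Rust; the `Align16` type below is a hypothetical example of an over-aligned element whose size gets rounded up so array elements stay aligned:

```rust
use core::mem::size_of;

// Array layout in Rust: consecutive elements are exactly size_of::<T>()
// bytes apart, so if size were allowed to be less than alignment, later
// elements would end up misaligned.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Align16([f32; 3]); // 12 bytes of data, size rounded up to 16

fn main() {
    let arr = [Align16([0.0; 3]); 2];
    let p0 = &arr[0] as *const Align16 as usize;
    let p1 = &arr[1] as *const Align16 as usize;
    // The stride equals the rounded-up size, keeping every element
    // 16-byte aligned.
    assert_eq!(p1 - p0, size_of::<Align16>());
    assert_eq!(p1 - p0, 16);
}
```

A size-12/align-16 type would need a stride of 16 anyway, which is exactly the tension with Vulkan's vec3 layout.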

workingjubilee commented 3 years ago

Hmm. I know that GPUs will execute over Vec3s with some multiple of 4 lanes anyways, but does code shuffling around such data en route to actual execution regularly assume the stride is 12 and not 16?

thomcc commented 3 years ago

It's required by array layout. Each element is required to be size_of::<T>() apart, but if size < alignment, this would result in unaligned items.

Apparently this isn't actually guaranteed, but I think it would probably require an RFC to change (and plausibly would break a lot of unsafe code in the wild — or rather, turn currently-sound interfaces to unsound-if-stride-and-size-dont-match ones)

programmerjake commented 3 years ago

Hmm. I know that GPUs will execute over Vec3s with some multiple of 4 lanes anyways, but does code shuffling around such data en route to actual execution regularly assume the stride is 12 and not 16?

That's often not true anymore (though it usually was in the distant past, for OpenGL 2.x hardware); AMD's GPUs convert all vector operations to scalar operations on each SIMT thread, IIRC. So a vec3 will just become 3 separate f32 variables.

thomcc commented 3 years ago

though it was usually true in the distant past, for OpenGL 2.x hardware

Hmm, my understanding is that it wasn't fully true here either, but it's been a while. But yes, what you're saying is definitely accurate for modern stuff. The float3/vec3/float4/vec4 types are just abstractions that don't model the actual hardware necessarily (the actual hardware is much much wider).

workingjubilee commented 3 years ago

Oh wow! That's absolutely wild.

programmerjake commented 3 years ago

So, my understanding is that for modern AMD GPUs, vectors in SPIR-V translate like this:

+-----------------------+
|      SPIR-V vec3      |
+-------+-------+-------+
| lane0 | lane1 | lane2 |
+-------+-------+-------+----+---------+
| 0.0   | 1.0   | 2.0   | 0  |         |
+-------+-------+-------+----+         |
| 3.0   | 4.0   | 5.0   | 1  |         |
+-------+-------+-------+----+         |
| 6.0   | 7.0   | 8.0   | 2  |         |
+-------+-------+-------+----+ HW lane |
| ...   | ...   | ...   |    |         |
+-------+-------+-------+----+         |
| 186.0 | 187.0 | 188.0 | 62 |         |
+-------+-------+-------+----+         |
| 189.0 | 190.0 | 191.0 | 63 |         |
+-------+-------+-------+----+---------+

So an add of 2 vec3 values would translate to AMDGPU assembly roughly like the following. GLSL source:

vec3 a, b, r;
r = a + b;

Assembly (made up mnemonics):

add.f32.vec64 r0, a0, b0
add.f32.vec64 r1, a1, b1
add.f32.vec64 r2, a2, b2

kiljacken commented 3 years ago

To add to the above: modern shader compilers (and other graphics workloads, e.g. raytracers) tend to use their lanes as pseudo-"threads", masking for control flow. Most work done for graphics is exactly the same for several pixels/vertices/etc., so you get much better use out of the lanes even for scalar calculations.

It also lets you abstract away the width of the hardware SIMD unit much more effectively while keeping your code path looking mostly like plain scalar code.
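The lanes-as-threads idea can be sketched in scalar Rust (a hypothetical illustration: both branches are evaluated for every lane, and a per-lane mask selects the result, replacing divergent control flow with data selection):

```rust
// Scalar sketch of per-lane masking: compute "if v < 0.0 { -v } else { v }"
// for all lanes without any per-lane branching on the hot path.
fn masked_abs(values: [f32; 4]) -> [f32; 4] {
    let mask: [bool; 4] = values.map(|v| v < 0.0); // the "if" condition, per lane
    let negated = values.map(|v| -v);              // "then" branch, all lanes
    let mut out = values;                          // "else" branch, all lanes
    for i in 0..4 {
        if mask[i] {
            out[i] = negated[i]; // per-lane select
        }
    }
    out
}

fn main() {
    assert_eq!(masked_abs([-1.0, 2.0, -3.0, 4.0]), [1.0, 2.0, 3.0, 4.0]);
}
```

Real SIMD/SIMT hardware does the final select with a blend or predicated write rather than a scalar loop, but the dataflow is the same.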

workingjubilee commented 3 years ago

We determined at the meeting yesterday that breaking the SPIR-V target that others were using is an anticipated but not desirable side effect, so we will figure out how to revert this in rustc and handle the problem where necessary on the cg_clif side.

programmerjake commented 2 years ago

LLVM is gaining code that allows vectors to be aligned to the element's alignment, rather than to the rounded-up size of the vector: https://lists.llvm.org/pipermail/llvm-dev/2021-December/154192.html

workingjubilee commented 2 years ago

Doesn't seem to have much traction yet. This thread has a higher speculation-to-information ratio than I would like. How does LLVM currently lower length-3 vectors for targets like x86?

programmerjake commented 2 years ago

On x86, LLVM rounds up to a supported vector length (as it did for length 3 below), though sometimes it rounds down and handles the leftover separately (as it does for length 5 without AVX). Note that in all cases the memory write is split into a set of valid writes, because LLVM isn't allowed to write to bytes that are out of range (there could be an atomic, unmapped memory, or memory-mapped hardware there). https://rust.godbolt.org/z/3xEKvfxhP

example::add3:
        mov     rax, rdi
        movaps  xmm0, xmmword ptr [rsi]
        addps   xmm0, xmmword ptr [rdx]
        extractps       dword ptr [rdi + 8], xmm0, 2
        movlps  qword ptr [rdi], xmm0
        ret

example::add5:
        mov     rax, rdi
        movss   xmm0, dword ptr [rsi + 16]
        movaps  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmmword ptr [rdx]
        addss   xmm0, dword ptr [rdx + 16]
        movss   dword ptr [rdi + 16], xmm0
        movaps  xmmword ptr [rdi], xmm1
        ret
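
The godbolt link isn't reproduced here, but plain-array equivalents of the `add3`/`add5` shown above would look something like this (a sketch, not necessarily the exact source behind the link):

```rust
// Element-wise additions over odd-length arrays. LLVM is free to lower
// these with padded vector registers (as in the assembly above) while
// splitting the stores so no out-of-bounds bytes are written.
pub fn add3(a: &[f32; 3], b: &[f32; 3]) -> [f32; 3] {
    let mut r = [0.0; 3];
    for i in 0..3 {
        r[i] = a[i] + b[i];
    }
    r
}

pub fn add5(a: &[f32; 5], b: &[f32; 5]) -> [f32; 5] {
    let mut r = [0.0; 5];
    for i in 0..5 {
        r[i] = a[i] + b[i];
    }
    r
}

fn main() {
    assert_eq!(add3(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]), [5.0, 7.0, 9.0]);
    assert_eq!(add5(&[1.0; 5], &[1.0; 5]), [2.0; 5]);
}
```

The key point is in the stores: the length-3 result is written as an 8-byte write plus a 4-byte write, never as a full 16-byte vector store.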

workingjubilee commented 2 years ago

re: some older objections:

Concretely, with this we'd have explicit always-padding/uninitialized/undefined fields in some of the simd types.

Alas, people are already writing code that exhibits this problem via #[repr(simd)]. In our task we are charged with trying to resolve these problems, not necessarily with punting them to (other) Rust programmers. We may choose to do so, but it will not make the problem go away.

The underlying intrinsics will still read these padding/uninit fields and mix them into the result in ways that might be problematic.

It seems LLVM already has an answer for this. It may not be the answer we want, but it is an answer.

What if there's an arch in the future that ends up modeling non-POT vectors and then doesn't map well to the "uses next power of two up"? I think this is unlikely, but I think we should shy away from defining behavior if we can avoid it.

Sure, the choice may be a mismatch for some obscure hardware architecture, but Rust already rejects many of those. It's not clear we can get away with ignoring non-pow2 vectors, as both SVE2 and RVV imply them, and SVE2 has begun shipping in consumer devices.

It seems to me like this wouldn't actually help vek's code migrate to core::simd very much anyway, would it? repr(simd) isn't stabilizing with core::simd?

It's not, but it's also not actually clear that we shouldn't, in fact, commit to there being a single "SIMD container type" in the language which defines a behavior for how all vectors are stored, read, written, and passed, as an underlying device, like Layout or Box. Yes, vek can choose to wrap it somehow.

Apparently this isn't actually guaranteed, but I think it would probably require an RFC to change (and plausibly would break a lot of unsafe code in the wild — or rather, turn currently-sound interfaces to unsound-if-stride-and-size-dont-match ones)

"Sound but only by depending on implementation details that are not promised" is unsound in point of fact. The question is whether it should be made sound or not.

calebzulawski commented 2 years ago

The all_lane_counts crate feature was added in #300. We still need additional tests (particularly for things like swizzles). Once we settle on layout (or make sure there are no issues with the current implementation), I think we can enable it in rustc.