rust-lang / packed_simd

Portable Packed SIMD Vectors for Rust standard library
https://rust-lang.github.io/packed_simd/packed_simd_2/
Apache License 2.0
600 stars 74 forks source link

Update the shuffle1_dyn API #228

Open gnzlbg opened 5 years ago

gnzlbg commented 5 years ago

Currently, shuffle1_dyn is a safe function, but if the index is out-of-bounds its behavior is undefined: it will panic! in some archs, and return an unspecified lane value in others - this is a bug (allowing safe Rust to have UB).

See https://github.com/WebAssembly/simd/issues/68 , https://github.com/WebAssembly/simd/issues/24, https://github.com/WebAssembly/simd/pull/71

where @zeux mentioned:

x86_64: If bit 7 is 1, set to 0, otherwise use lower 4 bits for lookup ARM: if index is out of range (0-15), set to 0 PPC64: use lower 4 bits for lookup (no out of range handling) MIPS: if bit 6 or bit 7 is 1, set to 0; otherwise use lower 4 bits for lookup (or rather use lower 5 bits for lookup into a table that has 32 elements constructed from 2 input vectors, but if both vectors are the same then it effectively means bits 4,5 are ignored) RISCV: if index is out of range (0-15), set to 0.

[...] if the specification says that all out-of-range indices return 0, then ARM and RISCV will use the native instruction, x86_64/PPC64/MIPS will have to generate additional checks using a 3-instruction sequence like "compare with 15; shuffle; andnot".

we could say that if index is 0x80 we return 0, if index is 0-15 we return the index result, otherwise the resulting byte is unspecified. This seems like the cleanest way to unify x86_64 behavior with the rest, resulting in zero-cost implementation for x86_64, ARM, MIPS, RISCV, and a 3-instruction sequence (compare to 0x80, shuffle, andnot) on PPC64.

reinerp commented 4 years ago

We should add an unsafe shuffle1_dyn_unchecked API that has undefined behavior if the index is out-of-bounds and not 0x80.

We should change shuffle1_dyn to panic! if any index is out-of-bounds and not 0x80 (that is, shuffle1_dyn checks the indices and calls shuffle1_dyn_unchecked).

Both APIs should set the resulting lane to 0 if the index is 0x80 - we can provide a: const SHUFFLE1_DYN_ZERO: usize = 0x80; constant for this.Both APIs should set the resulting lane to 0 if the index is 0x80 - we can provide a: const SHUFFLE1_DYN_ZERO: usize = 0x80; constant for this.

I'd prefer the following, which is equally as consistent across architectures, and is defined in more cases: instead of the "return-zero" case being ==0x80, instead make it >=0x80. On all of x86_64, ARM, MIPS, RISCV it has a zero-cost implementation. On PPC64 I'd guess (although I don't know that ISA) it still has a 3-instruction sequence (compare to 0x80, shuffle, andnot), but this time the comparison is >= instead of ==.

With >=0x80 the implementation is defined in more cases, which is more useful for users (admittedly it constrains future hardware implementors). An especially useful case is that we'd like an index of 0xFF to return zero; this comes up naturally when ORing with m8 mask objects.