rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust
Apache License 2.0

Shl and Shr implementations should accept a (polymorphic?) scalar argument rather than a vector #225

Closed fleabitdev closed 1 year ago

fleabitdev commented 2 years ago

The operation "shift with a different offset in each lane" is not portable: it's supported by NEON, unsupported by SSE 4.1 and Wasm, and only partially-supported by AVX2. It would probably make more sense for types like Simd<i32, LANES> to implement Shl<i32>, rather than Shl<Simd<i32, LANES>>.

Scalar integer types currently implement Shl and Shr for every other integer type, so mirroring that for Simd might be worth considering. One downside is that it would make the struct Simd rustdoc page even longer.
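
For illustration, a scalar-RHS shift could look something like the sketch below (assuming nightly's portable_simd feature; the newtype is only there to keep the example compilable outside the library, where the impl would really live):

use std::ops::Shl;
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// Stand-in newtype: the real impl would be on Simd itself.
struct I32s<const LANES: usize>(Simd<i32, LANES>)
where
    LaneCount<LANES>: SupportedLaneCount;

impl<const LANES: usize> Shl<i32> for I32s<LANES>
where
    LaneCount<LANES>: SupportedLaneCount,
{
    type Output = Self;

    fn shl(self, rhs: i32) -> Self {
        // Splat internally, so the existing vector-by-vector shift (and llvm's
        // uniform-shift optimization) does the actual work.
        I32s(self.0 << Simd::splat(rhs))
    }
}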

Lokathor commented 2 years ago

If all lanes are the same then llvm does the expected thing and it "just works out".
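
For example, something like this should compile to a single whole-register shift on x86-64 (a minimal illustration of that point, using the nightly std::simd API):

use std::simd::u16x8;

// The shift amount is visibly the same in every lane, so llvm can lower the
// vector-by-vector shift to one uniform shift (e.g. psrlw).
fn halve(x: u16x8) -> u16x8 {
    x >> u16x8::splat(1)
}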

fleabitdev commented 2 years ago

If that's the only happy path, isn't it better to encode it in the type system?


use std::simd::u16x8;

struct RoundingDivide {
    to_add: u16x8,
    to_shift: u16x8,
}

impl RoundingDivide {
    // a conscientious programmer pre-splats their shift offsets, "for performance"...
    fn new(to_shift: u16) -> RoundingDivide {
        RoundingDivide {
            to_add: u16x8::splat(1 << (to_shift - 1)),
            to_shift: u16x8::splat(to_shift),
        }
    }

    // ...unknowingly risking a large, counterintuitive performance penalty here
    fn divide(&self, arg: u16x8) -> u16x8 {
        (arg + self.to_add) >> self.to_shift
    }
}

Lokathor commented 2 years ago

Well my thoughts were about like this:

That said, now that you mention it, it really does suck when llvm can't backtrack through a previously splatted simd value to see that all the lanes hold the same value, and so misses the optimization.

So yeah, now I agree that shift by scalar should be provided.

If only rustdoc were better about trait impl bloat on pages.

calebzulawski commented 2 years ago

It may be desirable from a usability perspective, but something like a << Simd::splat(b) is perfectly fine, and in fact this is the underlying LLVM IR that gets produced. LLVM is capable of optimizing shifts when all lanes are the same value.

Shifting by a vector is not strictly unsupported on other platforms; it's implemented with blends, such as pshufb on x86-64.

Lokathor commented 2 years ago

That's exactly what was just discussed.

However, does llvm produce the good code when you didn't immediately splat the value? If instead you've passed in a simd value that was previously splatted elsewhere, perhaps during some long-ago constructor operation?

calebzulawski commented 2 years ago

Sorry, what I am saying isn't just that LLVM can optimize it; it's that LLVM doesn't even have a concept of "shift a vector by a scalar". All vector shifts are done by other vectors, and optimized when possible.
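
Roughly, a shift by a "scalar" on the Rust side still becomes a vector-by-vector shl in the IR (the IR in the comments below is paraphrased, not exact compiler output):

use std::simd::u16x8;

fn shl_uniform(x: u16x8, n: u16) -> u16x8 {
    // Approximately what this lowers to: the scalar is broadcast into a
    // vector first, because llvm's vector `shl` only takes a vector RHS.
    //   %ins = insertelement <8 x i16> poison, i16 %n, i64 0
    //   %s   = shufflevector <8 x i16> %ins, <8 x i16> poison, <8 x i32> zeroinitializer
    //   %r   = shl <8 x i16> %x, %s
    x << u16x8::splat(n)
}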

Lokathor commented 2 years ago

Sure. That makes sense.

However, if we offer a scalar shift to the programmer and the library just splats the value internally anyway, the user is much more likely to stay on the happy path where we're confident that llvm will do the optimization.

Because otherwise it's easy for the user to write code like the example above: code where they think they've helpfully "pre-splatted" the value, saving work at the use site, when in fact they've accidentally tricked llvm out of seeing an optimization.

And the only cost of having scalar shifts that splat internally is a small docs cost (and we can maybe ask the rustdoc folks to help fix that somehow).

fleabitdev commented 2 years ago

@calebzulawski: You make a good point that removing the shift-by-vector API would make some types of shift more difficult to implement.

use std::simd::u16x8;

// this should compile to two shifts and a blend
fn interleaved_shift(x: u16x8, ev: u16, od: u16) -> u16x8 {
    x << u16x8::from_array([ev, od, ev, od, ev, od, ev, od])
}

On x86-64, is there any general implementation of u16x8::shl(u16x8), accepting non-constant inputs, which you'd expect to be less than eight times slower than a constant shift?
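
(The naive baseline I have in mind is lane-by-lane, roughly the sketch below; the function name is just for illustration.)

use std::simd::u16x8;

// Scalarized fallback: eight independent scalar shifts plus the array
// round-trips, which is the "eight times slower" yardstick above.
fn shl_per_lane(x: u16x8, n: u16x8) -> u16x8 {
    let mut lanes = x.to_array();
    let shifts = n.to_array();
    for i in 0..8 {
        lanes[i] <<= shifts[i];
    }
    u16x8::from_array(lanes)
}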

Lokathor commented 2 years ago

Oh to be clear I don't support removing the vector shift at all.

I just think we should add scalar shifting without removing anything.

fleabitdev commented 2 years ago

Has there been any previous discussion of this library's portability goals?

Speaking as a user, if std::simd provides APIs which are an order of magnitude slower on x86-64, that seems like a serious footgun. However, I'm conscious that Wasm SIMD includes instructions like i8x16.bitmask which are quite slow on NEON (WebAssembly/simd#131). I'm not sure where the line is being drawn.

Lokathor commented 2 years ago

Offering only operations which are sure to be fast on all devices is basically not an option. There would be gaps all over. Even if we pretend there are just three kinds of device (wasm, neon, and avx2, for example), taking only the intersection of those three would still leave plenty of unpleasant gaps.

So we're resigned from the start to simply offering a good API first, with hardware considerations coming second.

calebzulawski commented 2 years ago

To add onto that, we never make the assumption that any particular operation will compile to a single instruction. In some cases, like this one, there are operations which may not be a single instruction, but still benefit from the compiler's ability to produce the optimal implementation.

workingjubilee commented 2 years ago

The sense that std::simd is portable includes "portable to machines which don't implement vector operations at all".

The goal is really that, when used naively for a program with a random assortment of mathematical operations, std::simd will offer a speedup when targeting a common vector-register-having architecture; that ideally you can interlace it with architectural intrinsics or inline assembly to cover odd niches; and that sometimes it will scalarize an operation and That's Fine, Actually, for reasons Lokathor already noted (namely: you are stuck doing it anyway). Making 80% of a calculation loop +300% as fast (numbers pulled out of a hat) can still be a +240% total speedup in the performance-critical segment.

But I think there is a generous assumption that "compiles to a single instruction" means "executes in O(1) time". That's not actually the case: there are SIMD instructions that execute in ~O(n) time on some (or even all) microarchitectures. Now, obviously, no one expects VGATHER*PS to be fast, but it is also particularly common for early AVX processors to have the property that a VEX-prefixed op (so, using ymm registers) takes twice as much time, giving no actual gain over xmm, even on common arithmetic ops.

But, of course, if the compiler selects that "slow" operation for some code, the same code accelerates on later microarchitectures. Meanwhile, a JIT compiler like the one used for dispatching WebAssembly instructions has intimate knowledge of the processor at runtime, so in theory it can always choose the code that is fastest now, but it has to make that instruction selection quickly. So you can see, even from that simple example, that we face different design pressures for improving performance in most cases, and thus our definition of "portability" is sculpted to be different from WebAssembly SIMD's.

ghost commented 2 years ago

Is there any consensus here regarding whether Shl<Rhs = Int>/Shr<Rhs = Int> should be implemented for Simd? (Would a PR be welcome?)

Although, at a minimum, shouldn't Shl<Rhs = u8>/Shr<Rhs = u8> be all that's strictly necessary for Simd, considering that the largest Simd-supported integer width is 64 bits? I find the Rust standard library a little odd in this respect, but of course consistency ought to be maintained...