rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0

Boolean vectors should use hand-tuned any() and all() #362

Closed hsivonen closed 6 years ago

hsivonen commented 6 years ago

Thank you for adding portable SIMD and boolean vectors!

In stdsimd, any() and all() are based on LLVM's cross-ISA bitwise reductions across the vector, so these operations fail to use the type-system information that is the reason for the boolean vector types' existence: that all the bits of each lane are the same. I.e. in b8x16, each lane is either 0x00 or 0xFF.

This means that looking at all the bits is unnecessary: for each lane, examining any one bit is sufficient. In the SSE2 case, _mm_movemask_epi8 gathers the high bit of each lane of a u8x16 into a 16-bit mask. It's what the simd crate uses.
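For illustration, a minimal hand-tuned sketch (not the stdsimd implementation; it assumes SSE2 and that every lane is 0x00 or 0xFF) is just one pmovmskb plus one scalar comparison per operation:

// Sketch only: hand-tuned any()/all() for a 16-lane byte mask on SSE2,
// assuming every lane is either 0x00 or 0xFF.
#[cfg(target_arch = "x86_64")]
pub mod sse2_sketch {
    use std::arch::x86_64::{__m128i, _mm_movemask_epi8};

    /// True if any lane is all-ones.
    pub unsafe fn any_8x16(v: __m128i) -> bool {
        // pmovmskb gathers one bit per lane; non-zero iff some high bit is set.
        _mm_movemask_epi8(v) != 0
    }

    /// True if all lanes are all-ones.
    pub unsafe fn all_8x16(v: __m128i) -> bool {
        // All 16 mask bits are set iff every lane's high bit is set.
        _mm_movemask_epi8(v) == 0xFFFF
    }
}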

I've uploaded a demo crate that exports uninlined versions of any() and all() on u8x16 using both simd and stdsimd. The generated code for the simd crate versions is much simpler. Code generated using stdsimd should be as good.

    .text
    .file   "anyall0-66776763bbd7db1b14e447f414e2cf9a.rs"
    .section    .text.stdsimd_any_8x16,"ax",@progbits
    .globl  stdsimd_any_8x16
    .p2align    4, 0x90
    .type   stdsimd_any_8x16,@function
stdsimd_any_8x16:
    .cfi_startproc
    movdqa  (%rdi), %xmm0
    pshufd  $78, %xmm0, %xmm1
    por %xmm0, %xmm1
    pshufd  $229, %xmm1, %xmm0
    por %xmm1, %xmm0
    movdqa  %xmm0, %xmm1
    psrld   $16, %xmm1
    por %xmm0, %xmm1
    movdqa  %xmm1, %xmm0
    psrlw   $8, %xmm0
    por %xmm1, %xmm0
    movd    %xmm0, %eax
    testb   %al, %al
    setne   %al
    retq
.Lfunc_end0:
    .size   stdsimd_any_8x16, .Lfunc_end0-stdsimd_any_8x16
    .cfi_endproc

    .section    .text.stdsimd_all_8x16,"ax",@progbits
    .globl  stdsimd_all_8x16
    .p2align    4, 0x90
    .type   stdsimd_all_8x16,@function
stdsimd_all_8x16:
    .cfi_startproc
    movdqa  (%rdi), %xmm0
    pshufd  $78, %xmm0, %xmm1
    pand    %xmm0, %xmm1
    pshufd  $229, %xmm1, %xmm0
    pand    %xmm1, %xmm0
    movdqa  %xmm0, %xmm1
    psrld   $16, %xmm1
    pand    %xmm0, %xmm1
    movdqa  %xmm1, %xmm0
    psrlw   $8, %xmm0
    pand    %xmm1, %xmm0
    movd    %xmm0, %eax
    testb   %al, %al
    setne   %al
    retq
.Lfunc_end1:
    .size   stdsimd_all_8x16, .Lfunc_end1-stdsimd_all_8x16
    .cfi_endproc

    .section    .text.simd_any_8x16,"ax",@progbits
    .globl  simd_any_8x16
    .p2align    4, 0x90
    .type   simd_any_8x16,@function
simd_any_8x16:
    .cfi_startproc
    movdqa  (%rdi), %xmm0
    pmovmskb    %xmm0, %eax
    testl   %eax, %eax
    setne   %al
    retq
.Lfunc_end2:
    .size   simd_any_8x16, .Lfunc_end2-simd_any_8x16
    .cfi_endproc

    .section    .text.simd_all_8x16,"ax",@progbits
    .globl  simd_all_8x16
    .p2align    4, 0x90
    .type   simd_all_8x16,@function
simd_all_8x16:
    .cfi_startproc
    movdqa  (%rdi), %xmm0
    pmovmskb    %xmm0, %eax
    cmpl    $65535, %eax
    sete    %al
    retq
.Lfunc_end3:
    .size   simd_all_8x16, .Lfunc_end3-simd_all_8x16
    .cfi_endproc

    .section    ".note.GNU-stack","",@progbits
gnzlbg commented 6 years ago

Thanks for the report @hsivonen, this is pretty much expected at this point.

The problem here is that b8x16 is actually just a <16 x i8> LLVM vector. Because you can transmute any other <16 x i8> LLVM vector into it, LLVM does not know, in this particular case, that it contains only 0xffs and 0x00s.

If I add the following two functions:

pub fn stdsimd_eq_all_8x16(a: stdsimd::simd::i8x16, b: stdsimd::simd::i8x16) -> bool {
    stdsimd_all_8x16(a.eq(b))
}

pub fn simd_eq_all_8x16(a: simd::i8x16, b: simd::i8x16) -> bool {
    simd_all_8x16(a.eq(b))
}

You will see that they "pretty much" generate identical assembly (*).

While the current implementation makes it trivial to add custom implementations of and/or/xor for the boolean vector types, I'd rather try a bit harder to get LLVM to emit proper code-gen first. If that does not work, then I'll add custom implementations for these. EDIT: or just remove the reductions completely (they can be implemented in a crate external to stdsimd, like faster).


@rkruppe @alexcrichton I suspect that it would be enough to simd_cast the boolean vectors into LLVM vectors of type <N x i1> in the implementation of all/none/any to significantly improve code-gen here.


hsivonen commented 6 years ago

Because you can transmute any other <16 x i8> LLVM vector into it,

Transmutes are unsafe. If the programmer transmutes something to a boolean vector, it's up to the programmer to make sure the invariants of the type are upheld.

The whole point of having boolean vectors is to use the type system to communicate the precondition (all bits on a lane are the same) of any() and all() to enable guaranteed-_mm_movemask_epi8-based any() and all() on SSE2.

in this particular case, LLVM does not know that it only contains 0xffs or 0x00s.

It seems more reliable to have manually-written any() and all(), since LLVM doesn't have the information that boolean vector types are trying to signal.

just remove the reductions completely (they can be implemented in a crate external

This would be unfortunate. In particular, the ARMv7+NEON implementation is subtle enough that it should be in the standard library to save application programmers from themselves. (It's still unclear to me if I got the simd crate PR for the ARMv7+NEON versions right.) Also, it would be sad if adding support for SIMD on POWER8 required hunting around the crate ecosystem for outside-core implementations of any() and all() instead of being able to provide implementations in core.

gnzlbg commented 6 years ago

Transmutes are unsafe. If the programmer transmutes something to a boolean vector, it's up to the programmer to make sure the invariants of the type are upheld.

As mentioned, the problem is that we are currently not telling LLVM what the invariants on std::simd's boolean vectors are.

The whole point of having boolean vectors is to use the type system to communicate the precondition (all bits on a lane are the same) of any() and all() to enable guaranteed-_mm_movemask_epi8-based any() and all() on SSE2.

I strongly disagree. That's a consequence of boolean vectors, but it's not the point. std::simd's boolean vectors try to offer a way to convey to the optimizer that the code operates on vectors of boolean values, so that the optimizer can choose whatever instructions it deems fit for whatever ISA is being targeted.

This must be done in std because the compiler intrinsics necessary for it are not available for third-party crates.

This would be unfortunate. [...] it would be sad [...] hunting around the crate ecosystem [...] It seems more reliable to have manually-written any() and all() [...] ARMv7+NEON [...]

Anyone can do this today on nightly:

pub struct b8x16(std::arch::x86::__m128i);
impl b8x16 { fn all(self) -> bool { ... } }

put it in a crate, and have it work on stable Rust soon.
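Filled out a little (an illustrative sketch only, relying on sse2 being part of the x86_64 baseline; the names are made up):

// Sketch of the "do it in your own crate" approach described above,
// assuming x86_64 (where sse2 is a baseline feature) and that each lane
// of the wrapped vector is 0x00 or 0xFF.
#[cfg(target_arch = "x86_64")]
mod mask_sketch {
    use std::arch::x86_64::{__m128i, _mm_movemask_epi8};

    #[derive(Clone, Copy)]
    pub struct B8x16(pub __m128i);

    impl B8x16 {
        pub fn all(self) -> bool {
            // All 16 lanes have their high bit set.
            unsafe { _mm_movemask_epi8(self.0) == 0xFFFF }
        }

        pub fn any(self) -> bool {
            // At least one lane has its high bit set.
            unsafe { _mm_movemask_epi8(self.0) != 0 }
        }
    }
}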

The point of std is to allow users to do these things in their own crates, not to save them from having to use crates.io every time they need to do X. So I fundamentally disagree with all you said there as well.

alexcrichton commented 6 years ago

@gnzlbg for any and all it looks like we're using the "experimental" reductions in LLVM, maybe they're just not well tuned yet? Presumably we could go ahead and call the more optimized versions in particular cases, right? (aka the direct intrinsics)

hanna-kruppe commented 6 years ago

Yeah, the reduction intrinsics are new-ish and originally intended for ARM SVE if I remember correctly, so it wouldn't surprise me if the x86 backend simply doesn't have good lowering for them yet.

gnzlbg commented 6 years ago

@alexcrichton

for any and all it looks like we're using the "experimental" reductions in LLVM, maybe they're just not well tuned yet?

The reductions are working perfectly. If you write:

fn foo(x: i8x8) -> i8 { x.and() }

you get pretty good assembly code for it:

 punpcklbw xmm0, xmm0
 pshufd  xmm1, xmm0, 78
 pand    xmm1, xmm0
 pshufd  xmm0, xmm1, 229
 pand    xmm0, xmm1
 movdqa  xmm1, xmm0
 psrld   xmm1, 16
 pand    xmm1, xmm0
 movd    eax, xmm1
 test    al, al
 setne   al

The problem is that when @hsivonen writes:

fn foo(x: b8x8) -> bool { x.all() }

stdsimd translates it into fn foo(x: i8x8) -> bool { x.and() != 0 } and that is exactly what llvm is generating code for.

So at least in this case LLVM is doing exactly what we are telling it to do.

Also, if we call foo(a.eq(b)), then LLVM generates the instruction that @hsivonen wanted. Why? IMO it is because a.eq(b) maps to the generic intrinsic for vector comparisons, which returns a boolean LLVM vector. Even though we then sign-extend it to i8x8, LLVM still knows that the lanes of this i8x8 contain only 0x00 and 0xff, which map to false and true, and it is able to optimize the and reduction from a sequence of shuffles into a single instruction (the one @hsivonen wanted).

Does this sound like something a buggy implementation would be able to do?


Presumably we could go ahead and call the more optimized versions in particular cases, right?

If this were failing for one operation of one vector type on one ISA, and it were a filed and acknowledged LLVM bug, then adding a temporary workaround until the bug fix lands in Rust's LLVM would probably be the right call.

But this bug affects all portable vector operations for all boolean vectors on all ISAs, and considering all the cases above, it just looks like a stdsimd/rustc bug that we might be able to fix by just simd_casting from the boolean vectors into i1xN types and back before calling any llvm intrinsic on boolean vectors.

hsivonen commented 6 years ago

The whole point of having boolean vectors is to use the type system to communicate the precondition (all bits on a lane are the same) of any() and all() to enable guaranteed-_mm_movemask_epi8-based any() and all() on SSE2.

I strongly disagree. That's a consequence of boolean vectors, but it's not the point. std::simd's boolean vectors try offer a way to convey to the optimizer that the code operates on vectors of boolean values, so that the optimizer can choose whatever instructions it deems fit for whatever ISA is being targeted.

When the target is SSE2, could there possibly be an instruction sequence other than pmovmskb followed by a single ALU comparison that would be more performant? Is there any scenario where a failure to generate pmovmskb (and no other SIMD instructions) when targeting SSE2 is not a performance bug?

Anyone can do this today on nightly:

pub struct b8x16(std::arch::x86::__m128i);
impl b8x16 { fn all(self) -> bool { ... } }

put it on a crate, and have it work on stable Rust soon.

The point of std is to allow users to do these things on their crates, not to prevent them from having to use crates.io every time they need to do X. So I fundamentally disagree with all you said there as well.

If we look at std as it exists prior to all the SIMD stuff and try to infer a purpose from what's there, I don't think the conclusion is that something should be in std only if it couldn't be implemented outside std.

Clearly, std provides portability beyond merely exposing the primitives that would allow a crates.io crate to provide portability using those primitives plus conditional compilation. Example: abstracting away the fact that NT file paths are sequences of 16-bit units instead of sequences of 8-bit units as on other mainstream platforms.

Furthermore, std provides stuff that needs to be fast and correct even when there's no concern about portability abstractions or access to intrinsics. Example: Binary search.

In this light, it seems entirely reasonable to me to take the position that std should provide any() and all() implementations that are as fast as they can be and that get extended to cover new ISAs as Rust expands its ISA reach, and that it's a std bug if programmers working outside std need to reach for std::arch to get faster any() and all().

But this bug affects all portable vector operations for all boolean vectors on all ISAs,

FWIW, in the case of aarch64, you might get away with switching to a different portable reduction across the vector that "happens" to map directly to a reduction that aarch64 provides a single instruction for.

and considering all the cases above, it just looks like a stdsimd/rustc bug that we might be able to fix by just simd_casting from the boolean vectors into i1xN types and back before calling any llvm intrinsic on boolean vectors.

Are the semantics of LLVM's internal SIMD casts such that the cast itself trusts that the invariant (all bits in each lane are the same) holds, so that there's no need to generate code to look at all the bits?

If this can be fixed within the constraint of rustc currently using LLVM 6 such that any() and all() consistently generate no more SIMD instructions than pmovmskb when targeting SSE2, that's fine (and preferable!). If not, I think it's not OK to avoid doing the thing that the simd crate does and wait for the generic LLVM-level fix. (That is, for the purpose of getting Firefox off the simd crate and using stdsimd, I think the performance regression that currently results is not OK.)

gnzlbg commented 6 years ago

Here is an update. I've sent a rustc PR (https://github.com/rust-lang/rust/pull/48983) and a stdsimd PR (#371) that improve the situation from:

; stdsimd_all_8x16
    movdqa  (%rdi), %xmm0
    pshufd  $78, %xmm0, %xmm1
    pand    %xmm0, %xmm1
    pshufd  $229, %xmm1, %xmm0
    pand    %xmm1, %xmm0
    movdqa  %xmm0, %xmm1
    psrld   $16, %xmm1
    pand    %xmm0, %xmm1
    movdqa  %xmm1, %xmm0
    psrlw   $8, %xmm0
    pand    %xmm1, %xmm0
    movd    %xmm0, %eax
    testb   %al, %al
    setne   %al

to this

  pand %xmm0, %xmm1 # xmm0 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
  pcmpeqb %xmm0, %xmm1
  pmovmskb %xmm1, %eax
  xorl %ecx, %ecx
  cmpl $65535, %eax # imm = 0xFFFF
  movl $-1, %eax
  cmovnel %ecx, %eax
  testb $1, %al

which is still far from perfect, but way better than the shuffle sequence that was being generated before.

The pand with ones and the subsequent pcmpeqb instruction are a no-op if the input is a boolean vector, but I haven't been able to trick LLVM into removing them yet. I've filed an LLVM bug about this here: https://bugs.llvm.org/show_bug.cgi?id=36702 We'll see what happens with that. Maybe it's not a bug and I am just missing the obvious way of telling LLVM about the invariants of the input.

gnzlbg commented 6 years ago

The simd crate only provides an implementation of all() for bool8ix16 in the sse2 module, which should not be available if sse2 is not available. ~~However, compiling without sse2 still produces a pmovmskb.~~

How can this even happen?

EDIT: It's a bug: the simd crate currently requires sse2.

If you fix the current compilation error and then try to compile without sse2 (e.g. via RUSTFLAGS="-C target-feature=-sse2"), the crate does not compile because, among other things, all() for bool8ix16 is missing.

gnzlbg commented 6 years ago

Also, the workarounds in the simd crate should not work when compiling globally without sse2 and using #[target_feature] to enable it locally. That is:

#[target_feature(enable = "sse2")]
pub unsafe fn foo(b: bool8ix16) -> bool {
    b.all()
}

won't be able to use pmovmskb, because cfg(target_feature = "sse2") inside the simd crate is false: the simd crate was compiled without sse2 enabled globally.
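A minimal sketch of that mechanism (the constant name is made up for illustration):

// cfg!(target_feature) is evaluated when *this* crate is compiled, not at
// the call site. If this crate is built without sse2, the constant below is
// false even when a downstream caller is compiled with
// #[target_feature(enable = "sse2")].
pub const CRATE_BUILT_WITH_SSE2: bool = cfg!(target_feature = "sse2");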

Therefore, we can't really work around these issues in either rustc or stdsimd without propagating #[target_feature] across function calls such that the cfg(target_feature)s inside them are expanded again with the updated feature set and the functions re-compiled.


EDIT: for sse2 this does not matter much because it is enabled by default on the i686 and x86_64 tier 1 targets, but sse2 is just an example. This will matter for any other intrinsics for which LLVM might produce bad code.

hsivonen commented 6 years ago

For my use cases, I care about the 128-bit reductions being fast when everything is compiled with SSE2 enabled or when everything is compiled with NEON enabled.

gnzlbg commented 6 years ago

For my use cases, I care about the 128-bit reductions being fast when everything is compiled with SSE2 enabled or when everything is compiled with NEON enabled.

Sure, I am preparing a PR with workarounds for those. Just be aware that if you start caring about AVX/AVX2 and you can't compile everything with it enabled, then you are going to have to manually select which instructions to use until the upstream LLVM bugs are fixed, because right now we can't really work around these anywhere.

gnzlbg commented 6 years ago

if you start caring about AVX/AVX2 and you can't compile everything with that enabled

Probably not even that. Those using std::simd might actually need to recompile std with avx2 enabled to benefit from this.

gnzlbg commented 6 years ago

@hsivonen which arm targets are you using? I need to check whether they have NEON enabled in rustc by default.

hsivonen commented 6 years ago

which arm targets are you using?

Currently, I'm testing on aarch64-unknown-linux-gnu, armv7-unknown-linux-gnueabihf, and occasionally armv7-unknown-linux-gnueabihf with the neon feature added from the command line. The reason I don't do the latter all the time is that the ARMv7+NEON support is broken in the simd crate, and I've been hoping to switch to stdsimd instead of fixing it properly.

Mozilla ships with armv7-linux-androideabi and additionally builds nightlies but does not yet ship releases with aarch64-linux-android.

I need to check whether they have NEON enabled in rustc by default.

$ rustc -Z unstable-options --target armv7-linux-androideabi --print target-spec-json | grep features
  "features": "+v7,+thumb-mode,+thumb2,+vfp3,+d16,-neon",
$ rustc -Z unstable-options --target armv7-unknown-linux-gnueabihf --print target-spec-json | grep features
  "features": "+v7,+vfp3,+d16,+thumb2,-neon",

C++ code in Firefox for ARMv7 Android is built with NEON unconditionally enabled, so I think Firefox would benefit from Rust re-introducing an ARMv7+NEON target both for Android and GNU/Linux (Android for shipping and GNU/Linux for testing).

gnzlbg commented 6 years ago

@hsivonen so I am going to send a PR today that fixes this issue on master, and I want to ask you to try it by using stdsimd master branch and let me know if it works.

You won't get good code-gen using std::simd unless you recompile std with neon enabled on both the armv7 and aarch64 targets. That is, either you need to use xargo, or we would need to add two new targets, {armv7,aarch64}neon-unknown-linux-gnu, to rustc somehow (cc @alexcrichton: what are our options here?)

hsivonen commented 6 years ago

Thanks. I filed an issue about NEON-enabled ARMv7 targets and I intend to try to write a PR for it now. (aarch64 should be fine already.)