Closed: hsivonen closed this issue 6 years ago.
Thanks for the report @hsivonen, this is pretty much expected at this point.

The problem here is that `b8x16` is actually just an `<i8 x 16>` LLVM vector. Because you can transmute any other `<i8 x 16>` LLVM vector into it, in this particular case LLVM does not know that it only contains `0xff`s or `0x00`s.
If I add the following two functions:

```rust
pub fn stdsimd_eq_all_8x16(a: stdsimd::simd::i8x16, b: stdsimd::simd::i8x16) -> bool {
    stdsimd_all_8x16(a.eq(b))
}

pub fn simd_eq_all_8x16(a: simd::i8x16, b: simd::i8x16) -> bool {
    simd_all_8x16(a.eq(b))
}
```

you will see that they "pretty much" generate identical assembly (*).
While the current implementation makes it trivial to add custom implementations of `and`/`or`/`xor` for the boolean vector types, I'd rather try a bit harder to get LLVM to emit proper code-gen first. If that does not work, then I'll add custom implementations for these. EDIT: or just remove the reductions completely (they can be implemented in a crate external to `stdsimd`, like `faster`).
@rkruppe @alexcrichton I suspect that it would be enough to `simd_cast` the boolean vectors into LLVM vectors of type `<i1 x N>` in the implementation of `all`/`none`/`any` to significantly improve code-gen here.
> Because you can transmute any other `<i8 x 16>` LLVM vector into it,

Transmutes are `unsafe`. If the programmer transmutes something to a boolean vector, it's up to the programmer to make sure the invariants of the type are upheld.
The whole point of having boolean vectors is to use the type system to communicate the precondition (all bits on a lane are the same) of `any()` and `all()` to enable guaranteed-`_mm_movemask_epi8`-based `any()` and `all()` on SSE2.
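As an illustration of the codegen being asked for, here is a hedged, portable model (not the stdsimd implementation; the helper names are made up) of what `_mm_movemask_epi8` computes: it packs the high bit of each of the 16 lanes into a 16-bit mask, so `all()` on a boolean vector collapses to a single integer compare.

```rust
// Portable model of _mm_movemask_epi8: collect the high bit of each
// of the 16 lanes into one 16-bit mask.
fn movemask_epi8(lanes: [u8; 16]) -> u32 {
    lanes.iter().enumerate().fold(0u32, |mask, (i, &lane)| {
        mask | (((lane >> 7) as u32) << i)
    })
}

// Given the boolean-vector invariant (each lane is 0x00 or 0xFF),
// one bit per lane is enough to decide `all()`.
fn all_b8x16(lanes: [u8; 16]) -> bool {
    movemask_epi8(lanes) == 0xFFFF
}
```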
> in this particular case, LLVM does not know that it only contains 0xffs or 0x00s.

It seems more reliable to have manually-written `any()` and `all()`, since LLVM doesn't have the information that boolean vector types are trying to signal.
> just remove the reductions completely (they can be implemented in a crate external

This would be unfortunate. In particular, the ARMv7+NEON implementation is subtle enough that it should be in the standard library to save application programmers from themselves. (It's still unclear to me whether I got the `simd` crate PR for the ARMv7+NEON versions right.) Also, it would be sad if adding support for SIMD on POWER8 required hunting around the crate ecosystem for outside-`core` implementations of `any()` and `all()` instead of being able to provide implementations in `core`.
> Transmutes are unsafe. If the programmer transmutes something to a boolean vector, it's up to the programmer to make sure the invariants of the type are upheld.

As mentioned, the problem is that we are currently not telling LLVM what the invariants on `std::simd`'s boolean vectors are.
> The whole point of having boolean vectors is to use the type system to communicate the precondition (all bits on a lane are the same) of any() and all() to enable guaranteed-_mm_movemask_epi8-based any() and all() on SSE2.

I strongly disagree. That's a consequence of boolean vectors, but it's not the point. `std::simd`'s boolean vectors try to offer a way to convey to the optimizer that the code operates on vectors of boolean values, so that the optimizer can choose whatever instructions it deems fit for whatever ISA is being targeted. This must be done in `std` because the compiler intrinsics necessary for it are not available to third-party crates.
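To make the invariant concrete, here is a portable sketch (the type and method names are hypothetical, not the stdsimd API): every lane of a boolean vector is 0x00 or 0xFF, and the safe constructor is what upholds that invariant, so `any()`/`all()` only need one bit per lane.

```rust
// Hypothetical boolean vector: invariant is that every lane is 0x00 or 0xFF.
#[derive(Clone, Copy)]
struct B8x16([u8; 16]);

impl B8x16 {
    // Safe constructor: derive each lane mask from a bool, so the
    // invariant holds by construction.
    fn splat_from(lanes: [bool; 16]) -> Self {
        let mut v = [0u8; 16];
        for (dst, &b) in v.iter_mut().zip(lanes.iter()) {
            *dst = if b { 0xFF } else { 0x00 };
        }
        B8x16(v)
    }

    // Because of the invariant, examining a single bit per lane suffices.
    fn all(self) -> bool {
        self.0.iter().all(|&lane| lane & 0x80 != 0)
    }

    fn any(self) -> bool {
        self.0.iter().any(|&lane| lane & 0x80 != 0)
    }
}
```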
> This would be unfortunate. [...] it would be sad [...] hunting around the crate ecosystem [...] It seems more reliable to have manually-written any() and all() [...] ARMv7+NEON [...]

Anyone can do this today on nightly:

```rust
pub struct b8x16(std::arch::x86::__m128i);

impl b8x16 {
    fn all(self) -> bool { /* ... */ }
}
```

put it in a crate, and have it work on stable Rust soon. The point of `std` is to allow users to do these things in their crates, not to prevent them from having to use crates.io every time they need to do X. So I fundamentally disagree with all you said there as well.
@gnzlbg for `any` and `all` it looks like we're using the "experimental" reductions in LLVM; maybe they're just not well tuned yet? Presumably we could go ahead and call the more optimized versions in particular cases, right? (aka the direct intrinsics)
Yeah, the reduction intrinsics are new-ish and originally intended for ARM SVE if I remember correctly, so it wouldn't surprise me if the x86 backend simply doesn't have good lowering for them yet.
@alexcrichton
> for any and all it looks like we're using the "experimental" reductions in LLVM, maybe they're just not well tuned yet?
The reductions are working perfectly. If you write:

```rust
fn foo(x: i8x8) -> i8 { x.and() }
```

you get pretty good assembly code for it:

```asm
punpcklbw xmm0, xmm0
pshufd xmm1, xmm0, 78
pand xmm1, xmm0
pshufd xmm0, xmm1, 229
pand xmm0, xmm1
movdqa xmm1, xmm0
psrld xmm1, 16
pand xmm1, xmm0
movd eax, xmm1
test al, al
setne al
```
The problem is that when @hsivonen writes:

```rust
fn foo(x: b8x8) -> bool { x.all() }
```

`stdsimd` translates it into `fn foo(x: i8x8) -> bool { x.and() != 0 }`, and that is exactly what LLVM is generating code for. So at least in this case LLVM is doing exactly what we are telling it to do.
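The lowering described above can be modeled in portable scalar code (a sketch for illustration; `all_via_and_reduction` is a made-up name, not a stdsimd function):

```rust
// Model of `fn foo(x: i8x8) -> bool { x.and() != 0 }`: bitwise-AND all
// lanes together, then compare the reduced value against zero.
fn all_via_and_reduction(lanes: [i8; 8]) -> bool {
    // -1i8 (all bits set) is the identity element for bitwise AND.
    let reduced = lanes.iter().fold(-1i8, |acc, &x| acc & x);
    reduced != 0
}
```

For boolean input (lanes of -1 or 0) this gives the right answer, but LLVM must inspect every bit of every lane unless it can prove the invariant.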
Also, if we call `foo(a.eq(b))`, then LLVM generates the instruction that @hsivonen wanted. Why? IMO it is because `a.eq(b)` maps to the generic intrinsic for vector comparisons, which returns a boolean LLVM vector, and even though we are then zero-extending it to `i8x8`, LLVM is still able to know that the lanes of this `i8x8` contain only `0x00` and `0xff`, which map to `true` and `false`, and is able to optimize the `and` reduction from a sequence of shuffles into a single instruction (the one @hsivonen wanted). Does this sound like something a buggy implementation would be able to do?
> Presumably we could go ahead and call the more optimized versions in particular cases, right?
If this were failing for one operation of one vector type on one ISA, and it were a filed and acknowledged LLVM bug, then adding a temporary workaround until the bug fix lands in Rust's LLVM would probably be the right call.
But this bug affects all portable vector operations for all boolean vectors on all ISAs, and considering all the cases above, it just looks like a `stdsimd`/`rustc` bug that we might be able to fix by just `simd_cast`ing from the boolean vectors into `i1xN` types and back before calling any LLVM intrinsic on boolean vectors.
> The whole point of having boolean vectors is to use the type system to communicate the precondition (all bits on a lane are the same) of any() and all() to enable guaranteed-_mm_movemask_epi8-based any() and all() on SSE2.

> I strongly disagree. That's a consequence of boolean vectors, but it's not the point. std::simd's boolean vectors try to offer a way to convey to the optimizer that the code operates on vectors of boolean values, so that the optimizer can choose whatever instructions it deems fit for whatever ISA is being targeted.
When the target is SSE2, could there possibly be an instruction other than `pmovmskb` followed by a single ALU comparison that would be more performant? Is there any scenario where a failure to generate `pmovmskb` (and no other SIMD instructions) when targeting SSE2 is not a performance bug?
> Anyone can do this today on nightly:
>
> ```rust
> pub struct b8x16(std::arch::x86::__m128i);
> impl b8x16 { fn all(self) -> bool { /* ... */ } }
> ```
>
> put it in a crate, and have it work on stable Rust soon. The point of `std` is to allow users to do these things in their crates, not to prevent them from having to use crates.io every time they need to do X. So I fundamentally disagree with all you said there as well.
If we look at `std` as it exists prior to all the SIMD stuff and try to infer a purpose from what's there, I don't think that the result is that stuff should be in `std` only if it couldn't be implemented outside `std`.

Clearly, `std` provides portability beyond providing the primitives that would allow a crates.io crate to provide portability using those primitives plus conditional compilation. Example: abstracting away the fact that NT file paths are sequences of 16-bit units instead of sequences of 8-bit units like on other mainstream platforms.
Furthermore, `std` provides stuff that needs to be fast and correct even when there's no concern about portability abstractions or access to intrinsics. Example: binary search.
In this light, it seems entirely reasonable to me to take the position that `std` should provide `any()` and `all()` such that they are as fast as they can be and their implementations get extended to cover new ISAs as Rust expands its ISA reach, and that it's a `std` bug if programmers working outside `std` need to reach for `std::arch` to get faster `any()` and `all()`.
> But this bug affects all portable vector operations for all boolean vectors on all ISAs,

FWIW, in the case of AArch64, you might get away with switching to a different portable reduction across the vector that "happens" to map directly to a reduction that AArch64 provides a single instruction for.
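For example (a portable sketch of the idea, not actual NEON code): AArch64 provides single-instruction across-vector max/min reductions (`umaxv`/`uminv`), so on a boolean vector `any()` and `all()` can be phrased as max/min reductions instead of bitwise ones.

```rust
// Portable model: given the boolean-vector invariant (lanes are 0x00
// or 0xFF), `any()` is "max lane != 0" and `all()` is "min lane == 0xFF".
fn any_via_max(lanes: [u8; 16]) -> bool {
    lanes.iter().copied().max().unwrap_or(0) != 0
}

fn all_via_min(lanes: [u8; 16]) -> bool {
    lanes.iter().copied().min().unwrap_or(0) == 0xFF
}
```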
> and considering all the cases above, it just looks like a stdsimd/rustc bug that we might be able to fix by just simd_casting from the boolean vectors into i1xN types and back before calling any llvm intrinsic on boolean vectors.

Are the semantics of LLVM's internal SIMD casts such that the cast itself trusts that the invariant (all bits in each lane being the same) holds, so that there's no need to generate code to look at all the bits?
If this can be fixed within the constraint of rustc currently using LLVM 6, such that `any()` and `all()` consistently generate no more SIMD instructions than `pmovmskb` when targeting SSE2, that's fine (and preferable!). If not, I think it's not OK to avoid doing the thing that the `simd` crate does and wait for the generic LLVM-level fix. (That is, for the purpose of getting Firefox off the `simd` crate and using `stdsimd`, I think the performance regression that currently results is not OK.)
Here is an update. I've sent a `rustc` PR (https://github.com/rust-lang/rust/pull/48983) and a `stdsimd` PR (#371) that improve the situation from:

```asm
; stdsimd_all_8x16
movdqa (%rdi), %xmm0
pshufd $78, %xmm0, %xmm1
pand %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0
pand %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrld $16, %xmm1
pand %xmm0, %xmm1
movdqa %xmm1, %xmm0
psrlw $8, %xmm0
pand %xmm1, %xmm0
movd %xmm0, %eax
testb %al, %al
setne %al
```

to this:

```asm
pand %xmm0, %xmm1 # xmm0 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
pcmpeqb %xmm0, %xmm1
pmovmskb %xmm1, %eax
xorl %ecx, %ecx
cmpl $65535, %eax # imm = 0xFFFF
movl $-1, %eax
cmovnel %ecx, %eax
testb $1, %al
```

which is still far from perfect, but way better than the shuffle sequence that was being generated before.
The `pand` with ones and the subsequent `pcmpeqb` instruction are a no-op if the input is a boolean vector, but I haven't been able to trick LLVM into removing them yet. I've filed an LLVM bug about this here: https://bugs.llvm.org/show_bug.cgi?id=36702. We'll see what happens with that. Maybe it's not a bug and I am just missing the obvious way of telling LLVM about the invariants of the input.
The `simd` crate only provides an implementation of `all()` for `bool8ix16` in the `sse2` module, which should not be available if `sse2` is not available. ~~However, compiling without `sse2` still produces a `pmovmskb`.~~

How can this even happen?

EDIT: It's a bug: the `simd` crate currently requires `sse2`.
If you fix the current compilation error and then try to compile without it (e.g. via `RUSTFLAGS="-C target-feature=-sse2"`), the crate does not compile because, among other things, `all()` for `bool8ix16` is missing.
Also, the workarounds in the `simd` crate should not work when compiling globally without `sse2` and using `#[target_feature]` to enable it locally. That is:

```rust
#[target_feature(enable = "sse2")]
pub unsafe fn foo(b: bool8ix16) -> bool {
    b.all()
}
```
won't be able to use `pmovmskb`, because `cfg(target_feature = "sse2")` inside the `simd` crate is `false`: the `simd` crate was compiled without enabling `sse2` globally.
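A small sketch of why this happens (illustrative only; `reduction_path` is a made-up name): `cfg!(target_feature = ...)` is resolved when the crate containing the code is compiled, not at the call site, so a caller's `#[target_feature]` annotation cannot flip a branch that was already baked in.

```rust
// This branch is decided when *this* crate is compiled; a caller
// annotated with #[target_feature(enable = "sse2")] cannot change it.
fn reduction_path() -> &'static str {
    if cfg!(target_feature = "sse2") {
        "pmovmskb path"
    } else {
        "generic path"
    }
}
```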
Therefore, we can't really work around these issues in either `rustc` or `stdsimd` without propagating `#[target_feature]` across function calls, such that the `cfg(target_feature)`s inside them are expanded again with updated feature sets and the functions recompiled.
EDIT: for `sse2` this does not matter much, because it is enabled by default on the `i686` and `x86_64` tier 1 targets, but `sse2` is just an example. This will matter for any other intrinsics for which LLVM might produce bad code.
For my use cases, I care about the 128-bit reductions being fast when everything is compiled with SSE2 enabled or when everything is compiled with NEON enabled.
> For my use cases, I care about the 128-bit reductions being fast when everything is compiled with SSE2 enabled or when everything is compiled with NEON enabled.
Sure, I am preparing a PR with workarounds for those. Just be aware that if you start caring about AVX/AVX2 and you can't compile everything with that enabled, then you are going to have to manually select which instructions to use until the upstream LLVM bugs are fixed, because right now we can't really work around these anywhere.
> if you start caring about AVX/AVX2 and you can't compile everything with that enabled

Probably not even that. Those using `std::simd` might need to actually recompile `std` with AVX2 enabled to benefit from this.
@hsivonen which ARM targets are you using? I need to check whether they have NEON enabled in rustc by default.
> which arm targets are you using?

Currently, I'm testing on `aarch64-unknown-linux-gnu`, `armv7-unknown-linux-gnueabihf`, and occasionally `armv7-unknown-linux-gnueabihf` with the `neon` feature added from the command line. The reason why I don't do the latter all the time is that the ARMv7+NEON support is broken in the `simd` crate and I've been hoping to switch to `stdsimd` instead of fixing that properly.

Mozilla ships with `armv7-linux-androideabi`, and additionally builds nightlies (but does not yet ship releases) with `aarch64-linux-android`.
> I need to check whether they have NEON enabled in rustc by default.

```console
$ rustc -Z unstable-options --target armv7-linux-androideabi --print target-spec-json | grep features
"features": "+v7,+thumb-mode,+thumb2,+vfp3,+d16,-neon",
$ rustc -Z unstable-options --target armv7-unknown-linux-gnueabihf --print target-spec-json | grep features
"features": "+v7,+vfp3,+d16,+thumb2,-neon",
```
C++ code in Firefox for ARMv7 Android is built with NEON unconditionally enabled, so I think Firefox would benefit from Rust re-introducing an ARMv7+NEON target both for Android and GNU/Linux (Android for shipping and GNU/Linux for testing).
@hsivonen so I am going to send a PR today that fixes this issue on master, and I want to ask you to try it by using the `stdsimd` master branch and letting me know if it works.

You won't get good code-gen using `std::simd` unless you recompile the standard library with NEON enabled on both the `armv7` and `aarch64` targets. That is, either you need to use `xargo`, or we would need to add two new targets, `{armv7,aarch64}neon-unknown-linux-gnu`, to rustc somehow (cc @alexcrichton what are our options here?).
Thank you for adding portable SIMD and boolean vectors!

In `stdsimd`, `any()` and `all()` are based on LLVM's cross-ISA bitwise reductions across the vector, so these operations fail to use the type system information that is the reason for their existence: the information that all the bits of each lane are the same. I.e. in `b8x16`, each lane is either 0 or 0xFF.

This means that looking at all the bits is unnecessary, since for each lane, examining any one bit is sufficient. In the SSE2 case, `_mm_movemask_epi8` takes one bit from each lane of a `u8x16`. It's what the `simd` crate uses.

I've uploaded a demo crate that exports uninlined versions of `any()` and `all()` on `u8x16` using both `simd` and `stdsimd`. The generated code for the `simd` crate versions is much simpler. Code generated using `stdsimd` should be as good.