hsivonen opened this issue 3 years ago
packed-simd uses specific arm instructions which we won't be able to do (since `target_feature` doesn't apply to `std`). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that it is smart enough to produce similar code.
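(For reference, a minimal sketch of the reduction pattern being described, written against the current nightly `core::simd` API rather than the stdsimd revision discussed here, and not the library's actual implementation: the mask is widened to 0/-1 integer lanes and then or-reduced, so nothing tells LLVM that the lanes can only ever be 0 or -1.)

```rust
#![feature(portable_simd)]
use core::simd::prelude::*;

// Sketch of the problematic shape: an integer "or" reduction over mask
// lanes that are always 0 or -1, with that invariant invisible to LLVM.
pub fn mask_any_via_or(m: Mask<i8, 16>) -> bool {
    // `to_int()` yields a Simd<i8, 16> of 0 / -1 lanes; `reduce_or` folds
    // them with bitwise OR, which is nonzero iff any lane was set.
    m.to_int().reduce_or() != 0
}
```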
Doesn't `target_feature` apply to `std` even when compiling `std`? I expected it to work when NEON availability is part of the target, as with `thumbv7neon-unknown-linux-gnueabihf`, and not as an application compile-time option.
Good point -- I missed the neon in the target. It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the `target_feature` attribute or compiler flag). I'm guessing other architectures likely see a similar codegen issue as well.
> packed-simd uses specific arm instructions which we won't be able to do (since `target_feature` doesn't apply to `std`). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that it is smart enough to produce similar code.
LLVM produces better code if truncating to `i1` first, but it's still way worse than packed-simd: https://llvm.godbolt.org/z/3v71P9Wdo
reduce_or:
vorr d16, d1, d0
vzip.8 d16, d17
vorr d16, d17, d16
vmov.u16 r0, d16[0]
vmov.32 d17[0], r0
vmov.u16 r0, d16[1]
vmov.32 d17[1], r0
vmov.u16 r0, d16[2]
vmov.32 d18[0], r0
vmov.u16 r0, d16[3]
vmov.32 d18[1], r0
vorr d16, d18, d17
vmov.32 r0, d16[0]
vmov.32 r1, d16[1]
orr r0, r1, r0
and r0, r0, #1
bx lr
> It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the `target_feature` attribute or compiler flag).

The `thumbv7neon-*` targets exist precisely to address this problem: to allow things to perform better than after-the-fact `+neon` by baking NEON in at `std` compile time.
We actually use the reduce-or intrinsic which does exceptionally poorly: https://llvm.godbolt.org/z/aEha7dh5b
> > It would work for this particular target, but not for any other arm target that doesn't have neon baked in (even if using the `target_feature` attribute or compiler flag).
>
> The `thumbv7neon-*` targets exist precisely to address this problem: to allow things to perform better than after-the-fact `+neon` by baking NEON in at `std` compile time.
Yes, it does, but that doesn't help for anyone who wants to runtime check for neon. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for `i1`. We could partially solve it by bypassing codegen with explicit neon intrinsics, but the subpar LLVM codegen is the root of the problem.
> Yes, it does, but that doesn't help for anyone who wants to runtime check for neon. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for `i1`.
Ideally, yes, but in the interim, it would be great to have a `core_simd`-level conditionally-compiled implementation that produces the same instructions as the same reduction in `simd` and in `packed_simd`.
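(As a concrete illustration of what such a conditionally-compiled path could look like on AArch64 -- a hedged sketch with a made-up function name, not proposed library code -- a single horizontal-max intrinsic collapses the whole mask register:)

```rust
// Hypothetical target-specific override, sketch only. On AArch64, the
// horizontal unsigned max reduces all 16 mask lanes in one instruction,
// instead of the scalarized sequence the generic reduction produces.
#[cfg(target_arch = "aarch64")]
fn mask8x16_any(mask_bytes: core::arch::aarch64::uint8x16_t) -> bool {
    // Every lane is 0x00 or 0xff, so the max over all lanes is nonzero
    // iff at least one lane is set.
    unsafe { core::arch::aarch64::vmaxvq_u8(mask_bytes) != 0 }
}
```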
That's a possibility, but I'm concerned about special-casing every codegen issue for every target (this is certainly not the only one) and having a heap of target-specific code getting in the way of making stdsimd actually portable and stable. At a minimum we should also report this to LLVM (it may be relevant to more targets than just armv7).
I agree that reporting to LLVM makes sense and that it's problematic to have a lot of conditional target-specific code. However, another point of view is that if the application programmer can't trust that `core::simd` compiles to the best instructions, the value proposition for portable SIMD in general suffers. I think it's less bad to put the target-specific code in `core::simd` than e.g. in `encoding_rs`.
I would argue that even with the sometimes-enabled target specific code, it still can't be trusted to compile to the best instructions. Regardless, I opened https://bugs.llvm.org/show_bug.cgi?id=51122. It appears there was a similar issue on x86-64 a couple years ago, but that was fixed.
I was able to fix the codegen on AArch64 by implementing `any` as `self.to_int().horizontal_min() == -1`, but this still fails on ARM and produces:
reduce_or:
push {r4, r5, r6, r7, r11, lr}
vld1.64 {d16, d17}, [r0]
vmin.s8 d16, d16, d17
vmov.u8 r0, d16[0]
vmov.u8 r1, d16[1]
vmov.u8 r2, d16[2]
vmov.u8 r3, d16[3]
vmov.u8 r12, d16[4]
vmov.u8 lr, d16[5]
vmov.u8 r4, d16[6]
vmov.u8 r5, d16[7]
lsl r0, r0, #24
lsl r1, r1, #24
asr r6, r0, #24
asr r7, r1, #24
cmp r6, r1, asr, #24
asrlt r7, r0, #24
lsl r0, r2, #24
cmp r7, r0, asr, #24
asrge r7, r0, #24
lsl r0, r3, #24
cmp r7, r0, asr, #24
asrge r7, r0, #24
lsl r0, r12, #24
cmp r7, r0, asr, #24
asrge r7, r0, #24
lsl r0, lr, #24
cmp r7, r0, asr, #24
asrge r7, r0, #24
lsl r0, r4, #24
cmp r7, r0, asr, #24
asrge r7, r0, #24
lsl r0, r5, #24
cmp r7, r0, asr, #24
asrge r7, r0, #24
and r0, r7, #255
sub r0, r0, #255
clz r0, r0
lsr r0, r0, #5
pop {r4, r5, r6, r7, r11, pc}
LLVM does not seem to be aware of `vpmin` at all.
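(For comparison, the kind of pairwise-reduction `vpmin` sequence being referred to -- a hand-written sketch, not packed-simd's actual source; on 32-bit Arm the `core::arch::arm` NEON intrinsics are still unstable, so this assumes nightly:)

```rust
// Hypothetical Armv7 NEON `any` for a 16-lane i8 mask, built from pairwise
// signed minimums. Lanes are only ever 0 or -1, so the minimum across all
// lanes is -1 iff at least one lane is set.
#[cfg(all(target_arch = "arm", target_feature = "neon"))]
fn mask8x16_any(mask: core::arch::arm::int8x16_t) -> bool {
    use core::arch::arm::*;
    unsafe {
        let m = vpmin_s8(vget_low_s8(mask), vget_high_s8(mask)); // 16 -> 8 lanes
        let m = vpmin_s8(m, m); // 8 -> 4 meaningful lanes
        let m = vpmin_s8(m, m); // 4 -> 2 meaningful lanes
        let m = vpmin_s8(m, m); // 2 -> 1 meaningful lane
        vget_lane_s8::<0>(m) == -1
    }
}
```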
We changed some stuff around and LLVM is now LLVM 13. No action on the bug, but is this still an issue today?
I just checked godbolt and yes, it's still an issue (for both armv7 and aarch64).
We've been concerned about this kind of thing for a while, it's... part of what the whole mask representation arguments were about near the start (and this looks somewhat like my examples of LLVM doing bad things from back then 😬).
While I'm unsure we need to be above a target-specific fix (libcore even has these in the scalar numerics, after all), it sounds like this is not just an arm problem... (the indication I'm getting is that this is bad on several targets)
Boolean reductions are a super common operation. Do... we actually have a feeling this is fixable? (Perhaps with a new intrinsic which is UB to use if the input is not a mask type?)
The issue doesn't really have anything to do with our mask representation, but LLVM not lowering well. The x86 backend has special logic for boolean reductions, and either that logic needs to be made generic, or ported to all targets.
The upstream of this issue is now tracked at https://github.com/llvm/llvm-project/issues/50466
Hrm. I think I would like to know slightly more about how our highly generic code gets included in crates from std, if at all, since we mark as many things `#[inline]` as possible, before addressing this. The code generated is obviously bad; I am just wary about going off only one test example. And one alternative we haven't discussed is the possibility of tweaking the LLVM IR we emit more directly.
If we did solve it here, we might have to do so as a specialization on a length...
Mmm, I did some investigation and I have begun to doubt that it's worth fixing this in `std::simd` before codegen, precisely because this IS a super common operation.
You see, miri attempts to run std's tests, so miri needs to be able to manage what we do in tests. We could do an end-run around miri and give more explicit directions in LLVM lowering, but we shouldn't really bother at the library level. And then whatever fix we implemented by manipulating LLVMIR would be better off upstreamed.
FWIW, this wouldn't be the first code in the stdlib that does #[cfg(miri)] to perform an optimization that miri doesn't support (or even does support, except under stricter MIRIFLAGS).
It's also not really that much worse than tweaks made in other places in libcore -- for example, https://github.com/rust-lang/rust/blob/master/library/core/src/num/uint_macros.rs#L1645-L1657 was tuned to match what some specific LLVM vectorization pattern looks for.
IOW I'm not sure there's any reason to favor purity over practicality here... People expect the stdlib to contain the gross hacks so they don't need to, especially for common operations (but also less common ones that are big enough wins, like with that abs_diff).
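(A made-up example of the `#[cfg(miri)]` pattern being referred to -- not code from libcore: keep a straightforward implementation that Miri can execute, and a hand-tuned path for ordinary builds.)

```rust
// Illustrative only: Miri gets the simple loop, everything else gets the
// stand-in for a hand-optimized or target-specific path.
pub fn any_set(lanes: [i8; 16]) -> bool {
    #[cfg(miri)]
    {
        lanes.iter().any(|&l| l != 0)
    }
    #[cfg(not(miri))]
    {
        lanes.iter().fold(0u8, |acc, &l| acc | l as u8) != 0
    }
}
```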
The `abs_diff` example there doesn't seem to be related at all, as it doesn't use cfg.
I am not exactly clear on why I would be "favoring purity over practicality" when I suggest directly altering the LLVMIR lowering of the intrinsic on some targets in order to get the desired results when it genuinely has begun to seem simpler than a mash of cfg at the library level which complicates life for everyone who lives upstream of LLVM's competence or lack thereof. I was not aware "tweak the LLVMIR conditionally in response to some compilation states" was a pure operation. It seems rather stateful to me. Is the compiler a monad?
The problem is that most of the solutions I have looked at seem to depend intensely on foreknowledge of the exact types and lengths involved, or on language-level capabilities like specialization, or on exhaustively expressing almost identical functions for every possible composition of types and lengths. This quickly risks exponential blowup in pages of code for one improvement.
Perhaps I made it sound like I intend to defer a solution indefinitely? On the contrary, I arrived at this conclusion after learning considerably more about how to generate LLVMIR and concluding that directly complicating one of the `simd_reduce` intrinsics would be shorter and more maintainable.
Well, the `abs_diff` change is an x86-specific thing, and plausibly should have been `cfg`'d that way.
But more broadly: Fair enough -- Historically "this is an LLVM codegen bug" tends to imply things won't get fixed until an LLVM dev cares enough to fix it on their end.
I mean, LLVM should fix it, but on our end, we can emit a sequence of `llvm.aarch64.neon.{s,u}minp.v{4,8,16}.i{8,16,32}` intrinsics when we recognize that the compilation target is `aarch64` and we have an applicable vector. Likewise with Armv7 if Neon is enabled, etc.
Someone else will probably have to upstream it, though. I can program in JavaScript, Rust, x86 Assembly, and LLVMIR, not C++.
As always, compile time detection is only somewhat useful as many applications want runtime detection.
That's true. I'll take the partial win, though.
I fixed this in https://github.com/llvm/llvm-project/commit/71dc3de533b9247223c083a3b058859c9759099c via rust-lang/rust#114048, but oddly enough, the optimization only occurs without `-Copt-level=3`.
It looks like the optimizer is converting the reductions to a bitcast and compare, so to get it to work you'll need to special-case that too in instruction selection (or wherever else in the Arm backend).
Ah, you're right, I thought I made that change too but it was only for wasm.
> I fixed this in llvm/llvm-project@71dc3de via rust-lang/rust#114048 but oddly enough, the optimization only occurs without `-Copt-level=3`

Looks like the bad codegen also occurs with lower opt-levels.
Since `packed_simd` doesn't work with `rustc` anymore, I need to migrate `encoding_rs` in Firefox to `core::simd`. To enable that without regressing performance, I published a workaround crate for this bug. (It's great that, apart from this bug, `core::simd` works. Thank you!)
Using `cargo build --release --target thumbv7neon-unknown-linux-gnueabihf`:

With `packed_simd_2`, this:

compiles to this:

With `core_simd`, this:

compiles to this:

Additional info

This seriously regresses performance (`packed_simd_2` vs. `core_simd`) for `encoding_rs`. Previously when migrating from `simd` to `packed_simd`.

Meta

`rustc --version --verbose`:

`stdsimd` rev 715f9ac4e36ee303c3d464121ebb65df8f92416e