rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust

Bad codegen for boolean reductions on thumbv7neon #146

Open hsivonen opened 3 years ago

hsivonen commented 3 years ago

Using cargo build --release --target thumbv7neon-unknown-linux-gnueabihf:

With packed_simd_2, this:

use packed_simd_2::m8x16;

#[no_mangle]
#[inline(never)]
fn packed_any(s: m8x16) -> bool {
    s.any()
}

compiles to this:

00003e66 <packed_any>:
    3e66:       f960 0acf       vld1.64 {d16-d17}, [r0]
    3e6a:       ff40 0aa1       vpmax.u8        d16, d16, d17
    3e6e:       ec51 0b30       vmov    r0, r1, d16
    3e72:       4308            orrs    r0, r1
    3e74:       bf18            it      ne
    3e76:       2001            movne   r0, #1
    3e78:       4770            bx      lr
        ...

With core_simd, this:

use core_simd::mask8x16;

#[no_mangle]
#[inline(never)]
fn core_any(s: mask8x16) -> bool {
    s.any()
}

compiles to this:

00003e66 <core_any>:
    3e66:       b5f0            push    {r4, r5, r6, r7, lr}
    3e68:       f960 0acf       vld1.64 {d16-d17}, [r0]
    3e6c:       eed0 0bb0       vmov.u8 r0, d16[1]
    3e70:       eed0 1b90       vmov.u8 r1, d16[0]
    3e74:       eed0 2bd0       vmov.u8 r2, d16[2]
    3e78:       eed0 3bf0       vmov.u8 r3, d16[3]
    3e7c:       eef0 cb90       vmov.u8 ip, d16[4]
    3e80:       eef0 ebb0       vmov.u8 lr, d16[5]
    3e84:       eef0 4bd0       vmov.u8 r4, d16[6]
    3e88:       eef0 7bf0       vmov.u8 r7, d16[7]
    3e8c:       eed1 5bf0       vmov.u8 r5, d17[3]
    3e90:       eef1 6b90       vmov.u8 r6, d17[4]
    3e94:       4308            orrs    r0, r1
    3e96:       eed1 1b90       vmov.u8 r1, d17[0]
    3e9a:       4310            orrs    r0, r2
    3e9c:       eed1 2bb0       vmov.u8 r2, d17[1]
    3ea0:       4318            orrs    r0, r3
    3ea2:       eed1 3bd0       vmov.u8 r3, d17[2]
    3ea6:       ea40 000c       orr.w   r0, r0, ip
    3eaa:       ea40 000e       orr.w   r0, r0, lr
    3eae:       4320            orrs    r0, r4
    3eb0:       eef1 4bb0       vmov.u8 r4, d17[5]
    3eb4:       4338            orrs    r0, r7
    3eb6:       eef1 7bd0       vmov.u8 r7, d17[6]
    3eba:       4308            orrs    r0, r1
    3ebc:       eef1 1bf0       vmov.u8 r1, d17[7]
    3ec0:       4310            orrs    r0, r2
    3ec2:       4318            orrs    r0, r3
    3ec4:       4328            orrs    r0, r5
    3ec6:       4330            orrs    r0, r6
    3ec8:       4320            orrs    r0, r4
    3eca:       4338            orrs    r0, r7
    3ecc:       4308            orrs    r0, r1
    3ece:       f000 0001       and.w   r0, r0, #1
    3ed2:       bdf0            pop     {r4, r5, r6, r7, pc}

Additional info

This seriously regresses performance for encoding_rs when moving from packed_simd_2 to core_simd.

A similar regression came up previously, when migrating from the simd crate to packed_simd.

Meta

rustc --version --verbose:

rustc 1.55.0-nightly (b1f8e27b7 2021-07-15)
binary: rustc
commit-hash: b1f8e27b74c541d3d555149c8efa4bfe9385cd56
commit-date: 2021-07-15
host: armv7-unknown-linux-gnueabihf
release: 1.55.0-nightly
LLVM version: 12.0.1

stdsimd rev 715f9ac4e36ee303c3d464121ebb65df8f92416e

calebzulawski commented 3 years ago

packed-simd uses specific ARM instructions, which we won't be able to do (since target_feature doesn't apply to std). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that LLVM is smart enough to produce similar code.
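
To make the missing invariant concrete, here is a scalar sketch (illustrative only; any_from_mask_lanes is not real library code): every mask lane is all-zeros or all-ones, so a single bit per lane decides the answer, which is exactly what truncating to i1 before the reduction would let LLVM exploit.

// Scalar model of the invariant LLVM cannot currently see: each mask lane
// is 0 or -1, so testing one bit per lane (the sign bit) is equivalent to
// OR-ing together all eight bits of every lane.
fn any_from_mask_lanes(lanes: [i8; 16]) -> bool {
    debug_assert!(lanes.iter().all(|&x| x == 0 || x == -1));
    lanes.iter().any(|&x| x < 0)
}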

hsivonen commented 3 years ago

Doesn't target_feature apply to std even when compiling std? I expected it to work when NEON availability is part of the target definition, as with thumbv7neon-unknown-linux-gnueabihf, rather than an application compile-time option.

calebzulawski commented 3 years ago

Good point -- I missed the neon in the target. It would work for this particular target, but not for any other ARM target that doesn't have NEON baked in (even if using the target_feature attribute or compiler flag). I'm guessing other architectures see similar codegen issues as well.

programmerjake commented 3 years ago

packed-simd uses specific ARM instructions, which we won't be able to do (since target_feature doesn't apply to std). stdsimd uses an integer "or" reduction, so unfortunately LLVM doesn't know that the vector must contain only 0 or -1. Perhaps we can add an intrinsic that does a truncation before the reduction and hope that LLVM is smart enough to produce similar code.

LLVM produces better code if truncating to i1 first, but it's still way worse than packed-simd: https://llvm.godbolt.org/z/3v71P9Wdo

reduce_or:
        vorr    d16, d1, d0
        vzip.8  d16, d17
        vorr    d16, d17, d16
        vmov.u16        r0, d16[0]
        vmov.32 d17[0], r0
        vmov.u16        r0, d16[1]
        vmov.32 d17[1], r0
        vmov.u16        r0, d16[2]
        vmov.32 d18[0], r0
        vmov.u16        r0, d16[3]
        vmov.32 d18[1], r0
        vorr    d16, d18, d17
        vmov.32 r0, d16[0]
        vmov.32 r1, d16[1]
        orr     r0, r1, r0
        and     r0, r0, #1
        bx      lr

hsivonen commented 3 years ago

It would work for this particular target, but not for any other ARM target that doesn't have NEON baked in (even if using the target_feature attribute or compiler flag).

The thumbv7neon-* targets exist precisely to address this problem: to allow code to perform better than with an after-the-fact +neon by baking NEON in when std is compiled.

calebzulawski commented 3 years ago

We actually use the reduce-or intrinsic, which does exceptionally poorly: https://llvm.godbolt.org/z/aEha7dh5b

It would work for this particular target, but not for any other ARM target that doesn't have NEON baked in (even if using the target_feature attribute or compiler flag).

The thumbv7neon-* targets exist precisely to address this problem: to allow code to perform better than with an after-the-fact +neon by baking NEON in when std is compiled.

Yes, it does, but that doesn't help anyone who wants to check for NEON at runtime. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for i1. We could partially solve it by bypassing codegen with explicit NEON intrinsics, but the subpar LLVM codegen is the root of the problem.
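
For context, the runtime-check path looks like this sketch, shown with the stable aarch64 macro and intrinsics (NEON is mandatory on aarch64, so the check is purely illustrative there; 32-bit ARM needs the still-unstable std::arch::is_arm_feature_detected!, and any_nonzero/neon_path are illustrative names):

// Runtime dispatch: detect NEON once, then pick either a hand-written
// intrinsic path or a portable fallback.
#[cfg(target_arch = "aarch64")]
fn any_nonzero(data: &[u8; 16]) -> bool {
    #[target_feature(enable = "neon")]
    unsafe fn neon_path(data: &[u8; 16]) -> bool {
        use std::arch::aarch64::{vld1q_u8, vmaxvq_u8};
        // The horizontal max across all 16 lanes is nonzero iff any lane is.
        vmaxvq_u8(vld1q_u8(data.as_ptr())) != 0
    }
    if std::arch::is_aarch64_feature_detected!("neon") {
        // SAFETY: NEON availability was verified at runtime just above.
        unsafe { neon_path(data) }
    } else {
        data.iter().any(|&b| b != 0)
    }
}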

hsivonen commented 3 years ago

Yes, it does, but that doesn't help anyone who wants to check for NEON at runtime. It looks to me like this is a case that should be built into the LLVM reduce-or intrinsic for i1.

Ideally, yes, but in the interim it would be great to have a conditionally compiled implementation at the core_simd level that produces the same instructions as the equivalent reduction in simd and packed_simd.

calebzulawski commented 3 years ago

That's a possibility, but I'm concerned about special-casing every codegen issue for every target (this is certainly not the only one) and ending up with a heap of target-specific code that gets in the way of making stdsimd actually portable and stable. At a minimum, we should also report this to LLVM (it may be relevant to more targets than just armv7).

hsivonen commented 3 years ago

I agree that reporting to LLVM makes sense and that it's problematic to have a lot of conditional target-specific code. However, another point of view is that if the application programmer can't trust that core::simd compiles to the best instructions, the value proposition for portable SIMD in general suffers. I think it's less bad to put the target-specific code in core::simd than e.g. in encoding_rs.

calebzulawski commented 3 years ago

I would argue that even with the sometimes-enabled target-specific code, it still can't be trusted to compile to the best instructions. Regardless, I opened https://bugs.llvm.org/show_bug.cgi?id=51122. It appears there was a similar issue on x86-64 a couple of years ago, but that was fixed.

calebzulawski commented 3 years ago

I was able to fix the codegen on AArch64 by implementing any as self.to_int().horizontal_min() == -1, but this still fails on ARM and produces:

reduce_or:
 push    {r4, r5, r6, r7, r11, lr}
 vld1.64 {d16, d17}, [r0]
 vmin.s8 d16, d16, d17
 vmov.u8 r0, d16[0]
 vmov.u8 r1, d16[1]
 vmov.u8 r2, d16[2]
 vmov.u8 r3, d16[3]
 vmov.u8 r12, d16[4]
 vmov.u8 lr, d16[5]
 vmov.u8 r4, d16[6]
 vmov.u8 r5, d16[7]
 lsl     r0, r0, #24
 lsl     r1, r1, #24
 asr     r6, r0, #24
 asr     r7, r1, #24
 cmp     r6, r1, asr, #24
 asrlt   r7, r0, #24
 lsl     r0, r2, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r3, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r12, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, lr, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r4, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 lsl     r0, r5, #24
 cmp     r7, r0, asr, #24
 asrge   r7, r0, #24
 and     r0, r7, #255
 sub     r0, r0, #255
 clz     r0, r0
 lsr     r0, r0, #5
 pop     {r4, r5, r6, r7, r11, pc}

LLVM does not seem to be aware of vpmin at all.
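
For reference, the alternative lowering described above can be written against today's std::simd naming like this (a nightly-only sketch; any_via_min is an illustrative name, and horizontal_min has since been renamed reduce_min):

#![feature(portable_simd)]
use std::simd::{mask8x16, num::SimdInt};

// any() as a horizontal-min reduction: mask lanes are 0 or -1, so the
// minimum over all lanes is -1 exactly when at least one lane is set.
fn any_via_min(m: mask8x16) -> bool {
    m.to_int().reduce_min() == -1
}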

workingjubilee commented 3 years ago

We changed some stuff around and LLVM is now LLVM 13. No action on the bug, but is this still an issue today?

calebzulawski commented 2 years ago

I just checked godbolt and yes, it's still an issue (for both armv7 and aarch64).

thomcc commented 2 years ago

We've been concerned about this kind of thing for a while, it's... part of what the whole mask representation arguments were about near the start (and this looks somewhat like my examples of LLVM doing bad things from back then 😬).

While I'm unsure we need to be above a target-specific fix (libcore even has these in the scalar numerics, after all), it sounds like this is not just an ARM problem... (the indication I'm getting is that this is bad on several targets).

Boolean reductions are a super common operation. Do... we actually have a feeling this is fixable? (Perhaps with a new intrinsic which is UB to use if the input is not a mask type?)

calebzulawski commented 2 years ago

The issue doesn't really have anything to do with our mask representation, but LLVM not lowering well. The x86 backend has special logic for boolean reductions, and either that logic needs to be made generic, or ported to all targets.

workingjubilee commented 2 years ago

The upstream of this issue is now tracked at https://github.com/llvm/llvm-project/issues/50466

workingjubilee commented 2 years ago

Hrm. Before addressing this, I think I would like to know slightly more about how our highly generic code gets included in crates from std, if at all, since we mark as many things #[inline] as possible. The code generated is obviously bad; I am just wary of going off only one test example. And one alternative we haven't discussed is the possibility of tweaking the LLVM IR we emit more directly.

If we did solve it here, we might have to do so as a specialization on a length...

https://rustc.godbolt.org/z/9sEo9GPWY

workingjubilee commented 2 years ago

Mmm, I did some investigation and I have begun to doubt that it's worth fixing this in std::simd before codegen, precisely because this IS a super common operation.

You see, miri attempts to run std's tests, so miri needs to be able to manage whatever we do in tests. We could do an end-run around miri by giving more explicit directions during LLVM lowering rather than bothering at the library level, and whatever fix we implemented by manipulating LLVMIR would be better off upstreamed anyway.

thomcc commented 2 years ago

FWIW, this wouldn't be the first code in the stdlib that does #[cfg(miri)] to perform an optimization that miri doesn't support (or even does support, except under stricter MIRIFLAGS).
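
Illustratively, such an escape hatch could be shaped like this (a hypothetical sketch using nightly's std::simd; mask_any is an invented name, not actual library code):

use std::simd::mask8x16; // nightly: #![feature(portable_simd)]

// Under miri, run the obvious portable loop; in native builds, the
// hand-tuned target-specific path would go in the other branch.
#[cfg(miri)]
fn mask_any(m: mask8x16) -> bool {
    m.to_array().iter().any(|&lane| lane)
}

#[cfg(not(miri))]
fn mask_any(m: mask8x16) -> bool {
    m.any() // placeholder for a target-specific fast path (e.g. NEON vpmax)
}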

It's also not really that much worse than tweaks made in other places in libcore -- for example, https://github.com/rust-lang/rust/blob/master/library/core/src/num/uint_macros.rs#L1645-L1657 was tuned to match what some specific LLVM vectorization pattern looks for.

IOW I'm not sure there's any reason to favor purity over practicality here... People expect the stdlib to contain the gross hacks so they don't need to, especially for common operations (but also less common ones that are big enough wins, like with that abs_diff).

workingjubilee commented 2 years ago

The abs_diff example there doesn't seem to be related at all as it doesn't use cfg.

I am not exactly clear on why I would be "favoring purity over practicality" when I suggest directly altering the LLVMIR lowering of the intrinsic on some targets in order to get the desired results when it genuinely has begun to seem simpler than a mash of cfg at the library level which complicates life for everyone who lives upstream of LLVM's competence or lack thereof. I was not aware "tweak the LLVMIR conditionally in response to some compilation states" was a pure operation. It seems rather stateful to me. Is the compiler a monad?

The problem is that most of the solutions I have looked at seem to depend intensely on foreknowledge of the exact types and lengths involved, or on language-level capabilities like specialization, or on exhaustively expressing almost identical functions for every possible composition of types and lengths. This quickly risks exponential blowup in pages of code for one improvement.

Perhaps I made it sound like I intend to defer a solution indefinitely? On the contrary: I arrived at this conclusion after learning considerably more about how to generate LLVMIR, and concluding that directly complicating one of the simd_reduce intrinsics would be shorter and more maintainable.

thomcc commented 2 years ago

Well, the abs_diff is an x86-specific thing, and plausibly should have been cfged that way.

But more broadly: fair enough -- historically, "this is an LLVM codegen bug" tends to imply things won't get fixed until an LLVM dev cares enough to fix it on their end.

workingjubilee commented 2 years ago

I mean, LLVM should fix it, but on our end, we can emit a sequence of llvm.aarch64.neon.{s,u}minp.v{4,8,16}.i{8,16,32} intrinsics when we recognize that the compilation target is aarch64 and we have an applicable vector. Likewise with Armv7 if Neon is enabled, etc.

Someone else will probably have to upstream it, though. I can program in JavaScript, Rust, x86 Assembly, and LLVMIR, not C++.

calebzulawski commented 2 years ago

As always, compile-time detection is only somewhat useful, as many applications want runtime detection.

workingjubilee commented 2 years ago

That's true. I'll take the partial win, though.

calebzulawski commented 1 year ago

I fixed this in https://github.com/llvm/llvm-project/commit/71dc3de533b9247223c083a3b058859c9759099c via rust-lang/rust#114048, but oddly enough, the optimization only occurs without -Copt-level=3.

https://rustc.godbolt.org/z/a7njqsocr

programmerjake commented 1 year ago

https://rustc.godbolt.org/z/a7njqsocr

It looks like the optimizer is converting the reductions to a bitcast and compare, so to get it to work you'll need to special-case that too in instruction selection (or wherever else in the ARM backend).

calebzulawski commented 1 year ago

Ah, you're right, I thought I made that change too but it was only for wasm.

hsivonen commented 7 months ago

I fixed this in llvm/llvm-project@71dc3de via rust-lang/rust#114048, but oddly enough, the optimization only occurs without -Copt-level=3

https://rustc.godbolt.org/z/a7njqsocr

Looks like the bad codegen also occurs with lower opt-levels.

hsivonen commented 7 months ago

Since packed_simd doesn't work with rustc anymore, I need to migrate encoding_rs in Firefox to core::simd. To enable that without regressing performance, I published a workaround crate for this bug. (It's great that, apart from this bug, core::simd works. Thank you!)