rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0

Miscompilation of `vbsl*_*` NEON intrinsics #1191

Closed: l0calh05t closed this issue 2 years ago

l0calh05t commented 3 years ago

I tried this code:

#![feature(stdsimd)]

use std::arch::arm::*;
pub fn copysign(from: float32x4_t, to: float32x4_t) -> float32x4_t {
    unsafe {
        let mask = vdupq_n_u32(0x8000_0000);
        vbslq_f32(mask, from, to)
    }
}

using this Cargo configuration

[build]
target = "armv7-unknown-linux-gnueabihf"

[target.armv7-unknown-linux-gnueabihf]
linker = "arm-linux-gnueabihf-gcc"
rustflags = ["-Ctarget-feature=+neon"]

I expected to see this happen: the generated code includes a vbsl/vbit/vbif instruction, i.e., something like Clang's output for an equivalent C function:

copysign(__simd128_float32_t, __simd128_float32_t):
        vmov.i32        q8, #0x80000000
        vbif    q0, q1, q8
        bx      lr

Instead, this happened: the function is optimized down to simply returning `to`:

00000000 <neon_test::copysign>:
   0:   f4620acf        vld1.64 {d16-d17}, [r2]
   4:   f4400acf        vst1.64 {d16-d17}, [r0]
   8:   e12fff1e        bx      lr

We discussed this issue on Zulip, and it appears that all NEON vbsl*_* intrinsics are implemented using simd_select which does lane selection instead of bitwise selection. The issue affects both aarch64 and armv7 targets.
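
For context, a rough per-lane sketch of the two semantics (illustrative code, not stdarch's implementation):

// Bitwise select: what vbsl*_* is documented to do. Each *bit* of the
// result comes from `a` where the mask bit is set, otherwise from `b`.
fn bit_select(mask: u32, a: u32, b: u32) -> u32 {
    (mask & a) | (!mask & b)
}

// Lane select: what simd_select does. The *whole lane* comes from `a` or
// `b`, depending on whether the mask lane is treated as true.
fn lane_select(mask_is_true: bool, a: u32, b: u32) -> u32 {
    if mask_is_true { a } else { b }
}

With the 0x8000_0000 mask in the example above, the lane-select interpretation ends up choosing `to` for every lane, which matches the generated code shown above.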

Meta

rustc --version --verbose:

rustc 1.56.0-nightly (1f0a591b3 2021-07-30)
binary: rustc
commit-hash: 1f0a591b3a5963a0ab11a35dc525ad9d46f612e4
commit-date: 2021-07-30
host: x86_64-pc-windows-msvc
release: 1.56.0-nightly
LLVM version: 12.0.1
Backtrace

```
PS D:\development\neon-test> $env:RUST_BACKTRACE="1"
PS D:\development\neon-test> cargo build --release
   Compiling neon-test v0.1.0 (D:\development\neon-test)
    Finished release [optimized] target(s) in 0.65s
```

SparrowLii commented 3 years ago

vbsl should be reimplemented. According to its implementation for aarch64 and arm in Clang, it should use functions such as simd_and and simd_xor instead of simd_select (for arm, that is the same as linking to llvm.arm.neon.vbsl). Here is the result of copysign after being reimplemented.

l0calh05t commented 3 years ago

I agree, but why the XOR with u32::MAX instead of a not, or following ARM's docs and using operand1 EOR ((operand1 EOR operand4) AND operand3), which also optimized to the expected code when I used it as a workaround in my code?

Edit: OK, it's simply what Clang does on aarch64, although I remain unclear on their reasoning.
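
For reference, here is roughly what that workaround looks like with the public NEON intrinsics (my own rendering of the formula, not code from this thread):

#![feature(stdsimd)]

use std::arch::arm::*;

pub fn copysign_workaround(from: float32x4_t, to: float32x4_t) -> float32x4_t {
    unsafe {
        let mask = vdupq_n_u32(0x8000_0000);
        let from_bits = vreinterpretq_u32_f32(from);
        let to_bits = vreinterpretq_u32_f32(to);
        // to ^ ((to ^ from) & mask): sign bit from `from`, everything else from `to`.
        let bits = veorq_u32(to_bits, vandq_u32(veorq_u32(to_bits, from_bits), mask));
        vreinterpretq_f32_u32(bits)
    }
}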

corecode commented 2 years ago

What is the reason to implement the intrinsic via the simd_* functions and not via a direct LLVM intrinsic? At least that's what I expected to happen. With the current vbsl implementation, my code runs ~11% slower than when using vbsl directly (via inline assembly).

What code do I need to submit to fix this issue?

I'm also surprised that the assert_instr(bsl) passed through CI.

SparrowLii commented 2 years ago

Mainly to be consistent with Clang's behavior, which takes full advantage of LLVM's automatic vector optimizations.

What code do I need to submit to fix this issue?

We should modify the implementation of vbsl* in core_arch/src/arm_shared/neon/mod.rs and core_arch/src/aarch64/neon/mod.rs. It can be replaced with a combination of platform intrinsics like simd_and and simd_xor, or we can link directly to the LLVM intrinsics using #[link_name = "llvm.*"].
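
A rough sketch of the link_name option for the arm side (the type-suffixed intrinsic name and the exact signature are my assumptions, following the pattern used elsewhere in the crate):

// Intended to live in core_arch/src/arm_shared/neon/mod.rs, next to the
// existing definitions; the `.v8i8` suffix and the signature are guesses.
#[allow(improper_ctypes)]
extern "unadjusted" {
    #[link_name = "llvm.arm.neon.vbsl.v8i8"]
    fn vbsl_s8_(a: int8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t;
}

pub unsafe fn vbsl_s8(a: uint8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t {
    vbsl_s8_(transmute(a), b, c)
}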

corecode commented 2 years ago

Which one is preferred? How can I make sure that the assert_instr tests actually run?

Amanieu commented 2 years ago

The priority here is to match Clang's behavior and generate the same LLVM IR. The assert_instr tests are mainly used as a sanity check, not a hard requirement. This is because the intrinsics don't actually guarantee any specific instruction, you have to use inline assembly if you depend on that.
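
For completeness, a minimal sketch of the inline-assembly route on aarch64 that was mentioned above; this is my own code (assuming a toolchain where asm! is available for aarch64) and not part of stdarch:

use core::arch::aarch64::{float32x4_t, uint32x4_t};
use core::arch::asm;
use core::mem::transmute;

pub unsafe fn vbslq_f32_asm(mask: uint32x4_t, a: float32x4_t, b: float32x4_t) -> float32x4_t {
    // BSL overwrites the register that initially holds the mask with the result.
    let mut result: float32x4_t = transmute(mask);
    asm!(
        "bsl {r:v}.16b, {a:v}.16b, {b:v}.16b",
        r = inout(vreg) result,
        a = in(vreg) a,
        b = in(vreg) b,
        options(pure, nomem, nostack)
    );
    result
}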

corecode commented 2 years ago

I realized why the codegen checks did not fail:

The generated code contains the required instruction, but it also contains additional instructions that we don't want, because they change the semantics of the intrinsic: the shl/cmlt pair reduces each mask lane to its least significant bit and broadcasts it across the lane, i.e. lane selection rather than bit selection:

0000000100086f2c <_stdarch_test_shim_vbslq_f32_bsl>:
100086f2c: 00 54 3f 4f  shl.4s  v0, v0, #31
100086f30: 00 a8 a0 4e  cmlt.4s v0, v0, #0
100086f34: 20 1c 62 6e  bsl.16b v0, v1, v2
100086f38: c0 03 5f d6  ret

Should we add a negative assertion option ("does not contain shl"), or should we just add a test for the correct behavior?

corecode commented 2 years ago

What do you think about writing the vbsl* intrinsics this way:

pub unsafe fn vbsl_s8(a: uint8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t {
    let not = int8x8_t(-1, -1, -1, -1, -1, -1, -1, -1);
    // (a & b) | (!a & c), i.e. a true bitwise select
    transmute(simd_or(
        simd_and(a, transmute(b)),
        simd_and(simd_xor(a, transmute(not)), transmute(c)),
    ))
}

This matches the clang codegen.

Currently the unit tests only use ::MAX and ::MIN for the mask and therefore do not distinguish between a lane select and a true bit select. How should we change the tests? Using other integers that do or do not have the lane-select bit set does reveal the current bug, but looks a bit ad hoc:

    #[simd_test(enable = "neon")]
    unsafe fn test_vbsl_s16() {
        let a = u16x4::new(u16::MAX, 0, 1, 2);
        let b = i16x4::new(i16::MAX, i16::MAX, i16::MAX, i16::MAX);
        let c = i16x4::new(i16::MIN, i16::MIN, i16::MIN, i16::MIN);
        let e = i16x4::new(i16::MAX, i16::MIN, i16::MIN | 1, i16::MIN | 2);
        let r: i16x4 = transmute(vbsl_s16(transmute(a), transmute(b), transmute(c)));
        assert_eq!(r, e);
    }

Testing float bit selects seems difficult in any case.
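
One possibility, as an illustrative sketch (the mask and values are my own choice, picked so the expected floats are exact): use a sign-bit-only mask, so a correct bit select produces a copysign-style result, which the lane-select codegen shown above (each mask lane reduced to its lowest bit) does not:

    #[simd_test(enable = "neon")]
    unsafe fn test_vbslq_f32_bits() {
        // Sign-bit-only mask: the result should take each lane's sign bit
        // from `b` and all remaining bits from `c`.
        let a = u32x4::new(0x8000_0000, 0x8000_0000, 0x8000_0000, 0x8000_0000);
        let b = f32x4::new(-1.0, -2.0, 3.0, -4.0);
        let c = f32x4::new(1.0, 2.0, 3.0, 4.0);
        let e = f32x4::new(-1.0, -2.0, 3.0, -4.0);
        let r: f32x4 = transmute(vbslq_f32(transmute(a), transmute(b), transmute(c)));
        assert_eq!(r, e);
    }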

Amanieu commented 2 years ago

Looks good!

For the instruction test you can disable it by setting the instruction to nop. I think it's more hassle than it's worth in this case.