I agree, but why the xor with `u32::MAX` instead of a not, or following ARM's docs and using `operand1 EOR ((operand1 EOR operand4) AND operand3)`, which also optimized to the expected code when I used it as a workaround in my code?
Edit: OK, it's simply what Clang does on aarch64, although I remain unclear on their reasoning.
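For reference, a minimal sketch of how that ARM-documented formulation could be expressed with the same `simd_*` platform intrinsics (this is my own illustration, not code from this thread, and assumes the stdarch-internal `simd_and`/`simd_xor` and `transmute` are in scope):

```rust
// Hypothetical sketch: result = c ^ ((c ^ b) & a), i.e. take bits of b where
// the mask a is 1 and bits of c where it is 0 -- the same result as BSL.
pub unsafe fn vbsl_s8(a: uint8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t {
    let b: uint8x8_t = transmute(b);
    let c: uint8x8_t = transmute(c);
    transmute(simd_xor(c, simd_and(simd_xor(c, b), a)))
}
```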
What is the reason to implement the intrinsic via `simd_*` platform intrinsics and not via a direct LLVM intrinsic? At least that's what I expected to happen. With the current vbsl implementation, my code runs ~11% slower compared to using vbsl via inline assembler.
What code do I need to submit to fix this issue?
I'm also surprised that the `assert_instr(bsl)` passed through CI.
Mainly to be consistent with Clang's behavior, which takes full advantage of LLVM's automatic vector optimizations.
> What code do I need to submit to fix this issue?
We should modify the implementation of `vbsl*` in `core_arch/src/arm_shared/neon/mod.rs` and `core_arch/src/aarch64/neon/mod.rs`. It can be replaced with a combination of platform intrinsics like `simd_and` and `simd_xor`, or we can link directly to the LLVM intrinsics using `#[link_name = "llvm.*"]`.
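As an illustration of the second option, the `#[link_name]` route looks roughly like the sketch below. This is mine, not from the thread, and the `llvm.arm.neon.vbsl.v8i8` name is an assumption (it is the classic 32-bit ARM NEON intrinsic; whether an AArch64 equivalent exists is not established here):

```rust
// Hypothetical sketch of binding an LLVM intrinsic directly; the intrinsic
// name is an assumption and would need to be verified against LLVM.
#[allow(improper_ctypes)]
extern "unadjusted" {
    #[link_name = "llvm.arm.neon.vbsl.v8i8"]
    fn vbsl_s8_(a: int8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t;
}

#[inline]
#[target_feature(enable = "neon")]
pub unsafe fn vbsl_s8(a: uint8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t {
    vbsl_s8_(transmute(a), b, c)
}
```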
Which one is preferred? How can I make sure that the `assert_instr` tests actually run?
The priority here is to match Clang's behavior and generate the same LLVM IR. The `assert_instr` tests are mainly used as a sanity check, not a hard requirement, because the intrinsics don't actually guarantee any specific instruction; you have to use inline assembly if you depend on that.
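For completeness, guaranteeing a `bsl` really does require inline assembly, along these lines (a sketch of mine, not code from this thread; the function name is made up):

```rust
// Hypothetical sketch: force a BSL regardless of what the optimizer would do.
// BSL uses its destination register as the mask, so the mask goes in `inout`.
#[cfg(target_arch = "aarch64")]
unsafe fn bsl_u8x16(
    mask: core::arch::aarch64::uint8x16_t,
    a: core::arch::aarch64::uint8x16_t,
    b: core::arch::aarch64::uint8x16_t,
) -> core::arch::aarch64::uint8x16_t {
    use core::arch::asm;
    let mut out = mask;
    asm!(
        "bsl {out:v}.16b, {a:v}.16b, {b:v}.16b",
        out = inout(vreg) out,
        a = in(vreg) a,
        b = in(vreg) b,
        options(pure, nomem, nostack)
    );
    out
}
```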
I realized why the codegen checks did not fail: the generated code contains the required instruction, but it also contains additional instructions that we don't want, because they change the semantics of the intrinsic (the `shl`/`cmlt` pair reduces each mask lane to all-ones or all-zeros based on its lowest bit, i.e., a lane select rather than a bit select):
0000000100086f2c <_stdarch_test_shim_vbslq_f32_bsl>:
100086f2c: 00 54 3f 4f shl.4s v0, v0, #31
100086f30: 00 a8 a0 4e cmlt.4s v0, v0, #0
100086f34: 20 1c 62 6e bsl.16b v0, v1, v2
100086f38: c0 03 5f d6 ret
Should we add a negative assertion option ("does not contain shl"), or should we just add a test for the correct behavior?
What do you think about writing the `vbsl*` intrinsics this way:
pub unsafe fn vbsl_s8(a: uint8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t {
    let not = int8x8_t(-1, -1, -1, -1, -1, -1, -1, -1);
    transmute(simd_or(
        simd_and(a, transmute(b)),
        simd_and(simd_xor(a, transmute(not)), transmute(c)),
    ))
}
This matches the Clang codegen.
Currently the unit tests only use `::MAX` and `::MIN` for the mask and therefore do not distinguish between a lane select and a true bitwise select. How should we change the tests? Using other integers that do or do not have the lane-select bit set does reveal the current bug, but looks a bit ad hoc:
#[simd_test(enable = "neon")]
unsafe fn test_vbsl_s16() {
    let a = u16x4::new(u16::MAX, 0, 1, 2);
    let b = i16x4::new(i16::MAX, i16::MAX, i16::MAX, i16::MAX);
    let c = i16x4::new(i16::MIN, i16::MIN, i16::MIN, i16::MIN);
    let e = i16x4::new(i16::MAX, i16::MIN, i16::MIN | 1, i16::MIN | 2);
    let r: i16x4 = transmute(vbsl_s16(transmute(a), transmute(b), transmute(c)));
    assert_eq!(r, e);
}
Testing float bit selects seems difficult in any case.
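One possible way around that (my own sketch, not from this thread; the test name and values are made up, and it assumes `u32x2`/`f32x2` analogues of the vector helpers used in the existing tests) is to compare the raw bit pattern of the result instead of the float values:

```rust
// Hypothetical sketch: check a float bit select by comparing bit patterns,
// so lanes whose bits get mixed don't have to be meaningful float values.
#[simd_test(enable = "neon")]
unsafe fn test_vbsl_f32_bits() {
    let a = u32x2::new(u32::MAX, 1); // lane 1 selects only bit 0 from b
    let b = f32x2::new(8.0, 8.0);
    let c = f32x2::new(-1.0, -1.0);
    let r: u32x2 = transmute(vbsl_f32(transmute(a), transmute(b), transmute(c)));
    let e = u32x2::new(
        8.0f32.to_bits(),
        ((-1.0f32).to_bits() & !1) | (8.0f32.to_bits() & 1),
    );
    assert_eq!(r, e);
}
```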
Looks good!
For the instruction test, you can disable it by setting the instruction to `nop`. I think it's more hassle than it's worth in this case.
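For reference, the `nop` escape hatch mentioned above would look roughly like this on the proposed intrinsic (my sketch; the attribute combination follows the pattern used elsewhere in stdarch):

```rust
// Per the note above, asserting nop disables the instruction check
// for this intrinsic; the body is the bitwise-select proposal from earlier.
#[inline]
#[target_feature(enable = "neon")]
#[cfg_attr(test, assert_instr(nop))]
pub unsafe fn vbsl_s8(a: uint8x8_t, b: int8x8_t, c: int8x8_t) -> int8x8_t {
    let not = int8x8_t(-1, -1, -1, -1, -1, -1, -1, -1);
    transmute(simd_or(
        simd_and(a, transmute(b)),
        simd_and(simd_xor(a, transmute(not)), transmute(c)),
    ))
}
```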
I tried this code:

using this Cargo configuration:

I expected to see this happen: the generated code includes a `vbsl`/`vbit`/`vbif` instruction, i.e., like Clang's output for an equivalent C function.

Instead, this happened: the function is optimized to returning
to
:

We discussed this issue on Zulip, and it appears that all NEON `vbsl*_*` intrinsics are implemented using `simd_select`, which does lane selection instead of bitwise selection. The issue affects both `aarch64` and `armv7` targets.

Meta

`rustc --version --verbose`:

Backtrace
```
PS D:\development\neon-test> $env:RUST_BACKTRACE="1"
PS D:\development\neon-test> cargo build --release
   Compiling neon-test v0.1.0 (D:\development\neon-test)
    Finished release [optimized] target(s) in 0.65s
```