llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Clang LLVM emitting float instructions for bitwise operation on integer inputs #57499

Open matouskozak opened 2 years ago

matouskozak commented 2 years ago

When executing bitwise operations on Vector128<int> in the Mono runtime on an x86-64 CPU, LLVM emits single-precision floating-point instructions. Example for a logical bitwise AND:
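The original C# snippet was not captured here; the following is a minimal sketch of the kind of code involved, assuming the cross-platform Vector128 helpers (the class and method names are illustrative; Sse2.And or the & operator would behave the same way):

using System.Runtime.Intrinsics;

static class VectorAndExample
{
    // Bitwise AND of two Vector128<int> values. Mono's LLVM backend
    // lowers this to the `and <4 x i32>` seen in the IR below.
    public static Vector128<int> AndVectors(Vector128<int> a, Vector128<int> b)
        => Vector128.BitwiseAnd(a, b);
}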

This is the LLVM IR that gets generated:

; function header reconstructed for readability; the original symbol name was not captured, and the signature is inferred from the body
define void @bitwise_and(<4 x i32> %arg_vectorA, <4 x i32> %arg_vectorB, i64 %vret) #0 {
BB0:
  %0 = and <4 x i32> %arg_vectorA, %arg_vectorB
  %1 = inttoptr i64 %vret to <4 x i32>*
  store <4 x i32> %0, <4 x i32>* %1, align 1
  ret void
}

And the assembly from this:

<BB>:
   0:   c5 f8 54 c1             vandps %xmm1,%xmm0,%xmm0
   4:   c5 f8 11 07             vmovups %xmm0,(%rdi)
   8:   c3                      retq

Could someone please explain to me why there is a vandps and not a vpand? Thank you.

llvmbot commented 2 years ago

@llvm/issue-subscribers-clang-codegen

topperc commented 2 years ago

Largely historical.

The SSE encoding of andps is one byte shorter than that of pand. I'm not sure whether there is a difference for AVX; it depends on which instructions can use the 2-byte VEX prefix.
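Concretely, for the register-register forms (encodings as documented in the Intel SDM, shown here only to illustrate the size difference):

   0f 54 c1                andps  %xmm1,%xmm0      # 3 bytes, no mandatory prefix
   66 0f db c1             pand   %xmm1,%xmm0      # 4 bytes, needs the 0x66 prefix

With AVX, both vandps and vpand can use the 2-byte VEX prefix (the 0x66 prefix folds into the VEX.pp field), so their VEX-encoded register forms end up the same length.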

It also used to be that there were more execution units available for andps than pand on Intel CPUs. I don’t think that’s true on modern CPUs.

These were the two main reasons for converting integer bitwise operations to their FP-domain equivalents.

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

jandupej commented 2 years ago

Addressing this may provide a minor performance advantage when there is a long dependency chain. Ice Lake/Tiger Lake (ICL/TGL) have a penalty of one clock cycle when the result of a vector integer operation is used by a vector float operation; Zen 3 has that penalty when transitioning in either direction (FP to int and vice versa). See Agner Fog's software optimization resources (https://www.agner.org/optimize/).

topperc commented 2 years ago

We should only be doing the conversion if we can convert the surrounding instructions too. That's why we also used an FP-domain store (vmovups rather than vmovdqu) here.
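For comparison, a fully integer-domain version of the same block would look like this (a hand-assembled sketch, not actual compiler output); with VEX encodings both sequences are the same size, so today the choice mainly matters for the domain-crossing penalties mentioned above:

   c5 f9 db c1             vpand  %xmm1,%xmm0,%xmm0
   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
   c3                      retq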