llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.12k stars 12.01k forks source link

Missed Optimization - Replacement of rint/lrint with X87/SSE specific instructions #55202

Open Hendiadyoin1 opened 2 years ago

Hendiadyoin1 commented 2 years ago

X87 and SSE have simple rounding and converting store instructions, which are essentially equivalent to l{0,2}rint[fl]?

Clang/LLVM does not seem to replace calls to rint with these, and neither does it vectorise these when used to round/convert vectors in all cases. (truncation is properly replaced)

Some examples follow below

GCC is listed aswell, The main difference to them is, that they do schedule their fldcw for truncation earlier and replace rintl, as well as use some bit-magic for rintf

Note: Using f32x4 for float __vector(4) and i32x4 for int __vector(4) Note: cvtss2si != cvttss2si Note: Assuming Overflows etc are UB, and HW's behaviour is acceptable

Scenario LLVM GCC Effective instruction(s)
rintl call rintl@PLT frndint frndint
(int)rintl call rintl@PLT +truncation call rintl@PLT +truncation fistp m16/m32/m64
lrintl call lrintl call lrintl fistp m16/m32/m64
lrint call lrintl call lrintl cvtss2si r32/r64, xmmX
(int)rintf call rintf@PLT;cvttss2si Bit magic+cvttss2si cvtss2si r32, xmmX
(int)rintf (SSE4.2) roundss + cvttss2si roundss + cvttss2si cvtss2si r32, xmmX
4x lrintf (f32x4->i32x4) 4x (shuffle+call lrintl) 4x (shuffle+call lrintl cvtps2dq xmmY, xmmX

Tested using glodbolt and x86_64 Clang 14.0.0 as well as x86_64 GCC 11.2 with O2 and O3


Update: Seems like most cases are now cought, only coalecsing cvtss2sis to cvtps2dq

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

efriedma-quic commented 2 years ago

Brief notes:

  1. Some additional optimizations are enabled here by -fno-math-errno. We could probably be more aggressive here by default.
  2. LLVM doesn't currently combine (int)rint(f) -> lrint(f).
  3. LLVM doesn't currently have any patterns to generate frndint.
  4. The "bit magic" gcc does is mostly a floating-point trick. The rounding step of a floating-point "add" can be used to round a number. The rest is just ensuring the number is small, and doing the necessary sign bit manipulation.
Hendiadyoin1 commented 2 years ago

looking at the man page for rint and lrint, it seems that rint cannot fail in practice, while lrint does not seem to set errno, and can be assumed infallible with -ffinite-math-only and following the infallibility logic, patterns like "(int)floor" could be replaced by cw/mxcsr accesses and conversions instead of a call and a truncation, which would access the controll registers in a lot of cases anyway (or roundss+any-conversion)

Looking at the output with -fno-math-errno LLVM and GCC seem replace lrint correctly with cvtss2si, but fails to vecotrize it.

Edit: The lrint man page seems quite weird: It states:

See math_error(7) for information on how to determine whether an error has >occurred when calling these functions.

The following errors can occur:

Domain error: x is a NaN or infinite, or the rounded value is too large An invalid floating-point exception (FE_INVALID) is raised. These functions do not set errno.

So it says it can raise an exception and cause a domnain-error, but not touch errno, although domain errors are signaled in errno?

lygstate commented 1 year ago

ping for this, mesa want this at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19978

enh-google commented 1 year ago

Edit: The lrint man page seems quite weird: It states:

See math_error(7) for information on how to determine whether an error has >occurred when calling these functions. The following errors can occur: Domain error: x is a NaN or infinite, or the rounded value is too large An invalid floating-point exception (FE_INVALID) is raised. These functions do not set errno.

So it says it can raise an exception and cause a domnain-error, but not touch errno, although domain errors are signaled in errno?

not for these functions, no. the FE_INVALID is the clue here --- it's telling you that you need to use the functionality in to recognize errors from these functions. (as opposed to the nearbyint() family, which can't report errors via any mechanism.)

and note also that the C standard doesn't guarantee any floating point exceptions here either --- the exact wording in the standard is "The rint functions differ from the nearbyint functions (7.12.9.3) only in that the rint functions may raise the "inexact" floating-point exception if the result differs in value from the argument." (emphasis mine)

-*-

anyway, i'm here as the maintainer of Android's libc ("bionic"), having been pointed at this bug after i asked our llvm folks if they knew why -- unlike arm64 and risc-v --__builtin_lrint and __builtin_lrintf aren't implemented for x86/x86-64, despite the fact that they're both single instructions. (Android's LP64 ABIs have ld128, so there's no optimization for lrintl() for us.)

"i'd use this" (well, the __builtin_s anyway) :-)

nickdesaulniers commented 1 year ago

LLVM doesn't currently combine (int)rint(f) -> lrint(f).

Sounds like simplifylibcalls could use some additions here? cc @davidbolvansky

So it looks like clang (the front end) is lowering calls to __builtin_lrint to either @llvm.lrint.i64.f64 (as expected) or @lrint based on -fno-math-errno? Looking at https://cs.android.com/android/platform/superproject/+/master:bionic/libm/builtins.cpp;l=69?q=lrint&ss=android%2Fplatform%2Fsuperproject:bionic%2Flibm%2F, @enh-google does bionic build libm with -fno-math-errno (or some flag that implicitly sets that) for __aarch64__ and __riscv but not __x86_64__/__i386__? It does look like LLVM can lower those intrinsics on x86, when the frontend does generate them. https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/lrint-conv-i32.ll

cc @phoebewang

stephenhines commented 1 year ago

https://cs.android.com/android/platform/superproject/+/master:bionic/libm/Android.bp;l=432 shows that bionic does build with -fno-math-errno for all targets. It might be good to create a small reproducer in the Android build system (in bionic/ itself) to make sure that there's not another flag interfering with this somehow, or just that the places where this would be expected to show up aren't doing something that interferes with @llvm.lrint.i64.f64 being selected.

enh-google commented 4 months ago

sorry, missed that there was a question for me...

yes, bionic does explicitly set -fno-math-errno when building libm.

yes, it does that unconditionally for all architectures.

(and note that we have a bunch of other x86/x86-64 builtins that work fine in https://cs.android.com/android/platform/superproject/+/main:bionic/libm/builtins.cpp.)

does anyone have a godbolt link where this actually works? i haven't been able to come up with one. though

double r(double x) { return __builtin_rint(x); }
float rf(float x) { return __builtin_rintf(x); }
long double fl(long double x) { return __builtin_rintl(x); }

does work in gcc. but trunk clang as of godbolt today still emits function calls.

@appujee

efriedma-quic commented 4 months ago

https://godbolt.org/z/4KsKv9f69 shows clang generating roundss/roundsd, if that's what you're asking.

Hendiadyoin1 commented 4 months ago

Checked also again with lrint which successfully generates a single cvtsd2si Sadly doing so for a vector does not seem to be coalesced, so reducing the scope https://godbolt.org/z/Pofx7eP4G Casts/truncation seems to be properly cought though

enh-google commented 4 months ago

https://godbolt.org/z/4KsKv9f69 shows clang generating roundss/roundsd, if that's what you're asking.

ah, thanks! so it looks like what we're actually missing is -msse4, since that's not in the Android x86 ABI. does SSE4.1 or SSE4.2 not imply SSE4 though? i'm surprised from https://developer.android.com/ndk/guides/abis#sa that this doesn't "just work" on x86-64... certainly changing your -msse4 to -msse4.1 still seems to work. and our build system looks like it actually explicitly says -msse4 -msse4.1 -msse4.2, so i'll have to have a deeper look there and work out what's going on...

efriedma-quic commented 4 months ago

Sadly doing so for a vector does not seem to be coalesced, so reducing the scope https://godbolt.org/z/Pofx7eP4G

I think this is just SLP not detecting the pattern and transforming it to a vector operation; you might want to file a separate bug specifically about SLP.

phoebewang commented 4 months ago

I have improved the lowering of vector lrint/llrint by https://github.com/llvm/llvm-project/pull/90065. I think the x86 specific issue can be closed.

appujee commented 4 months ago

For android this optimization is still missed. https://godbolt.org/z/xKEfeq97E

$ clang -O2 -ffast-math

# After x86-isel
#Machine code for function fl1(double): IsSSA, TracksLiveness
Frame Objects:
  fi#0: size=8, align=8, at location [SP+8]
  fi#1: size=8, align=8, at location [SP+8]
Function Live Ins: $xmm0 in %0

bb.0.entry:
  liveins: $xmm0
  %0:fr64 = COPY $xmm0
  MOVSDmr %stack.1, 1, $noreg, 0, $noreg, %0:fr64 :: (store (s64) into %stack.1); example.cpp:1:46
  %1:rfp80 = nofpexcept LD_Fp64m80 %stack.1, 1, $noreg, 0, $noreg, implicit-def dead $fpsw, implicit $fpcw :: (load (s64) from %stack.1); example.cpp:1:46
  IST_Fp64m80 %stack.0, 1, $noreg, 0, $noreg, killed %1:rfp80, implicit-def dead $fpsw, implicit $fpcw :: (store (s64) into %stack.0); example.cpp:1:29
  %2:gr64 = MOV64rm %stack.0, 1, $noreg, 0, $noreg :: (load (s64) from %stack.0); example.cpp:1:29
  $rax = COPY %2:gr64; example.cpp:1:22
  RET 0, $rax; example.cpp:1:22

# End machine code for function fl1(double).

clang -O2 -ffast-math --target=x86_64-linux-android

# After x86-isel
# Machine code for function fl1(double): IsSSA, TracksLiveness
Function Live Ins: $xmm0 in %0

bb.0.entry:
  liveins: $xmm0
  %0:fr64 = COPY $xmm0
  ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
  $xmm0 = COPY %0:fr64; example.cpp:1:46
  CALL64pcrel32 target-flags(x86-plt) &__extenddftf2, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0, implicit-def $rsp, implicit-def $ssp, implicit-def $xmm0; example.cpp:1:46
  ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
  %1:vr128 = COPY $xmm0; example.cpp:1:46
  $xmm0 = COPY %1:vr128; example.cpp:1:29
  TCRETURNdi64 target-flags(x86-plt) &lrintl, 0, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0; example.cpp:1:29

# End machine code for function fl1(double).
Hendiadyoin1 commented 4 months ago

I have improved the lowering of vector lrint/llrint by #90065. I think the x86 specific issue can be closed.

Sadly this seems to not be able to see through the truncation to 32bit in this example: https://godbolt.org/z/Pofx7eP4G with -m32 it works nicely though, great work

phoebewang commented 4 months ago

For https://godbolt.org/z/xKEfeq97E, the problem is long double interpreted as fp128 instead of fp80 on android. If you don't really need long double, you can use __builtin_lrint. For https://godbolt.org/z/Pofx7eP4G, the difference comes from IR, which would be SLP issue as @efriedma-quic suggested. Both are not x86 specific issues.

enh-google commented 4 months ago

yeah, no Android-supported architecture has hardware fp128 support, so that's expected. and i messed up in my previous comment --- copying directly from the bionic source now, what i'm trying to do is add __i386__ and __x86_64__ to

#if defined(__aarch64__) || defined(__riscv)
long lrint(double x) { return __builtin_lrint(x); }
long lrintf(float x) { return __builtin_lrintf(x); }
long long llrint(double x) { return __builtin_llrint(x); }
long long llrintf(float x) { return __builtin_llrintf(x); }
#endif

and the reason it doesn't work (found by @pirama-arumuga-nainar yesterday) is that we -include a header containing #pragma STDC FENV_ACCESS ON before this. if i hack in a quick #pragma STDC FENV_ACCESS OFF into this file, i see the same [inlined single instruction] results from the [much] earlier godbolt link.

...but unless i misunderstood https://www.felixcloutier.com/x86/cvtsd2si "When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register" sounds like this instruction does take into account (just like the corresponding arm64 and riscv64 instructions), so it seems like a bug that #pragma STDC FENV_ACCESS ON disables use of it for x86/x86-64?

phoebewang commented 4 months ago

It's not a bug. When #pragma STDC FENV_ACCESS ON, the front end will generate llvm.experimental.constrained.lrint. We just haven't handled the STRICT_LRINT nodes yet.