Open Hendiadyoin1 opened 2 years ago
Brief notes:

looking at the man page for `rint` and `lrint`, it seems that `rint` cannot fail in practice, while `lrint` does not seem to set errno and can be assumed infallible with `-ffinite-math-only`.

Following the infallibility logic, patterns like `(int)floor` could be replaced by cw/mxcsr accesses and conversions instead of a call and a truncation, which would access the control registers in a lot of cases anyway (or `roundss` + any conversion).
Looking at the output with `-fno-math-errno`, LLVM and GCC seem to replace `lrint` correctly with `cvtss2si`, but fail to vectorize it.
Edit: The lrint man page seems quite weird. It states:

> See math_error(7) for information on how to determine whether an error has occurred when calling these functions. The following errors can occur: Domain error: x is a NaN or infinite, or the rounded value is too large. An invalid floating-point exception (FE_INVALID) is raised. These functions do not set errno.

So it says it can raise an exception and cause a domain error, but not touch errno, although domain errors are signaled in errno?
ping for this, mesa want this at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19978
> Edit: The lrint man page seems quite weird: It states: "See math_error(7) for information on how to determine whether an error has occurred when calling these functions. The following errors can occur: Domain error: x is a NaN or infinite, or the rounded value is too large. An invalid floating-point exception (FE_INVALID) is raised. These functions do not set errno." So it says it can raise an exception and cause a domain error, but not touch errno, although domain errors are signaled in errno?

not for these functions, no. the FE_INVALID is the clue here --- it's telling you that you need to use the `nearbyint()` family, which can't report errors via any mechanism.

and note also that the C standard doesn't guarantee any floating-point exceptions here either --- the exact wording in the standard is "The rint functions differ from the nearbyint functions (7.12.9.3) only in that the rint functions *may* raise the "inexact" floating-point exception if the result differs in value from the argument." (emphasis mine)
-*-
anyway, i'm here as the maintainer of Android's libc ("bionic"), having been pointed at this bug after i asked our llvm folks if they knew why -- unlike arm64 and risc-v -- `__builtin_lrint` and `__builtin_lrintf` aren't implemented for x86/x86-64, despite the fact that they're both single instructions. (Android's LP64 ABIs have ld128, so there's no optimization for `lrintl()` for us.)

"i'd use this" (well, the `__builtin_`s anyway) :-)
LLVM doesn't currently combine `(int)rint(f)` -> `lrint(f)`.
Sounds like simplifylibcalls could use some additions here? cc @davidbolvansky
So it looks like clang (the front end) is lowering calls to `__builtin_lrint` to either `@llvm.lrint.i64.f64` (as expected) or `@lrint` based on `-fno-math-errno`? Looking at https://cs.android.com/android/platform/superproject/+/master:bionic/libm/builtins.cpp;l=69?q=lrint&ss=android%2Fplatform%2Fsuperproject:bionic%2Flibm%2F, @enh-google, does bionic build libm with `-fno-math-errno` (or some flag that implicitly sets it) for `__aarch64__` and `__riscv` but not `__x86_64__`/`__i386__`? It does look like LLVM can lower those intrinsics on x86 when the frontend does generate them:

https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/lrint-conv-i32.ll
cc @phoebewang
https://cs.android.com/android/platform/superproject/+/master:bionic/libm/Android.bp;l=432 shows that bionic does build with `-fno-math-errno` for all targets. It might be good to create a small reproducer in the Android build system (in `bionic/` itself) to make sure that there's not another flag interfering with this somehow, or that the places where this would be expected to show up aren't doing something that interferes with `@llvm.lrint.i64.f64` being selected.
sorry, missed that there was a question for me...
yes, bionic does explicitly set `-fno-math-errno` when building libm.
yes, it does that unconditionally for all architectures.
(and note that we have a bunch of other x86/x86-64 builtins that work fine in https://cs.android.com/android/platform/superproject/+/main:bionic/libm/builtins.cpp.)
does anyone have a godbolt link where this actually works? i haven't been able to come up with one. though

```c
double r(double x) { return __builtin_rint(x); }
float rf(float x) { return __builtin_rintf(x); }
long double fl(long double x) { return __builtin_rintl(x); }
```

does work in gcc. but trunk clang as of godbolt today still emits function calls.
@appujee
https://godbolt.org/z/4KsKv9f69 shows clang generating roundss/roundsd, if that's what you're asking.
Checked also again with `lrint`, which successfully generates a single `cvtsd2si`.

Sadly doing so for a vector does not seem to be coalesced, so reducing the scope: https://godbolt.org/z/Pofx7eP4G

Casts/truncations seem to be properly caught though.
> https://godbolt.org/z/4KsKv9f69 shows clang generating roundss/roundsd, if that's what you're asking.
ah, thanks! so it looks like what we're actually missing is `-msse4`, since that's not in the Android x86 ABI. does SSE4.1 or SSE4.2 not imply SSE4 though? i'm surprised from https://developer.android.com/ndk/guides/abis#sa that this doesn't "just work" on x86-64... certainly changing your `-msse4` to `-msse4.1` still seems to work. and our build system looks like it actually explicitly says `-msse4 -msse4.1 -msse4.2`, so i'll have to have a deeper look there and work out what's going on...
> Sadly doing so for a vector does not seem to be coalesced, so reducing the scope https://godbolt.org/z/Pofx7eP4G
I think this is just SLP not detecting the pattern and transforming it to a vector operation; you might want to file a separate bug specifically about SLP.
I have improved the lowering of vector lrint/llrint by https://github.com/llvm/llvm-project/pull/90065. I think the x86 specific issue can be closed.
For Android this optimization is still missed: https://godbolt.org/z/xKEfeq97E
```
$ clang -O2 -ffast-math

# After x86-isel
# Machine code for function fl1(double): IsSSA, TracksLiveness
Frame Objects:
  fi#0: size=8, align=8, at location [SP+8]
  fi#1: size=8, align=8, at location [SP+8]
Function Live Ins: $xmm0 in %0

bb.0.entry:
  liveins: $xmm0
  %0:fr64 = COPY $xmm0
  MOVSDmr %stack.1, 1, $noreg, 0, $noreg, %0:fr64 :: (store (s64) into %stack.1); example.cpp:1:46
  %1:rfp80 = nofpexcept LD_Fp64m80 %stack.1, 1, $noreg, 0, $noreg, implicit-def dead $fpsw, implicit $fpcw :: (load (s64) from %stack.1); example.cpp:1:46
  IST_Fp64m80 %stack.0, 1, $noreg, 0, $noreg, killed %1:rfp80, implicit-def dead $fpsw, implicit $fpcw :: (store (s64) into %stack.0); example.cpp:1:29
  %2:gr64 = MOV64rm %stack.0, 1, $noreg, 0, $noreg :: (load (s64) from %stack.0); example.cpp:1:29
  $rax = COPY %2:gr64; example.cpp:1:22
  RET 0, $rax; example.cpp:1:22

# End machine code for function fl1(double).
```
```
$ clang -O2 -ffast-math --target=x86_64-linux-android

# After x86-isel
# Machine code for function fl1(double): IsSSA, TracksLiveness
Function Live Ins: $xmm0 in %0

bb.0.entry:
  liveins: $xmm0
  %0:fr64 = COPY $xmm0
  ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
  $xmm0 = COPY %0:fr64; example.cpp:1:46
  CALL64pcrel32 target-flags(x86-plt) &__extenddftf2, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0, implicit-def $rsp, implicit-def $ssp, implicit-def $xmm0; example.cpp:1:46
  ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
  %1:vr128 = COPY $xmm0; example.cpp:1:46
  $xmm0 = COPY %1:vr128; example.cpp:1:29
  TCRETURNdi64 target-flags(x86-plt) &lrintl, 0, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0; example.cpp:1:29

# End machine code for function fl1(double).
```
> I have improved the lowering of vector lrint/llrint by #90065. I think the x86 specific issue can be closed.
Sadly this seems to not be able to see through the truncation to 32-bit in this example: https://godbolt.org/z/Pofx7eP4G

with `-m32` it works nicely though, great work!
For https://godbolt.org/z/xKEfeq97E, the problem is that `long double` is interpreted as fp128 instead of fp80 on Android. If you don't really need `long double`, you can use `__builtin_lrint`.

For https://godbolt.org/z/Pofx7eP4G, the difference comes from the IR, which would be an SLP issue as @efriedma-quic suggested.

Neither is an x86-specific issue.
yeah, no Android-supported architecture has hardware fp128 support, so that's expected. and i messed up in my previous comment --- copying directly from the bionic source now, what i'm trying to do is add `__i386__` and `__x86_64__` to

```c
#if defined(__aarch64__) || defined(__riscv)
long lrint(double x) { return __builtin_lrint(x); }
long lrintf(float x) { return __builtin_lrintf(x); }
long long llrint(double x) { return __builtin_llrint(x); }
long long llrintf(float x) { return __builtin_llrintf(x); }
#endif
```

and the reason it doesn't work (found by @pirama-arumuga-nainar yesterday) is that we `-include` a header containing `#pragma STDC FENV_ACCESS ON` before this. if i hack a quick `#pragma STDC FENV_ACCESS OFF` into this file, i see the same [inlined single instruction] results from the [much] earlier godbolt link.

...but unless i misunderstood https://www.felixcloutier.com/x86/cvtsd2si -- "When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register" sounds like this instruction does respect the rounding mode -- so why does `#pragma STDC FENV_ACCESS ON` disable use of it for x86/x86-64?
It's not a bug. When `#pragma STDC FENV_ACCESS ON` is in effect, the front end generates `llvm.experimental.constrained.lrint`. We just haven't handled the `STRICT_LRINT` nodes yet.
X87 and SSE have simple rounding and converting store instructions, which are essentially equivalent to `l{0,2}rint[fl]?`. Clang/LLVM does not seem to replace calls to `rint` with these, and neither does it vectorise these when used to round/convert vectors in all cases. (Truncation is properly replaced.) Some examples follow below. GCC is listed as well; the main difference to them is that they schedule their `fldcw` for truncation earlier and replace `rintl`, as well as use some bit-magic for `rintf`.

Note: Using `f32x4` for `float __vector(4)` and `i32x4` for `int __vector(4)`.
Note: `cvtss2si` != `cvttss2si`.
Note: Assuming overflows etc. are UB, and the HW's behaviour is acceptable.

| Expression | Clang | GCC | Optimal |
| --- | --- | --- | --- |
| `rintl` | `call rintl@PLT` | `frndint` | `frndint` |
| `(int)rintl` | `call rintl@PLT` + truncation | `call rintl@PLT` + truncation | `fistp m16/m32/m64` |
| `lrintl` | `call lrintl` | `call lrintl` | `fistp m16/m32/m64` |
| `lrint` | `call lrintl` | `call lrintl` | `cvtss2si r32/r64, xmmX` |
| `(int)rintf` | `call rintf@PLT` + `cvttss2si` | `cvttss2si` | `cvtss2si r32, xmmX` |
| `(int)rintf` (SSE4.2) | `roundss` + `cvttss2si` | `roundss` + `cvttss2si` | `cvtss2si r32, xmmX` |
| `4x lrintf` (f32x4 -> i32x4) | `call lrintl` | `call lrintl` | `cvtps2dq xmmY, xmmX` |

Tested using godbolt and `x86_64 Clang 14.0.0` as well as `x86_64 GCC 11.2` with O2 and O3.

Update: Seems like most cases are now caught; only coalescing `cvtss2si`s to `cvtps2dq` remains.