Open 54aefcd4-c07d-4252-8441-723563c8826f opened 6 years ago
Don't you still need the function attributes to omit the default N-R step? I.e., what's the definition of "#1" here?
    define <4 x float> @rsqrt(<4 x float>) #1 {
Note that we're getting closer to full propagation of IR fast-math-flags to DAG flags: https://reviews.llvm.org/D46563
Not sure if that's going to change anything for the remaining problems described here.
Doesn't it already?
Indeed! The following:
    define <4 x float> @rsqrt(<4 x float>) #1 {
      %a = call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)
      %c = fdiv fast <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %a
      ret <4 x float> %c
    }
generates:
    rsqrt:                                  # @rsqrt2
            rsqrtps xmm0, xmm0
            ret
Would it be possible for fast to imply both unsafe-fp-math and reciprocal-estimates so that function attributes aren't necessary to enable this?
Doesn't it already? I thought the issue is not that the reciprocal estimates are not being used, but that you want to omit the fixup Newton iterations. We do need that fixup step in general, even in fast-math mode.
Would it be possible for fast to imply both unsafe-fp-math and reciprocal-estimates so that function attributes aren't necessary to enable this?
The approach using an attribute does not really solve my problem because ...
We've talked about, potentially, hooking up the fpmath metadata to address this issue (http://llvm.org/docs/LangRef.html#fpmath-metadata) but I don't believe that anyone has worked on this yet.
The approach using an attribute does not really solve my problem because mismatching attributes prevent inlining, resulting in call instructions, like here: https://godbolt.org/g/9tRGES
    define <8 x float> @rsqrt2(<8 x float>) #1 {
      %a = call <8 x float> @rsqrt(<8 x float> %0)
      ret <8 x float> %a
    }
which produces
    rsqrt:                                  # @rsqrt
            vrsqrtps ymm0, ymm0
            ret
    rsqrt2:                                 # @rsqrt2
            push    rax
            call    rsqrt
            pop     rax
            ret
with -O3 :/
Thanks, this does work and does what I wanted. I don't know if I should close this or keep it open. I'd expect a combination of math flags in the call to llvm.sqrt and the fdiv to be able to enable this as well without using function attributes.
That's a reasonable request, so let's leave it open, but change the bug title.
But there's no way to specify the (lack of) estimate refinement without using a function attribute even after we fix all of the basic IR to DAG FMF plumbing. Getting to the target default (x86 is rsqrt + 1 NR step) should be possible.
Also, based on our current FMF semantics, it's the fdiv alone that needs to have an FMF decoration ('arcp reassoc' at least, I think), because that's the result you're specifying as not strict.
If the sqrt is loose, then we could approximate that value; but if the fdiv is strict, we'd have to generate a real divps here. That would likely be something that nobody wants: an imprecise and slow result.
I am adding a vector extension for approximate reciprocal square root to the Rust frontend, so that users are able to write portable code that looks like:
    let x: f32x2; let r: f32x2 = x.rsqrt(); // approximate reciprocal square root
    let x: f32x4; let r: f32x4 = x.rsqrt();
    let x: f32x8; let r: f32x8 = x.rsqrt();
    let x: f64x4; let r: f64x4 = x.rsqrt();
    // etc.
Basically I had to choose between doing this in the Rust std library, manually calling the intrinsics of each architecture for all floating-point vector types, or trying to get LLVM to emit the machine code I want and only temporarily working around those architectures in which this isn't the case (after filing bugs for those).
Now that you've shown me how to generate the correct LLVM-IR, this is the route I am going to take.
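For illustration only, the reference semantics of such an rsqrt method can be sketched in plain Rust. The F32x4 wrapper below is hypothetical (it is not the vector type from the Rust frontend), and it computes the exact 1/sqrt lane by lane; a version lowered through LLVM as discussed in this thread could return a hardware estimate instead.

```rust
/// Hypothetical 4-lane f32 vector, standing in for the frontend's f32x4.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl F32x4 {
    /// Lane-wise reciprocal square root (exact reference, not an estimate).
    fn rsqrt(self) -> F32x4 {
        F32x4(self.0.map(|lane| 1.0 / lane.sqrt()))
    }
}
```

Calling F32x4([1.0, 4.0, 16.0, 64.0]).rsqrt() yields the lanes 1.0, 0.5, 0.25 and 0.125.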
If you're starting from clang, it'll look like this:
    $ clang -O1 rsqrt.c -S -o - -ffast-math -mrecip=sqrtf:0
    ...
            rsqrtss %xmm0, %xmm0
            retq
Your code will need to be extremely precision-tolerant for that to work.
Yeah, that's the case. As mentioned in the topic I just want an approximation.
Yes - function attributes.
Thanks, this does work and does what I wanted. I don't know if I should close this or keep it open. I'd expect a combination of math flags in the call to llvm.sqrt and the fdiv to be able to enable this as well without using function attributes.
IIUC rsqrtps should be enough, all those mulps aren't necessary.
Your code will need to be extremely precision-tolerant for that to work. Per the Intel manual, the relative error for this approximation is: |Relative Error| ≤ 1.5 ∗ 2^−12
But if you want to try that, there is a way...
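To put the quoted bound in perspective, here is a small sketch that converts a relative error into a number of correct leading bits. The constant and function names are made up for this illustration; only the bound 1.5 * 2^-12 comes from the comment above.

```rust
/// Relative-error bound quoted above for the rsqrt estimate: 1.5 * 2^-12.
const RSQRT_MAX_REL_ERR: f64 = 1.5 / 4096.0;

/// Number of leading bits that remain correct for a given relative error.
fn correct_bits(rel_err: f64) -> f64 {
    -rel_err.log2()
}
```

correct_bits(RSQRT_MAX_REL_ERR) comes out a little above 11.4, against the 24-bit significand of a single-precision float, which is why the unrefined estimate only suits precision-tolerant code.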
Is there a way to enable -enable-unsafe-fp-math in the IR ?
Yes - function attributes. You can blame me for lack of documentation on this one. Here's the magic string you're looking for:
    $ cat rsqrt.ll
    declare <4 x float> @llvm.sqrt.v4f32(<4 x float>)

    define <4 x float> @rsqrt(<4 x float>) #1 {
      %a = call afn <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)
      %c = fdiv <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %a
      ret <4 x float> %c
    }

    attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="vec-sqrt:0" }

    $ ./llc -o - rsqrt.ll
            rsqrtps %xmm0, %xmm0
            retq
@Sanjay which flags should be enough to enable this optimization?
%a = call afn <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)
Let me answer this one first - until today, 'afn' didn't exist in the DAG: https://reviews.llvm.org/D45710
But that still won't work because we don't propagate IR FMF to intrinsics yet. We'll get there someday...
I've tried that (see it live: https://godbolt.org/g/wxwxSW), it generates pretty bad code too:
    .LCPI0_1:
            .long   3204448256              # float -0.5
            .long   3204448256              # float -0.5
            .long   3204448256              # float -0.5
            .long   3204448256              # float -0.5
    rsqrt:                                  # @rsqrt
            rsqrtps xmm1, xmm0
            movaps  xmm2, xmm1
            mulps   xmm2, xmm1
            mulps   xmm2, xmm0
            addps   xmm2, xmmword ptr [rip + .LCPI0_0]
            mulps   xmm1, xmmword ptr [rip + .LCPI0_1]
            mulps   xmm1, xmm2
            movaps  xmm0, xmm1
            ret
IIUC rsqrtps should be enough, all those mulps aren't necessary.
Is there a way to enable -enable-unsafe-fp-math in the IR ?
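For context, the extra mulps/addps instructions in the listing above have the shape of one Newton-Raphson refinement of the rsqrtps estimate. The scalar Rust sketch below mirrors that instruction order under two assumptions: the constant at .LCPI0_0, which the listing does not show, is taken to be -3.0 (the value that makes the sequence a Newton-Raphson step), and fake_estimate is a hypothetical stand-in for rsqrtps that perturbs the exact result by a chosen relative error so the example stays portable.

```rust
/// One refinement of an estimate `y0` of 1/sqrt(a), in the same order as the
/// listing: (y0 * -0.5) * (y0 * y0 * a + (-3.0)).
/// The -3.0 is an assumption; the listing does not show .LCPI0_0.
fn refine_rsqrt(a: f32, y0: f32) -> f32 {
    (y0 * -0.5) * (y0 * y0 * a + -3.0)
}

/// Hypothetical stand-in for rsqrtps: the exact value scaled by (1 + rel_err).
fn fake_estimate(a: f32, rel_err: f32) -> f32 {
    (1.0 / a.sqrt()) * (1.0 + rel_err)
}

/// Relative error of `y` against 1/sqrt(a) evaluated in f64.
fn rel_error(a: f32, y: f32) -> f64 {
    let exact = 1.0 / (a as f64).sqrt();
    ((y as f64 - exact) / exact).abs()
}
```

Algebraically, an estimate with relative error e comes out of this step with a relative error of roughly 1.5 * e^2, so the refinement buys most of the remaining single-precision accuracy at the cost of those multiplies. Skipping it only makes sense when the raw estimate is accurate enough for the caller.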
Try llc -enable-unsafe-fp-math
Extended Description
The following LLVM IR (see it live: https://godbolt.org/g/88kuky) computes the approximate vector reciprocal square root rsqrt(x) ~= 1/sqrt(x):
    declare <4 x float> @llvm.sqrt.v4f32(<4 x float>)

    define <4 x float> @rsqrt(<4 x float>) {
      %a = call afn <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)
      %c = fdiv <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %a
      ret <4 x float> %c
    }
On x86_64 with -O3 and sse4.2, it generates the following assembly:
    .LCPI0_0:
            .long   1065353216              # float 1
            .long   1065353216              # float 1
            .long   1065353216              # float 1
            .long   1065353216              # float 1
    rsqrt:                                  # @rsqrt
            sqrtps  xmm1, xmm0
            movaps  xmm0, xmmword ptr [rip + .LCPI0_0]
            divps   xmm0, xmm1
            ret
However, it should just emit a single rsqrtps instruction.
I've tried with fast math flags but haven't been able to generate rsqrtps yet.