llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

IR fast-math-flags should enable approximate reciprocal square root #36692

Open 54aefcd4-c07d-4252-8441-723563c8826f opened 6 years ago

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago
Bugzilla Link 37344
Version trunk
OS All
CC @topperc,@ecnelises,@hfinkel,@RKSimon,@rotateright

Extended Description

The following LLVM IR (see it live: https://godbolt.org/g/88kuky) computes the approximate vector reciprocal square root rsqrt(x) ~= 1/sqrt(x):

declare <4 x float> @llvm.sqrt.v4f32(<4 x float>)

define <4 x float> @rsqrt(<4 x float>) {
  %a = call afn <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)
  %c = fdiv <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %a
  ret <4 x float> %c
}

On x86_64 with -O3 and SSE4.2, it generates the following assembly:

.LCPI0_0:
        .long 1065353216 # float 1
        .long 1065353216 # float 1
        .long 1065353216 # float 1
        .long 1065353216 # float 1
rsqrt: # @rsqrt
        sqrtps xmm1, xmm0
        movaps xmm0, xmmword ptr [rip + .LCPI0_0]
        divps xmm0, xmm1
        ret

However, it should just generate a single rsqrtps instruction.

I've tried with fast math flags but haven't been able to generate rsqrtps yet.

rotateright commented 6 years ago

Don't you still need the function attributes to omit the default N-R step? I.e., what's the definition of "#1" here: define <4 x float> @rsqrt(<4 x float>) #1 { ?

Note that we're getting closer to full propagation of IR fast-math-flags to DAG flags: https://reviews.llvm.org/D46563

Not sure if that's going to change anything for the remaining problems described here.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

Doesn't it already?

Indeed! The following:

define <4 x float> @rsqrt(<4 x float>) #1 {
  %a = call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)
  %c = fdiv fast <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %a
  ret <4 x float> %c
}

generates:

rsqrt: # @rsqrt2
        rsqrtps xmm0, xmm0
        ret

hfinkel commented 6 years ago

Would it be possible for fast to imply both unsafe-fp-math and reciprocal-estimates so that function attributes aren't necessary to enable this?

Doesn't it already? I thought the issue is not that the reciprocal estimates are not being used, but that you want to omit the fixup Newton iterations. This later fixup step we do need in general, even in fast-math mode.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

Would it be possible for fast to imply both unsafe-fp-math and reciprocal-estimates so that function attributes aren't necessary to enable this?

hfinkel commented 6 years ago

The approach using an attribute does not really solve my problem because ...

We've talked about, potentially, hooking up the fpmath metadata to address this issue (http://llvm.org/docs/LangRef.html#fpmath-metadata) but I don't believe that anyone has worked on this yet.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

The approach using an attribute does not really solve my problem because mismatching attributes prevent inlining, resulting in call instructions, like here: https://godbolt.org/g/9tRGES

define <8 x float> @rsqrt2(<8 x float>) #1 {
  %a = call <8 x float> @rsqrt(<8 x float> %0)
  ret <8 x float> %a
}

which produces

rsqrt: # @rsqrt
        vrsqrtps ymm0, ymm0
        ret
rsqrt2: # @rsqrt2
        push rax
        call rsqrt
        pop rax
        ret

with -O3 :/

rotateright commented 6 years ago

Thanks, this does work and does what I wanted. I don't know if I should close this or keep it open. I'd expect a combination of math flags in the call to llvm.sqrt and the fdiv to be able to enable this as well without using function attributes.

That's a reasonable request, so let's leave it open, but change the bug title.

But there's no way to specify the (lack of) estimate refinement without using a function attribute even after we fix all of the basic IR to DAG FMF plumbing. Getting to the target default (x86 is rsqrt + 1 NR step) should be possible.

Also based on our current FMF semantics, it's the fdiv alone that needs to have an FMF decoration ('arcp reassoc' at least I think) because that's the result that you're specifying is not strict.

If the sqrt is loose, then we could approximate that value; but if the fdiv is strict, we'd have to generate a real divps here. That would likely be something that nobody wants: an imprecise and slow result.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

I am adding a vector extension for approximate reciprocal square root to the Rust frontend, so that users are able to write portable code that looks like:

let x: f32x2;
let r: f32x2 = x.rsqrt(); // approximate reciprocal square root
let x: f32x4;
let r: f32x4 = x.rsqrt();
let x: f32x8;
let r: f32x8 = x.rsqrt();
let x: f64x4;
let r: f64x4 = x.rsqrt();
// etc.

Basically I had to choose between doing this in the Rust std library, manually calling the intrinsics of each architecture for all floating-point vector types and all archs, or trying to get LLVM to emit the machine code I want and only temporarily working around those architectures in which this isn't the case (after filing bugs for those).

Now that you've shown me how to generate the correct LLVM IR, this is the route I am going to take.

rotateright commented 6 years ago

If you're starting from clang, it'll look like this:

$ clang -O1 rsqrt.c -S -o - -ffast-math -mrecip=sqrtf:0
...
        rsqrtss %xmm0, %xmm0
        retq

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

Your code will need to be extremely precision-tolerant for that to work.

Yeah, that's the case. As mentioned in the topic I just want an approximation.

Yes - function attributes.

Thanks, this does work and does what I wanted. I don't know if I should close this or keep it open. I'd expect a combination of math flags in the call to llvm.sqrt and the fdiv to be able to enable this as well without using function attributes.

rotateright commented 6 years ago

IIUC rsqrtps should be enough, all those mulps aren't necessary.

Your code will need to be extremely precision-tolerant for that to work. Per the Intel manual: the relative error for this approximation is: |Relative Error| ≤ 1.5 ∗ 2^−12
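To put the quoted bound in perspective, here is a rough sketch (plain Python, not part of the original thread) that converts a relative error into an approximate count of correct bits. The helper name `correct_bits` and both constant names are made up for illustration; the float32 figure assumes IEEE-754 binary32 with round-to-nearest.

```python
import math

# Bound quoted from the Intel manual in the comment above.
RSQRT_MAX_REL_ERR = 1.5 * 2.0 ** -12
# Assumption: unit roundoff of IEEE-754 binary32 (half an ulp at 1.0).
FLOAT32_UNIT_ROUNDOFF = 2.0 ** -24


def correct_bits(rel_err):
    """Approximate number of correct leading bits for a given relative error."""
    return -math.log2(rel_err)


estimate_bits = correct_bits(RSQRT_MAX_REL_ERR)
full_bits = correct_bits(FLOAT32_UNIT_ROUNDOFF)

print(f"bare estimate: about {estimate_bits:.1f} bits")
print(f"correctly rounded float32: about {full_bits:.1f} bits")
```

So the bare estimate is good to a bit under half of a float's significand, which is why it only suits code that tolerates a coarse approximation.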

But if you want to try that, there is a way...

Is there a way to enable -enable-unsafe-fp-math in the IR?

Yes - function attributes. You can blame me for lack of documentation on this one. Here's the magic string you're looking for:

$ cat rsqrt.ll
declare <4 x float> @llvm.sqrt.v4f32(<4 x float>)

define <4 x float> @rsqrt(<4 x float>) #1 {
  %a = call afn <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)
  %c = fdiv <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %a
  ret <4 x float> %c
}

attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="vec-sqrt:0" }

$ ./llc -o - rsqrt.ll
        rsqrtps %xmm0, %xmm0
        retq

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

@Sanjay which flags should be enough to enable this optimization?

rotateright commented 6 years ago

%a = call afn <4 x float> @llvm.sqrt.v4f32(<4 x float> %0)

Let me answer this one first - until today, 'afn' didn't exist in the DAG: https://reviews.llvm.org/D45710

But that still won't work because we don't propagate IR FMF to intrinsics yet. We'll get there someday...

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

I've tried that (see it live: https://godbolt.org/g/wxwxSW), it generates pretty bad code too:

.LCPI0_1:
        .long 3204448256 # float -0.5
        .long 3204448256 # float -0.5
        .long 3204448256 # float -0.5
        .long 3204448256 # float -0.5
rsqrt: # @rsqrt
        rsqrtps xmm1, xmm0
        movaps xmm2, xmm1
        mulps xmm2, xmm1
        mulps xmm2, xmm0
        addps xmm2, xmmword ptr [rip + .LCPI0_0]
        mulps xmm1, xmmword ptr [rip + .LCPI0_1]
        mulps xmm1, xmm2
        movaps xmm0, xmm1
        ret

IIUC rsqrtps should be enough, all those mulps aren't necessary.
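For readers following the listing, the extra mulps/addps instructions look like one Newton-Raphson refinement of the hardware estimate, y1 = -0.5 * y0 * (x * y0^2 - 3). The sketch below (plain Python doubles, not part of the original thread) mirrors that sequence step by step. Two things are assumptions: the value of .LCPI0_0 is cut off in the listing and is taken to be -3.0 here, and the "hardware estimate" is simulated by perturbing the exact value by the 1.5 * 2^-12 bound quoted elsewhere in this thread. The names `refine_rsqrt` and `rel_err` are made up for illustration.

```python
import math


def refine_rsqrt(x, y0):
    """One Newton-Raphson step for 1/sqrt(x), ordered like the asm listing."""
    t = y0 * y0       # mulps xmm2, xmm1   -> y0^2
    t = t * x         # mulps xmm2, xmm0   -> x * y0^2
    t = t + (-3.0)    # addps xmm2, [.LCPI0_0]  (assumed to hold -3.0)
    h = y0 * (-0.5)   # mulps xmm1, [.LCPI0_1]  (-0.5, as shown in the listing)
    return h * t      # mulps xmm1, xmm2


def rel_err(x, y):
    """Relative error of y as an approximation of 1/sqrt(x)."""
    exact = 1.0 / math.sqrt(x)
    return abs(y - exact) / exact


x = 2.0
# Simulated worst-case estimate: exact value scaled by (1 + 1.5 * 2^-12).
y0 = (1.0 / math.sqrt(x)) * (1.0 + 1.5 * 2.0 ** -12)
y1 = refine_rsqrt(x, y0)

print("estimate relative error:", rel_err(x, y0))
print("refined relative error: ", rel_err(x, y1))
```

Under these assumptions a relative error e in the estimate becomes roughly 1.5 * e^2 after the step, so the refinement is what brings the result close to single precision; dropping it leaves only the raw estimate.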

Is there a way to enable -enable-unsafe-fp-math in the IR?

topperc commented 6 years ago

Try llc -enable-unsafe-fp-math