Missed optimization in math expression: sqrt(sqrt(a)) == pow(a, 1/4)

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

29.14k stars 12.02k forks source link

Missed optimization in math expression: sqrt(sqrt(a)) == pow(a, 1/4) #34948

Open zamazan4ik opened 6 years ago

zamazan4ik commented 6 years ago


Bugzilla Link	35600
Version	trunk
OS	All
Blocks	llvm/llvm-project#34959
CC	@hfinkel,@RKSimon,@rotateright

Extended Description

clang(trunk) with '--std=c++17 -O3 -march=native -ffast-math' flags for this code:

#include <cmath>

double test(double a)
{
    return sqrt(sqrt(a));
}

generates this assembly:

test(double): # @test(double)
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  ret

But there is formula: sqrt(sqrt(a)) == pow(a, 1/4). And it can be compiled in faster way.

RKSimon commented 2 years ago

mentioned in issue llvm/llvm-project#34959

rotateright commented 6 years ago

There's a proposal to convert pow(x, 0.25) to sqrt(sqrt(x)) here: https://reviews.llvm.org/D49306

zamazan4ik commented 6 years ago

See a lot of optimizations here: https://github.com/gcc-mirror/gcc/blob/07b69d3f1cd3dd8ebb0af1fbff95914daee477d2/gcc/match.pd

RKSimon commented 6 years ago

Yes, many sqrts might be slower - but the 2*sqrt case is very likely to be quicker.

Some GPUs might be able to do this with a dedicated pow instruction, and a soft pow implementation might avoid a bottleneck on a sqrt unit but trying to compare costs of hw instructions AND libm implementations is likely to be a nightmare.

zamazan4ik commented 6 years ago

if you will check this code:

#include <cmath>

double test(double a)
{
    return sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a)))))));
}

clang(trunk) with '--std=c++17 -O3 -march=native -ffast-math' generates here:

test(double): # @test(double)
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  ret

but gcc(trunk) with '--std=c++17 -O3 -march=native -ffast-math' generates:

test(double):
        vandpd  xmm0, xmm0, XMMWORD PTR .LC1[rip]
        vmovsd  xmm1, QWORD PTR .LC0[rip]
        jmp     __pow_finite
.LC0:
        .long   0
        .long   1065353216
.LC1:
        .long   4294967295
        .long   2147483647
        .long   0
        .long   0

And here i am sure that pow is faster than chain of sqrt functions.

hfinkel commented 6 years ago

Is calling pow actually faster? I'd think that the optimization here might actually go the other way?