Open zamazan4ik opened 6 years ago
mentioned in issue llvm/llvm-project#34959
There's a proposal to convert pow(x, 0.25) to sqrt(sqrt(x)) here: https://reviews.llvm.org/D49306
GCC implements many optimizations of this kind in match.pd: https://github.com/gcc-mirror/gcc/blob/07b69d3f1cd3dd8ebb0af1fbff95914daee477d2/gcc/match.pd
Yes, a long chain of sqrts might be slower than pow, but the two-sqrt case, sqrt(sqrt(x)), is very likely to be quicker.
Some GPUs might be able to do this with a dedicated pow instruction, and a soft pow implementation might avoid a bottleneck on a sqrt unit, but trying to compare the costs of hardware instructions AND libm implementations is likely to be a nightmare.
Consider this code:
#include <cmath>
double test(double a)
{
    return sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a)))))));
}
clang (trunk) with '--std=c++17 -O3 -march=native -ffast-math' generates:
test(double): # @test(double)
        vsqrtsd xmm0, xmm0, xmm0
        vsqrtsd xmm0, xmm0, xmm0
        vsqrtsd xmm0, xmm0, xmm0
        vsqrtsd xmm0, xmm0, xmm0
        vsqrtsd xmm0, xmm0, xmm0
        vsqrtsd xmm0, xmm0, xmm0
        vsqrtsd xmm0, xmm0, xmm0
        ret
but gcc(trunk) with '--std=c++17 -O3 -march=native -ffast-math' generates:
test(double):
        vandpd xmm0, xmm0, XMMWORD PTR .LC1[rip]
        vmovsd xmm1, QWORD PTR .LC0[rip]
        jmp __pow_finite
.LC0:
        .long 0
        .long 1065353216
.LC1:
        .long 4294967295
        .long 2147483647
        .long 0
        .long 0
And here I am sure that pow is faster than a chain of sqrt calls.
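To make the gcc listing easier to read, here is a small sketch that decodes the two constants. It assumes IEEE 754 binary64 doubles and the little-endian word order used on x86-64; the helper names (double_from_words, gcc_exponent, apply_lc1_mask) are made up for illustration and are not part of gcc or LLVM. Under those assumptions .LC0 decodes to 1/128, the exponent matching seven nested square roots, and the low 64 bits of .LC1 are the mask that clears the sign bit, so the tail call looks like pow(fabs(a), 1.0/128).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Rebuild a double from the two 32-bit words emitted with .long.
// Assumption: little-endian layout, so the low word is listed first.
double double_from_words(std::uint32_t low, std::uint32_t high)
{
    const std::uint64_t bits = (static_cast<std::uint64_t>(high) << 32) | low;
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// .LC0 in the listing: .long 0 followed by .long 1065353216.
double gcc_exponent()
{
    return double_from_words(0u, 1065353216u);
}

// Apply the low 64 bits of .LC1 (.long 4294967295, .long 2147483647)
// the way vandpd does for the scalar lane: this clears the sign bit.
double apply_lc1_mask(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= (static_cast<std::uint64_t>(2147483647u) << 32) | 4294967295u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Calling gcc_exponent() should give 0.0078125, and apply_lc1_mask(-3.5) should give 3.5, which is consistent with reading the vandpd as fabs.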
Is calling pow actually faster? I'd think that the optimization here might actually go the other way?
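One way to get a rough answer is to time both forms. The following is only a sketch, not a rigorous benchmark: the function names and the time_ns_per_call helper are hypothetical, and the numbers will depend heavily on the CPU, the libm in use, and the compiler flags (the listings above were built with -ffast-math, which may also rewrite either function).

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Seven nested square roots, as in the test case from this report.
double sqrt_chain7(double a)
{
    return std::sqrt(std::sqrt(std::sqrt(std::sqrt(
           std::sqrt(std::sqrt(std::sqrt(a)))))));
}

// The form gcc appears to emit: pow(fabs(a), 1/128).
double pow_128th(double a)
{
    return std::pow(std::fabs(a), 1.0 / 128.0);
}

// Average wall-clock nanoseconds per call of f. The input changes on
// every iteration and the results are summed into `sink`, so the calls
// cannot simply be hoisted out of the loop or discarded.
template <typename F>
double time_ns_per_call(F f, std::size_t iterations, double& sink)
{
    double x = 1.0;
    double sum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        sum += f(x);
        x += 1.0;
    }
    const auto stop = std::chrono::steady_clock::now();
    sink = sum;
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / static_cast<double>(iterations);
}
```

A caller would run time_ns_per_call(sqrt_chain7, n, sink) and time_ns_per_call(pow_128th, n, sink) with the same n and compare the two averages; the two sinks should agree closely, which doubles as a sanity check that both functions compute the same value.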
Extended Description
clang(trunk) with '--std=c++17 -O3 -march=native -ffast-math' flags for this code:
generates this assembly:
But there is a formula: sqrt(sqrt(a)) == pow(a, 1/4), and it can be compiled in a faster way.
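The identity can be checked numerically with a short sketch like the one below; the function names are hypothetical. Two caveats apply. First, in C++ the exponent has to be written as 0.25 (or 1.0/4), because the expression 1/4 is integer division and evaluates to 0, which makes pow return 1. Second, the two forms agree for positive finite inputs only up to rounding, and under IEEE 754 rules they differ for inputs such as -0.0 and negative infinity, which is presumably why the rewrite is only considered under -ffast-math.

```cpp
#include <cmath>

// Left-hand side of the identity.
double sqrt_twice(double a)
{
    return std::sqrt(std::sqrt(a));
}

// Right-hand side, with the exponent written as a floating-point value.
double pow_quarter(double a)
{
    return std::pow(a, 0.25);
}

// Relative difference between two results, 0 when both are zero.
double relative_difference(double x, double y)
{
    const double scale = std::fmax(std::fabs(x), std::fabs(y));
    return scale == 0.0 ? 0.0 : std::fabs(x - y) / scale;
}
```

For a positive input such as 10.0, relative_difference(sqrt_twice(10.0), pow_quarter(10.0)) should be within a few units of machine epsilon.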