LLVM misses reciprocal estimate instructions in ISel on ARMv7/8 #27481

Status: Open. Opened by tycho 8 years ago.

tycho commented 8 years ago
Bugzilla Link 27107
Version trunk
OS All
CC @jmolloy,@rengolin,@rotateright

Extended Description

LLVM is missing the opportunity to use VRSQRTE/VRSQRTS (and the other reciprocal estimate instructions):

$ cat rsqrt.c

#include <math.h>

float rsqrtf(float f) { return 1.0f / sqrtf(f); }

$ clang -O3 -mcpu=native -mfpu=neon -mfloat-abi=hard -ffast-math -S -o - rsqrt.c | showasm

rsqrtf:                                 @ @rsqrtf
        vsqrt.f32       s0, s0
        vmov.f32        s2, #1.000000e+00
        vdiv.f32        s0, s2, s0
        bx      lr

Conversely, on x86_64, LLVM does the right thing:

$ clang -O3 -march=core-avx2 -ffast-math -S -o - rsqrt.c | showasm

rsqrtf:                                 # @rsqrtf
        vrsqrtss        %xmm0, %xmm0, %xmm1
        vmulss          %xmm1, %xmm1, %xmm2
        vfmadd213ss     .LCPI1_0(%rip), %xmm0, %xmm2
        vmulss          .LCPI1_1(%rip), %xmm1, %xmm0
        vmulss          %xmm0, %xmm2, %xmm0
        retq

It will even apply this properly to packed vectors if the inputs make sense for it.
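For reference, the kind of estimate-plus-refinement sequence the ARM backend could emit corresponds roughly to the NEON intrinsics below. This is only a sketch (not taken from the original report): vrsqrteq_f32 provides the initial estimate, vrsqrtsq_f32 performs a Newton-Raphson refinement step, and the choice of two refinement steps is an assumption about the accuracy target.

#include <arm_neon.h>

/* Sketch: approximate 1/sqrt(x) for a float32x4_t using VRSQRTE (initial
   estimate) followed by two VRSQRTS Newton-Raphson refinement steps. */
static inline float32x4_t rsqrt_estimate_f32x4(float32x4_t x) {
    float32x4_t est = vrsqrteq_f32(x);
    est = vmulq_f32(est, vrsqrtsq_f32(vmulq_f32(x, est), est));
    est = vmulq_f32(est, vrsqrtsq_f32(vmulq_f32(x, est), est));
    return est;
}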

Right now the lack of reciprocal square root estimates on ARM breaks auto-vectorization for a silly program I wrote, and the hand-written NEON intrinsics version is beating the auto-vectorized variants (because the auto-vectorization fails and everything gets bottlenecked around vsqrt+vdiv).

I looked at implementing this myself but got confused trying to understand the TableGen syntax. It looks like all that's needed is an ARMTargetLowering implementation of TargetLowering::getRsqrtEstimate and TargetLowering::getRecipEstimate.

llvmbot commented 8 years ago

Status update:

  1. Plan for this task:
     a. Try to emit rsqrt.
     b. Benchmark the two strategies.
     c. Make the final decision.

  2. Some preliminary tests:
     a. Running nbody on ARM (dragon) based on the patch https://reviews.llvm.org/D24911
        (it's basically the same as Evandro's patch; not saying it's the final patch).
     b. Running nbody on AArch64 (amd) based on Evandro's patch.

     In the tests, I enabled support for operations of one specific type at a time. For now, we see performance improvements only for v4f32, on both ARM and AArch64, plus an improvement for f32 in "CPU_SOA_tiled" on ARM. The details can be seen in the attachments.

No definite conclusion.
TODO: More investigation.

llvmbot commented 8 years ago

nbody on AArch64

llvmbot commented 8 years ago

nbody on ARM

rotateright commented 8 years ago

Another link for reference: https://reviews.llvm.org/D24816

I've proposed to make reciprocal estimate codegen a function-level attribute as a work-around for the LTO problem.

There are a lot of different topics intertwined in that discussion including the backend's reliance on global settings. Depending on how you set the target defaults for ARM and/or AArch64, the -mrecip interface between clang and LLVM may or may not be relevant.

rengolin commented 8 years ago

Evandro, this is related to your reverted patch, and Jojo was having a look, too. I think we should sync to make sure this works without breaking LTO.

We can try this on ARM while you try on AArch64, or we can both try on different AArch64 first, then port down to ARM, whatever makes the collaboration easier.

We have access to AMD (A57), APM (XGene) and HiKeys (A53). James has some Junos that maybe we can run some tests later on.

rotateright commented 8 years ago

For reference, linking the x86 recip/rsqrt codegen bugs: bug 20900 bug 21385

tycho commented 8 years ago

Hi James,

Thanks for your detailed reply!

I've heard the latency/throughput arguments about reciprocals before, and I understand it's not simple to evaluate the costs and benefits. There is already a facility that lets developers opt in to specific fast-math reciprocal estimates: -mrecip. Right now it just doesn't do anything on ARM. Making the compiler capable of emitting the reciprocal estimate instructions seems orthogonal to making them the default behavior, and the latter question can be evaluated more carefully later down the line. :)
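For illustration, the opt-in on x86 looks something like the line below (the -mrecip=sqrtf spelling is taken from the clang documentation; on ARM the flag currently has no effect):

$ clang -O3 -ffast-math -mrecip=sqrtf -S -o - rsqrt.c | showasm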

And to answer your question, no I don't have a specific "real" program that would benefit from it. I'm sure there are real programs that do vector normalization, but I'm not aware of them (other than Quake, hah).

The test case in question is a physics simulation (n-body). You can find the sources here, if you'd like to try it out (build with 'make CC="clang" NO_CUDA=1' or similar):

https://github.com/tycho/nbody

It relies on OpenMP for parallelism (optionally, disabled with NO_OPENMP=1) and on compiler auto-vectorization for per-thread throughput.

Right now the non-intrinsic variants are horribly slow (Cortex-A15 used in this example):

$ ./nbody --bodies 32 --no-crosscheck --iterations 1 --cycle-after 3
Running simulation with 32768 particles, crosscheck disabled, CPU enabled, 4 threads
      CPU_SOA: 4887.12 ms =  219.708x10^6 interactions/s (   4.39 GFLOPS)
      CPU_SOA: 4787.85 ms =  224.264x10^6 interactions/s (   4.49 GFLOPS)
      CPU_SOA: 4782.07 ms =  224.535x10^6 interactions/s (   4.49 GFLOPS)
CPU_SOA_tiled: 4689.76 ms =  228.954x10^6 interactions/s (   4.58 GFLOPS)
CPU_SOA_tiled: 4695.57 ms =  228.671x10^6 interactions/s (   4.57 GFLOPS)
CPU_SOA_tiled: 4692.97 ms =  228.798x10^6 interactions/s (   4.58 GFLOPS)
  NEON intrin: 1637.50 ms =  655.721x10^6 interactions/s (  13.11 GFLOPS)
  NEON intrin: 1607.94 ms =  667.774x10^6 interactions/s (  13.36 GFLOPS)
  NEON intrin: 1615.91 ms =  664.481x10^6 interactions/s (  13.29 GFLOPS)
      CPU_AOS: 5049.33 ms =  212.650x10^6 interactions/s (   4.25 GFLOPS)
      CPU_AOS: 5047.26 ms =  212.738x10^6 interactions/s (   4.25 GFLOPS)
      CPU_AOS: 5047.67 ms =  212.720x10^6 interactions/s (   4.25 GFLOPS)
CPU_AOS_tiled: 5105.87 ms =  210.296x10^6 interactions/s (   4.21 GFLOPS)
CPU_AOS_tiled: 5110.79 ms =  210.093x10^6 interactions/s (   4.20 GFLOPS)
CPU_AOS_tiled: 5108.08 ms =  210.205x10^6 interactions/s (   4.20 GFLOPS)

The bottleneck in the non-NEON implementations is this:

static inline float rsqrtf(float f) { return 1.0f / sqrtf(f); }

static void bodyBodyInteraction(
    float *fx, float *fy, float *fz,
    float x0, float y0, float z0,
    float x1, float y1, float z1,
    float mass1, float softeningSquared)
{
    float dx = x1 - x0;
    float dy = y1 - y0;
    float dz = z1 - z0;

    float distSqr = dx*dx + dy*dy + dz*dz;
    distSqr += softeningSquared;

    float invDist = rsqrtf(distSqr); // prevents auto-vectorization from doing its job

    float invDistCube = invDist * invDist * invDist;
    float s = mass1 * invDistCube;

    *fx = dx * s;
    *fy = dy * s;
    *fz = dz * s;
}

It's interesting to note that Clang's codegen for the SOA/SOA_tiled variants outperforms the AVX intrinsics variant on x86_64:

$ ./nbody --bodies 128 --no-crosscheck --iterations 1 --cycle-after 3
Running simulation with 131072 particles, crosscheck disabled, CPU enabled, 32 threads
      CPU_SOA:  759.61 ms =  22.617x10^9 interactions/s ( 452.33 GFLOPS)
      CPU_SOA:  748.76 ms =  22.944x10^9 interactions/s ( 458.89 GFLOPS)
      CPU_SOA:  756.07 ms =  22.723x10^9 interactions/s ( 454.45 GFLOPS)
CPU_SOA_tiled:  747.42 ms =  22.986x10^9 interactions/s ( 459.71 GFLOPS)
CPU_SOA_tiled:  747.76 ms =  22.975x10^9 interactions/s ( 459.50 GFLOPS)
CPU_SOA_tiled:  750.04 ms =  22.905x10^9 interactions/s ( 458.11 GFLOPS)
   AVX intrin:  856.90 ms =  20.049x10^9 interactions/s ( 400.98 GFLOPS)
   AVX intrin:  851.30 ms =  20.181x10^9 interactions/s ( 403.62 GFLOPS)
   AVX intrin:  856.85 ms =  20.050x10^9 interactions/s ( 401.00 GFLOPS)
      CPU_AOS: 2195.24 ms =   7.826x10^9 interactions/s ( 156.52 GFLOPS)
      CPU_AOS: 2195.59 ms =   7.825x10^9 interactions/s ( 156.49 GFLOPS)
      CPU_AOS: 2218.18 ms =   7.745x10^9 interactions/s ( 154.90 GFLOPS)
CPU_AOS_tiled: 2099.47 ms =   8.183x10^9 interactions/s ( 163.66 GFLOPS)
CPU_AOS_tiled: 2099.55 ms =   8.183x10^9 interactions/s ( 163.65 GFLOPS)
CPU_AOS_tiled: 2310.61 ms =   7.435x10^9 interactions/s ( 148.70 GFLOPS)

jmolloy commented 8 years ago

Hi Steven,

Thanks for raising this. Actually, just emitting the rsqrt instructions isn't the hard part of this. The hard part is deciding when they are profitable.

For a single rsqrtf outside of a loop (such as your rsqrtf() function) it is almost always more efficient to use the sqrt instruction. Note that our lowering isn't very good at the moment: we're doing:

1.0 / sqrt(x)

What we should be doing is this:

(1.0 / x) * sqrt(x)

(because 1/sqrt(x) == sqrt(x) / x). This would make the division and square root doable in parallel.
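Spelled out in C, the two forms look like this (a scalar sketch, assuming a fast-math build where the reassociation is allowed):

#include <math.h>

/* Current lowering: the divide consumes the sqrt result, so the two
   long-latency operations are strictly serialized. */
float rsqrt_serial(float x)   { return 1.0f / sqrtf(x); }

/* Suggested form: 1.0f / x and sqrtf(x) are independent, so the divide
   and the square root can be in flight at the same time. */
float rsqrt_parallel(float x) { return (1.0f / x) * sqrtf(x); }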

Now, the advantage of rsqrt[es] is that they are not iterative instructions, so they don't block the pipeline and throughput goes up. However, latency goes up too, because more instructions have to be executed. So swapping a sqrt for an rsqrt[es] sequence will often make code slower, since in most cases the sqrts, rsqrts and divisions are on the critical path.

The main case where rsqrt will give a heavy speedup is in a loop where there are no dependencies on the result of the sqrt. For example a trivial loop where we calculate the sqrt of all members of an array independently:

for (i : ...) {
  b[i] = rsqrt(a[i]); // Fast with rsqrte
}

Because it doesn't matter how long the rsqrt takes - there's nothing dependent on its result. Throughput is key in this loop.

The only example I've seen in practice of this kind of loop is vector normalization:

for (v : all_vectors) {
  vector_magnitude[v] = sqrt(a[v].x * a[v].x + a[v].y * a[v].y + a[v].z * a[v].z);
}
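To make the connection to rsqrt explicit: the normalization that follows multiplies each component by the reciprocal of that magnitude, and the iterations are independent of each other, so only throughput matters. A hypothetical C sketch (names and layout are illustrative, not taken from this thread):

#include <math.h>

struct vec3 { float x, y, z; };

/* Normalize an array of vectors in place. No iteration depends on the
   result of another, so replacing 1.0f / sqrtf() with a vrsqrte/vrsqrts
   estimate sequence only has to win on throughput, not latency. */
static void normalize_all(struct vec3 *a, int n) {
    for (int v = 0; v < n; v++) {
        float m2   = a[v].x * a[v].x + a[v].y * a[v].y + a[v].z * a[v].z;
        float rinv = 1.0f / sqrtf(m2);   /* candidate for an rsqrt estimate */
        a[v].x *= rinv;
        a[v].y *= rinv;
        a[v].z *= rinv;
    }
}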

I haven't seen any other examples in the wild where rsqrt would unequivocally improve performance.

I'm interested: other than your trivial testcase, have you seen code where rsqrt will improve performance? Would you mind sharing that?

The other issue is that the tradeoff between sqrt and rsqrt is heavily microarchitectural, which means a lot of benchmarking.

If we wanted to do this, we'd probably be best doing it as a MachineCombiner peephole, as this is a critical path / resource height issue.

Cheers,

James

tycho commented 8 years ago

assigned to @rengolin