llvmbot opened 4 years ago
I tested this on two CPUs: Cavium's ThunderX2 99xx and NVIDIA's ARMv8 processor (Xavier).
That loop is doing half the amount of work per iteration vs. the other versions; is ld4 really that slow? If it is, maybe we should disable generating it, at least for 64-bit types.
For reference, what CPU are you using to test?
By disabling the interleaving of accesses, the following code is generated:
795 │184: ldp d2, d3, [x10, #-16]
3472 │ ldur d1, [x9, #-16]
2985 │ fmadd d1, d0, d1, d2
981 │ stur d1, [x10, #-16]
3143 │ sub w8, w8, #0x1
29 │ ldur d1, [x9, #-8]
1160 │ fmadd d1, d0, d1, d3
4113 │ stur d1, [x10, #-8]
763 │ ldp d2, d3, [x10]
92 │ ldr d1, [x9]
6472 │ fmadd d1, d0, d1, d2
1037 │ str d1, [x10]
2229 │ ldr d1, [x9, #8]
9 │ add x9, x9, #0x20
2110 │ fmadd d1, d0, d1, d3
4106 │ str d1, [x10, #8]
1 │ add x10, x10, #0x20
12 │ cmp w8, #0x0
│ ↑ b.gt 184
This already improves on the case where LD4/ST4 are used, even if it is not as efficient as the case where the loads are sequenced together.
> Disabling this pass prevents the ld4/st4 instructions from being generated.
That's technically true, but I'm guessing you still don't get the result you want. (Hard to say for sure without a reproducible testcase.) Two ldps don't produce the same result as a vld4.
The "trick" you're using to optimize the code is recognizing that each of the four unrolled operations is identical. LLVM's loop vectorizer doesn't have code to recognize that, though, so it's generating code for the general case: vld4 to load and rearrange the four lanes, four independent operations, and vst4 to store the four lanes.
See this change that enabled the interleaving of accesses:
https://reviews.llvm.org/D12145
Disabling this pass prevents the ld4/st4 instructions from being generated.
Extended Description
The following generated assembly takes twice as long to execute as a version that only loads registers in pairs (or one by one):
Much better is to load in pairs of scalars (even though that results in more instructions being executed):
This assembly is generated from running a simple DAXPY loop unrolled by a factor of 4. Attached is a snippet of the .ll file.
Two questions. First, the slow code is only generated when opt is passed '-O2'; which pass could be responsible for vectorizing these loads and stores? Second, what is the rationale for generating LD4/ST4 instructions if they execute so much slower than their scalar equivalents?