llvmbot opened 4 years ago
I tested this on two CPUs: Cavium's ThunderX2 99xx and NVIDIA's ARMv8 processor (Xavier).
That loop is doing half the amount of work per iteration vs. the other versions; is ld4 really that slow? If it is, maybe we should disable generating it, at least for 64-bit types.
For reference, what CPU are you using to test?
By disabling the interleaving of accesses, the following code is generated:
795 │184: ldp d2, d3, [x10, #-16]
3472 │ ldur d1, [x9, #-16]
2985 │ fmadd d1, d0, d1, d2
981 │ stur d1, [x10, #-16]
3143 │ sub w8, w8, #0x1
29 │ ldur d1, [x9, #-8]
1160 │ fmadd d1, d0, d1, d3
4113 │ stur d1, [x10, #-8]
763 │ ldp d2, d3, [x10]
92 │ ldr d1, [x9]
6472 │ fmadd d1, d0, d1, d2
1037 │ str d1, [x10]
2229 │ ldr d1, [x9, #8]
9 │ add x9, x9, #0x20
2110 │ fmadd d1, d0, d1, d3
4106 │ str d1, [x10, #8]
1 │ add x10, x10, #0x20
12 │ cmp w8, #0x0
│ ↑ b.gt 184
This already improves on the case where LD4/ST4 are used, even if it is not as efficient as the case where the loads are sequenced together.
> Disabling this pass prevents the ld4/st4 instructions from being generated.
That's technically true, but I'm guessing you still don't get the result you want. (Hard to say for sure without a reproducible testcase.) Two ldps don't produce the same result as a vld4.
The "trick" you're using to optimize the code is recognizing that each of the four unrolled operations is identical. LLVM's loop vectorizer doesn't have code to recognize that, though, so it's generating code for the general case: vld4 to load and rearrange the four lanes, four independent operations, and vst4 to store the four lanes.
See this change that enabled the interleaving of accesses:
https://reviews.llvm.org/D12145
Disabling this pass prevents the ld4/st4 instructions from being generated.
Extended Description
The following generated assembly takes twice as long to execute as a version that only loads registers in pairs (or one by one):
Much better is to load in pairs of scalars (even though that results in more instructions being executed):
This assembly is generated from running a simple DAXPY loop unrolled by a factor of 4. Attached is a snippet of the .ll file.
Two questions. First, the slow code is only generated when opt is passed '-O2'; which pass could be responsible for vectorizing these loads and stores? Second, what is the rationale for generating LD4/ST4 instructions if they execute so much slower than their scalar equivalents?