llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[AArch64] gather struct load should be reused similar to normal struct load #107345

Open vfdff opened 2 months ago

vfdff commented 2 months ago
llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Allen (vfdff)

* normal struct load case: https://godbolt.org/z/fGYKoM8W3

```c++
for (int i = 0; i < eulers_per_block; i++) {
#pragma clang loop vectorize(enable)
#pragma GCC ivdep
    for (int tid = 0; tid < block_sz; tid++) {
        int index = tid;
        s_ref_real[i][tid] = mdlComplex[index].real();
        s_ref_imag[i][tid] = mdlComplex[index].imag();
    }
}
```

* related assembly code generated by LLVM: this works **fine**, since both the **real** and **imaginary** parts are loaded by a single `ld2w`:

```asm
.LBB0_2:
        ld2w    { z0.s, z1.s }, p0/z, [x22]
        add     x22, x22, x13
        st1w    { z0.s }, p0, [x10, x21, lsl #2]    # r0, r1, ... r7 (assume VScale=2)
        st1w    { z1.s }, p0, [x11, x21, lsl #2]    # i0, i1, ... i7
        add     x21, x21, x12
        cmp     x21, #256
        b.ne    .LBB0_2
```

* gather struct load: https://godbolt.org/z/b5GoT4qqv

```c++
for (int i = 0; i < eulers_per_block; i++) {
#pragma clang loop vectorize(enable)
#pragma GCC ivdep
    for (int tid = 0; tid < block_sz; tid++) {
        int index = indexarr[tid];
        s_ref_real[i][tid] = mdlComplex[index].real();
        s_ref_imag[i][tid] = mdlComplex[index].imag();
    }
}
```

* related assembly code generated by LLVM: it **loads the real and imaginary parts twice**:

```asm
.LBB0_2:
        add     x22, x9, x21, lsl #2
        ld1sw   { z0.d }, p0/z, [x9, x21, lsl #2]   # index.0, index.1, ... index.7 (assume VScale=2)
        ld1sw   { z2.d }, p0/z, [x22, #1, mul vl]   # index.8, index.9, ... index.15
        add     x22, x10, #4
        lsl     z0.d, z0.d, #3
        lsl     z2.d, z2.d, #3
        ld1w    { z1.d }, p0/z, [x10, z0.d]         # r0, i0, r1, ... i3
        ld1w    { z3.d }, p0/z, [x10, z2.d]         # r4, i4, r5, ... i7
        uzp1    z1.s, z1.s, z3.s                    # r0, r1, r2, ... r7
        st1w    { z1.s }, p1, [x11, x21, lsl #2]
        ld1w    { z0.d }, p0/z, [x22, z0.d]         # i0, r1, i1, ... i4 -- can this be reused from the ld1w above?
        ld1w    { z1.d }, p0/z, [x22, z2.d]         # i4, r5, i5, ... i8
        uzp1    z0.s, z0.s, z1.s                    # i0, i1, i2, ... i7
        st1w    { z0.s }, p1, [x12, x21, lsl #2]
        add     x21, x21, x13
        cmp     x21, #256
        b.ne    .LBB0_2
```