[AArch64] The option -alu-lsl-fast doesn't split x * 3 into a shift for vectors

vfdff commented 3 months ago

test: https://gcc.godbolt.org/z/jW43j3bfn

void foo (int * __restrict a, int * b, int N) {
  for (int i = 0; i < N; ++i)
  {
    a[4*i + 0] = b[4*i + 0] * 3;
    a[4*i + 1] = b[4*i + 1] + 3;
    a[4*i + 2] = (b[4*i + 2] * 3 + 3);
    a[4*i + 3] = b[4*i + 3] * 3;
  }
}

GCC uses shl (followed by an add) to replace the x * 3, as shown in the GCC kernel body below, while Clang doesn't, even with -Xclang -target-feature -Xclang +alu-lsl-fast.

.L4:
    ld4     {v28.4s - v31.4s}, [x3], 64
    add     v0.4s, v24.4s, v30.4s
    shl     v26.4s, v28.4s, 1
    add     v27.4s, v25.4s, v29.4s
    shl     v29.4s, v31.4s, 1
    add     v26.4s, v26.4s, v28.4s
    shl     v28.4s, v0.4s, 1
    add     v29.4s, v29.4s, v31.4s
    add     v28.4s, v28.4s, v0.4s
    st4     {v26.4s - v29.4s}, [x4], 64
    cmp     x5, x3
    bne     .L4
    and     w7, w2, -4
    cmp     w2, w7
    beq     .L1
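
For reference, the shl/add pairs in the GCC output correspond to rewriting x * 3 as (x << 1) + x in each lane. Below is a minimal scalar sketch of that rewrite in C. The names mul3_shift_add and foo_step_shift_add are hypothetical and not taken from GCC or LLVM. The (b + 1) * 3 form for the third element is my reading of the `add v0.4s, v24.4s, v30.4s` instruction, assuming v24 holds a splat of 1.

```c
#include <stdint.h>

/* Hypothetical helper (not from the issue): x * 3 computed as (x << 1) + x.
   Unsigned arithmetic avoids undefined behavior when shifting negative values. */
static int32_t mul3_shift_add(int32_t x) {
    uint32_t u = (uint32_t)x;
    return (int32_t)((u << 1) + u);
}

/* Scalar model of one iteration of foo's loop body using shift + add.
   a and b point at the 4 elements handled by that iteration. */
static void foo_step_shift_add(int32_t *a, const int32_t *b) {
    a[0] = mul3_shift_add(b[0]);
    a[1] = b[1] + 3;
    a[2] = mul3_shift_add(b[2] + 1);  /* b * 3 + 3 == (b + 1) * 3 */
    a[3] = mul3_shift_add(b[3]);
}
```

The vector code does the same thing on four lanes at a time: shl by 1, then add the original value back.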

llvmbot commented 3 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Allen (vfdff)


vfdff commented 3 months ago

It works fine with scalar types: https://gcc.godbolt.org/z/M1x5MeP5b

llvm / llvm-project

[AArch64] The option -alu-lsl-fast doesn't split x * 3 into a shift for vectors #94572