[AArch64] VLA slower than VLS (tsvc, s173)

sjoerdmeijer commented 1 year ago

We are behind a lot compared to GCC. Compile this input with -O3 -mcpu=neoverse-v2 -ffast-math:

__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s173()
{
    int k = 32000/2;
    for (int nl = 0; nl < 10*100000; nl++) {
        for (int i = 0; i < 32000/2; i++) {
            a[i+k] = a[i] + b[i];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}

Clang's codegen:

.LBB0_3:                                //   Parent Loop BB0_2 Depth=1
        add     x9, x19, x8, lsl #2
        add     x10, x20, x8, lsl #2
        ld1w    { z0.s }, p0/z, [x19, x8, lsl #2]
        ld1w    { z2.s }, p0/z, [x20, x8, lsl #2]
        add     x8, x8, x21
        ld1w    { z1.s }, p0/z, [x9, x28, lsl #2]
        ld1w    { z3.s }, p0/z, [x10, x28, lsl #2]
        add     x10, x9, x26
        cmp     x8, x22
        fadd    z0.s, z2.s, z0.s
        fadd    z1.s, z3.s, z1.s
        st1w    { z0.s }, p0, [x9, x23, lsl #2]
        st1w    { z1.s }, p0, [x10, x28, lsl #2]
        b.ne    .LBB0_3

vs. GCC's codegen:

.L3:
        ldr     q31, [x20, x0]
        ldr     q30, [x19, x0]
        fadd    v31.4s, v31.4s, v30.4s
        str     q31, [x21, x0]
        add     x0, x0, 16
        cmp     x0, x28
        bne     .L3

Might be caused by the same underlying issue as: https://github.com/llvm/llvm-project/issues/71524

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)

We are behind a lot compared to GCC. Compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`:

```
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s173()
{
    int k = 32000/2;
    for (int nl = 0; nl < 10*100000; nl++) {
        for (int i = 0; i < 32000/2; i++) {
            a[i+k] = a[i] + b[i];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}
```

Clang's codegen:

```
.LBB0_3:                                //   Parent Loop BB0_2 Depth=1
        add     x9, x19, x8, lsl #2
        add     x10, x20, x8, lsl #2
        ld1w    { z0.s }, p0/z, [x19, x8, lsl #2]
        ld1w    { z2.s }, p0/z, [x20, x8, lsl #2]
        add     x8, x8, x21
        ld1w    { z1.s }, p0/z, [x9, x28, lsl #2]
        ld1w    { z3.s }, p0/z, [x10, x28, lsl #2]
        add     x10, x9, x26
        cmp     x8, x22
        fadd    z0.s, z2.s, z0.s
        fadd    z1.s, z3.s, z1.s
        st1w    { z0.s }, p0, [x9, x23, lsl #2]
        st1w    { z1.s }, p0, [x10, x28, lsl #2]
        b.ne    .LBB0_3
```

vs. GCC's codegen:

```
.L3:
        ldr     q31, [x20, x0]
        ldr     q30, [x19, x0]
        fadd    v31.4s, v31.4s, v30.4s
        str     q31, [x21, x0]
        add     x0, x0, 16
        cmp     x0, x28
        bne     .L3
```

See also: https://godbolt.org/z/9zs65h3aq

Might be caused by the same underlying issue as: https://github.com/llvm/llvm-project/issues/71524

mrdaybird commented 4 months ago

This can now be closed after #95819. Current codegen (https://godbolt.org/z/o6a8zM3E4):

.LBB0_2:                                //   Parent Loop BB0_1 Depth=1
        ldp     q0, q1, [x10]
        ldp     q2, q3, [x8, #-32]
        subs    x9, x9, #16
        fadd    v0.4s, v2.4s, v0.4s
        str     q0, [x10, #64000]
        fadd    v0.4s, v3.4s, v1.4s
        str     q0, [x10, #64016]
        ldp     q0, q1, [x10, #32]
        ldp     q2, q3, [x8], #64
        fadd    v1.4s, v3.4s, v1.4s
        fadd    v0.4s, v2.4s, v0.4s
        str     q1, [x10, #64048]
        str     q0, [x10, #64032]
        add     x10, x10, #64
        b.ne    .LBB0_2

[AArch64] VLA slower than VLS (tsvc, s173) #71525