[AArch64][SVE] Case where loop-reroll is useful to vectorize (TSVC s351)

m-saito-fj commented 7 months ago

Original code:

#define LEN 32000
#define LEN2 256
static int ntimes = 200000;

float a[LEN], b[LEN], c[LEN], d[LEN], e[LEN];
float aa[LEN2][LEN2], bb[LEN2][LEN2], cc[LEN2][LEN2], dd[LEN2][LEN2];

int dummy(float[LEN], float[LEN], float[LEN], float[LEN], float[LEN],
          float[LEN2][LEN2], float[LEN2][LEN2], float[LEN2][LEN2], float);

int s351()
{
        float alpha = c[0];
        for (int nl = 0; nl < 8*ntimes; nl++) {
                for (int i = 0; i < LEN; i += 5) {
                        a[i] += alpha * b[i];
                        a[i + 1] += alpha * b[i + 1];
                        a[i + 2] += alpha * b[i + 2];
                        a[i + 3] += alpha * b[i + 3];
                        a[i + 4] += alpha * b[i + 4];
                }
                dummy(a, b, c, d, e, aa, bb, cc, 0.);
        }
        return 0;
}

Option: -Ofast -march=armv8.2-a+sve

With the original code, loop-vectorize only interleaves the loop (VF=1, IC=2); it does not vectorize it.
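
As a rough mental model (a sketch, not the code LLVM actually emits), VF=1 with IC=2 amounts to running two iterations of the scalar loop per trip, with no vector instructions involved. The helper names `s351_step` and `s351_interleaved2` are made up for this illustration, and `n` is assumed to be a multiple of 5, as LEN is in this issue.

```c
/* One iteration of the original s351 inner loop body, starting at index i. */
static inline void s351_step(float *a, const float *b, float alpha, int i)
{
    a[i]     += alpha * b[i];
    a[i + 1] += alpha * b[i + 1];
    a[i + 2] += alpha * b[i + 2];
    a[i + 3] += alpha * b[i + 3];
    a[i + 4] += alpha * b[i + 4];
}

/* Hypothetical scalar picture of VF=1, IC=2: two original iterations
 * (10 elements) per trip, followed by a scalar remainder loop.
 * Assumes n is a multiple of 5. */
static void s351_interleaved2(float *a, const float *b, float alpha, int n)
{
    int i = 0;
    for (; i + 10 <= n; i += 10) {
        s351_step(a, b, alpha, i);      /* original iteration k     */
        s351_step(a, b, alpha, i + 5);  /* original iteration k + 1 */
    }
    for (; i < n; i += 5)               /* leftover iteration, if any */
        s351_step(a, b, alpha, i);
}
```

Every operation here is still a scalar multiply-add, which is why this form does not benefit from SVE.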

The original code, manually rerolled:

int s351()
{
        float alpha = c[0];
        for (int nl = 0; nl < 8*ntimes; nl++) {
                for (int i = 0; i < LEN; i += 1) {
                        a[i] += alpha * b[i];
                }
                dummy(a, b, c, d, e, aa, bb, cc, 0.);
        }
        return 0;
}

With the manually rerolled code, loop-vectorize does vectorize the loop (VF=vscale x 4, IC=2). I am filing this issue as a case where loop-reroll would be useful.
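
The manual reroll preserves behavior here because LEN (32000) is a multiple of 5, so the five statements together touch every index exactly once. A small equivalence check might look like the sketch below. The function names and test data are hypothetical, and the values are chosen so that every result is exactly representable in float, which means fused multiply-add contraction cannot change the comparison.

```c
#define LEN 32000  /* same length as in the issue; divisible by 5 */

/* Unrolled-by-5 form, as in the original s351 kernel. */
static void saxpy_unrolled5(float *a, const float *b, float alpha, int n)
{
    for (int i = 0; i < n; i += 5) {
        a[i]     += alpha * b[i];
        a[i + 1] += alpha * b[i + 1];
        a[i + 2] += alpha * b[i + 2];
        a[i + 3] += alpha * b[i + 3];
        a[i + 4] += alpha * b[i + 4];
    }
}

/* Rerolled form: one element per iteration. */
static void saxpy_rerolled(float *a, const float *b, float alpha, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += alpha * b[i];
}

/* Returns 1 if both forms give identical results on LEN elements, else 0. */
static int reroll_matches_unrolled(void)
{
    static float a1[LEN], a2[LEN], b[LEN];

    for (int i = 0; i < LEN; i++) {
        a1[i] = a2[i] = (float)(i % 7);   /* small integers */
        b[i] = (float)(i % 3) * 0.5f;     /* 0.0, 0.5, 1.0  */
    }

    saxpy_unrolled5(a1, b, 2.0f, LEN);
    saxpy_rerolled(a2, b, 2.0f, LEN);

    for (int i = 0; i < LEN; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

Calling `reroll_matches_unrolled()` from a test driver returns 1. For a trip count that is not a multiple of 5, the unrolled form would read and write past `n`, so the two forms would no longer be interchangeable.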

davemgreen commented 7 months ago

We've seen a few other cases of this happening; usually the code gets SLP vectorized instead. I think it would make sense to try to recognize the pattern in the loop vectorizer, through a VPlan transform, so that it gets a better cost model and produces better code.

llvm/llvm-project #82218