Clang generates a vector-length-agnostic (VLA) vector loop for the kernel below, while GCC generates a vector-length-specific (VLS) one. Clang's code appears to be about 50% slower as a result. To reproduce, compile this input with -O3 -mcpu=neoverse-v2 -ffast-math:
__attribute__((aligned(64))) float a[32000], b[32000], c[32000], d[32000], e[32000],
                                   aa[256][256], bb[256][256], cc[256][256], tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000],
          float[256][256], float[256][256], float[256][256], float);

float s176()
{
    int m = 32000/2;
    for (int nl = 0; nl < 4*(100000/32000); nl++) {
        for (int j = 0; j < (32000/2); j++) {
            for (int i = 0; i < m; i++) {
                a[i] += b[i+m-j-1] * c[j];
            }
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}
Clang's codegen:
vs. GCC's codegen:
See also: https://godbolt.org/z/64nhv1o6z
TODO: Root cause analysis.
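
As background for the VLA/VLS distinction mentioned in the description, here is a rough portable model of the two loop shapes in plain C. For a fixed j, the inner loop of s176 is an "a[i] += b[i] * c" update with b offset by m-j-1 and c[j] as a scalar, so the sketch uses that form. Everything here is illustrative: the function names, `VLS_LANES`, and the `lanes` parameter are hypothetical, the inner `l` loops merely stand in for a single vector instruction, and the predicated-tail form is only one common VLA shape. It is not claimed to match what Clang or GCC actually emit for this kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed lane count for the VLS model (e.g. 4 floats in 128 bits). */
enum { VLS_LANES = 4 };

/* Scalar reference: the s176 inner loop for one fixed j,
   with b already offset and c holding the scalar c[j]. */
void axpy_scalar(float *a, const float *b, float c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += b[i] * c;
}

/* VLS shape: the lane count is a compile-time constant. The main loop
   handles full chunks and a scalar epilogue handles the remainder. */
void axpy_vls(float *a, const float *b, float c, size_t n) {
    size_t i = 0;
    for (; i + VLS_LANES <= n; i += VLS_LANES)
        for (size_t l = 0; l < VLS_LANES; l++)  /* models one fixed-width vector op */
            a[i + l] += b[i + l] * c;
    for (; i < n; i++)                          /* scalar tail */
        a[i] += b[i] * c;
}

/* VLA shape (assumed predicated form): the lane count is only known at
   run time, and a per-lane predicate masks off the tail, so no separate
   epilogue is needed. */
void axpy_vla(float *a, const float *b, float c, size_t n, size_t lanes) {
    assert(lanes > 0);
    for (size_t i = 0; i < n; i += lanes)
        for (size_t l = 0; l < lanes; l++)      /* models one scalable vector op */
            if (i + l < n)                      /* models the governing predicate */
                a[i + l] += b[i + l] * c;
}
```

All three functions compute the same result for any n; they differ only in how the iteration space is split, which is the aspect a root cause analysis would presumably need to look at (loop overhead, predicate generation, unrolling and interleaving).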