[vectorization] Inefficient use of SIMD instructions on x86 processors #53507

Open zephyr111 opened 2 years ago

zephyr111 commented 2 years ago

Hello,

The following simple code produces pretty inefficient assembly with the flags -O3 -mavx2 -mfma -ffast-math, whatever version of Clang is used. This can be seen on GodBolt.

void computeBlockSlow(double* bs, double* ba, double* bb, long si, long sh, long sw, long lda, long ldb)
{
    double s[4] = {0.0};
    long sihw_vect_max = (si*sh*sw)/4*4;

    for(long ihw = 0 ; ihw < sihw_vect_max ; ihw += 4)
        for(long i = 0 ; i < 4 ; ++i)
            s[i] += ba[ihw+i]*bb[ihw+i];

    bs[0] = (s[0] + s[2]) + (s[1] + s[3]);
}

The automatic vectorization produces FMA operations working on XMM registers instead of YMM, and it also uses vunpckhpd instructions for no apparent reason. This is the case for all versions from Clang 5.0 to Clang 13.0. Note that the use of the __restrict keyword does not visibly change the outcome.

The recent trunk version of Clang on GodBolt (commit 2f18b02d) succeeds in using YMM registers, but it makes use of many expensive vperm2f128 and vunpcklpd instructions.

It is possible to perform a much better vectorization using SIMD intrinsics. Here is an example (note that the loop should additionally be unrolled about 4 times so as to mitigate the latency of the FMA instructions; see the sketch after the code below):

#include <x86intrin.h>

void computeBlockFast(double* bs, double* ba, double* bb, long si, long sh, long sw, long lda, long ldb)
{
    __m256d s = _mm256_set1_pd(0);
    long sihw_vect_max = (si*sh*sw)/4*4;

    for(long ihw = 0 ; ihw < sihw_vect_max ; ihw += 4)
        s = _mm256_fmadd_pd(_mm256_loadu_pd(ba+ihw), _mm256_loadu_pd(bb+ihw), s);

    __m128d tmp = _mm_add_pd(_mm256_extractf128_pd(s, 0), _mm256_extractf128_pd(s, 1));
    bs[0] = _mm_cvtsd_f64(_mm_hadd_pd(tmp, tmp));
}
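
For reference, here is a minimal sketch of the unrolling mentioned above, using four independent accumulators so the FMA latency chains overlap. The function name computeBlockFastUnrolled and the explicit tail loop are my own additions for illustration, not part of the original report:

#include <x86intrin.h>

void computeBlockFastUnrolled(double* bs, double* ba, double* bb, long si, long sh, long sw, long lda, long ldb)
{
    __m256d s0 = _mm256_setzero_pd();
    __m256d s1 = _mm256_setzero_pd();
    __m256d s2 = _mm256_setzero_pd();
    __m256d s3 = _mm256_setzero_pd();
    long sihw_vect_max = (si*sh*sw)/4*4;
    long sihw_unroll_max = sihw_vect_max/16*16;
    long ihw = 0;

    // Main loop: 4 independent accumulators hide the FMA latency.
    for( ; ihw < sihw_unroll_max ; ihw += 16)
    {
        s0 = _mm256_fmadd_pd(_mm256_loadu_pd(ba+ihw),    _mm256_loadu_pd(bb+ihw),    s0);
        s1 = _mm256_fmadd_pd(_mm256_loadu_pd(ba+ihw+4),  _mm256_loadu_pd(bb+ihw+4),  s1);
        s2 = _mm256_fmadd_pd(_mm256_loadu_pd(ba+ihw+8),  _mm256_loadu_pd(bb+ihw+8),  s2);
        s3 = _mm256_fmadd_pd(_mm256_loadu_pd(ba+ihw+12), _mm256_loadu_pd(bb+ihw+12), s3);
    }

    // Remaining 4-element chunks.
    for( ; ihw < sihw_vect_max ; ihw += 4)
        s0 = _mm256_fmadd_pd(_mm256_loadu_pd(ba+ihw), _mm256_loadu_pd(bb+ihw), s0);

    // Horizontal reduction of the 4 accumulators.
    __m256d s = _mm256_add_pd(_mm256_add_pd(s0, s1), _mm256_add_pd(s2, s3));
    __m128d tmp = _mm_add_pd(_mm256_extractf128_pd(s, 0), _mm256_extractf128_pd(s, 1));
    bs[0] = _mm_cvtsd_f64(_mm_hadd_pd(tmp, tmp));
}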

When a register blocking strategy is applied manually, the generated code is even worse: it makes use of slow gather instructions instead of packed loads. For more information about this more complex example, please read this Stack Overflow post.

Note that similar issues also appear with ICC and GCC.

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

topperc commented 2 years ago

The inner loop of 4 iterations gets fully unrolled before the vectorizer runs. This creates 4 separate scalar loads, each moving 4 elements ahead on the next iteration of the outer loop, along with 4 scalar FMAs.

With llvm trunk, the loop vectorizer vectorizes each of the FMAs using those strided accesses. So one FMA will work on elements 0, 4, 8, 12, one will work on elements 1, 5, 9, 13, etc. All the shuffles are just trying to rearrange the loaded data into that order.
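
To make the pattern concrete, here is roughly what the outer loop body looks like after full unrolling, written as conceptual C over the variables of computeBlockSlow (not the actual IR):

for (long ihw = 0; ihw < sihw_vect_max; ihw += 4) {
    s[0] += ba[ihw+0]*bb[ihw+0];   // accumulator 0 sees elements 0, 4, 8, ...
    s[1] += ba[ihw+1]*bb[ihw+1];   // accumulator 1 sees elements 1, 5, 9, ...
    s[2] += ba[ihw+2]*bb[ihw+2];   // accumulator 2 sees elements 2, 6, 10, ...
    s[3] += ba[ihw+3]*bb[ihw+3];   // accumulator 3 sees elements 3, 7, 11, ...
}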

With the inner scalar loop removed, the vectorizer produces a loop with 4 vector loads and 4 vector FMAs.
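
For illustration, a version with the inner scalar loop removed might look like the following rewrite of the original source. This is my own sketch, assuming a single flat reduction is acceptable, which -ffast-math lets the vectorizer split into several vector accumulators on its own:

void computeBlockFlat(double* bs, double* ba, double* bb, long si, long sh, long sw, long lda, long ldb)
{
    double s = 0.0;
    long sihw_vect_max = (si*sh*sw)/4*4;

    // Single reduction chain; the loop vectorizer handles the widening
    // and interleaving itself under -ffast-math.
    for(long ihw = 0 ; ihw < sihw_vect_max ; ++ihw)
        s += ba[ihw]*bb[ihw];

    bs[0] = s;
}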