
[vectorization] PRE prevents loop vectorization #53443

Open vfdff opened 2 years ago

vfdff commented 2 years ago

For Livermore Kernel 7, clang does not vectorize the loop because the arithmetic in the inner loop is too complex. https://godbolt.org/z/hr7YGoEPq

void kernel_7(void)
{
    for (long l=1 ; l<=loop ; l++ ) {
// #pragma nohazard
// #pragma clang loop vectorize(assume_safety)
        for (long k=0 ; k<n ; k++) {
            x[k] = u[k] + r*( z[k] + r*y[k] ) +
                   t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
                      t*( u[k+6] + r*( u[k+5] + r*u[k+4] ) ) );
        }
    } 

    return;
}

When I delete part of the computation, either u[k+3] + r*( u[k+2] + r*u[k+1] ) or t*( u[k+6] + r*( u[k+5] + r*u[k+4] ) ), clang can vectorize the loop.
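
For illustration, here is a sketch of the reduced variant with the second t*-term dropped (declarations are filled in so the snippet is self-contained; the real ones are in the reproducer at the godbolt link above):

/* Sketch of the reduced kernel: the nested
   t*( u[k+6] + r*( u[k+5] + r*u[k+4] ) ) term is removed.
   The array sizes and element types below are assumptions,
   not taken from the original reproducer. */
#define N 1001
double x[N], y[N], z[N], u[N + 6];
double r, t;
long n, loop;

void kernel_7_reduced(void)
{
    for (long l = 1; l <= loop; l++) {
        for (long k = 0; k < n; k++) {
            x[k] = u[k] + r*( z[k] + r*y[k] ) +
                   t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) );
        }
    }
}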

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-aarch64

efriedma-quic commented 2 years ago

Looks like some limitation in the analysis of PHI nodes in the loop vectorizer. "-mllvm -enable-pre=false" enables vectorization.

vfdff commented 2 years ago

Yes, PHI nodes named *.pre are generated by GVNPass::eliminatePartiallyRedundantLoad; they are not induction-variable PHIs, which prevents vectorization.

void GVNPass::eliminatePartiallyRedundantLoad(
    LoadInst *Load, AvailValInBlkVect &ValuesPerBlock,
    MapVector<BasicBlock *, Value *> &AvailableLoads) {
  for (const auto &AvailableLoad : AvailableLoads) {
    BasicBlock *UnavailableBlock = AvailableLoad.first;
    Value *LoadPtr = AvailableLoad.second;

    auto *NewLoad =
        new LoadInst(Load->getType(), LoadPtr, Load->getName() + ".pre",
                     ...
vfdff commented 2 years ago

option "-mllvm -enable-load-in-loop-pre=false" is more precise for this case, but this pass itself seems no conflict to the vectorization, so need loose the PHI node checking condition in vectorization ?

vfdff commented 2 years ago

A simpler test case: https://godbolt.org/z/os31qYh7e

#define LEN_1D 8000ll
#define LEN_2D 128ll

__attribute__((aligned(64))) float a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D];

void s212(struct args_t * func_args)
{
// statement reordering dependency needing temporary
#pragma clang loop vectorize(assume_safety)
// #pragma GCC ivdep
   for (int i = 0; i < LEN_1D-1; i++) {
       a[i] *= c[i];
       b[i] += a[i + 1] * d[i];
   }
}
vfdff commented 2 years ago

The issue in both of the above cases still exists even with the patch https://reviews.llvm.org/D118642 applied.

For the 2nd case (s212), it is unsafe to swap or sink the accesses across loop iterations:

st a[i]
ld a[i+1]   --\
  ------           unsafe to swap or sink, which may be why neither gcc nor llvm tries to vectorize
st a[i+1]   --/
ld a[i+2]
fhahn commented 2 years ago

This seems related to #53900. With the fixes linked there, LLVM should be able to vectorize kernel_7.

As for s212, we should be able to sink the memory accesses to support the recurrence introduced by GVN PRE (I've got some patches, I'll share them soon).

But for this loop, #pragma clang loop vectorize(assume_safety) may be problematic because of the dependency between a[i] and a[i+1].
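
As a source-level picture of the sinking (a hand-written sketch of the intended reordering, not what the transform literally emits): if the a[i+1] load is performed before the a[i] store, a vectorized body reads all the old a values before any of the stores, which matches the scalar order of the conflicting accesses.

/* Sketch of s212 with the load of a[i+1] moved ahead of the store to
   a[i] (equivalently, the store sunk below the load). Uses the same
   a, b, c, d and LEN_1D declarations as the s212 snippet above. */
void s212_reordered_sketch(void)
{
    for (int i = 0; i < LEN_1D - 1; i++) {
        float a_next = a[i + 1];   /* old value, read before the store group */
        a[i] *= c[i];
        b[i] += a_next * d[i];
    }
}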

vfdff commented 2 years ago

This seems related to #53900. With the fixes linked there, LLVM should be able to vectorize kernel_7.

Thanks very much. I'll wait for your patch to land upstream and then give it a test.

fhahn commented 2 years ago

The first loop is now vectorized with current main, after b8709a9d03f8: https://godbolt.org/z/KEGGjKhGY

vfdff commented 2 years ago

Thanks @fhahn, yes, it works now.

As the 2nd loop is not expected to vectorize, I think this issue can be closed now?

peterwaller-arm commented 2 years ago

Hi @fhahn and @vfdff. I have submitted a patch which helps with Allen's second loop; it introduces a memory dependence check between the load and store and tries to sink the store. I see this loop vectorized with that patch applied: https://reviews.llvm.org/D137159

vfdff commented 9 months ago

For the simplified 2nd case s212_1, it looks similar to s112 in PR71511, but llvm still does not vectorize it.
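
For reference, the s112 kernel from the TSVC suite looks roughly like the following (reconstructed from memory rather than quoted from PR71511, so treat the exact form as an assumption; it reuses the a, b and LEN_1D declarations from the s212 snippet above):

/* Rough reconstruction of TSVC s112 (loop reversal); not quoted
   from PR71511, so the bounds may differ in detail. */
void s112_sketch(void)
{
    for (int i = LEN_1D - 2; i >= 0; i--) {
        a[i + 1] = a[i] + b[i];
    }
}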