[SLPVectorizer] clang fails to vectorize a loop with mixed sub/add

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

[SLPVectorizer] clang fails to vectorize a loop with mixed sub/add #64982

Open vfdff opened 1 year ago

vfdff commented 1 year ago

test: https://godbolt.org/z/11TbEx119

#include <stdint.h>

void sub4x4_dct( int16_t d[16], int16_t dct[16], uint8_t *pix1, uint8_t *pix2 )
{
    int16_t tmp[16];

    for( int i = 0; i < 4; i++ )
    {
        int s03 = d[i*4+0] + d[i*4+3];
        int s12 = d[i*4+1] + d[i*4+2];
        int d03 = d[i*4+0] - d[i*4+3];
        int d12 = d[i*4+1] - d[i*4+2];

        tmp[0*4+i] =   s03 +   s12;
        tmp[1*4+i] = 2*d03 +   d12;
        tmp[2*4+i] =   s03 -   s12;
        tmp[3*4+i] =   d03 - 2*d12;
    }

    for( int i = 0; i < 4; i++ )
    {
        int s03 = tmp[i*4+0] + tmp[i*4+3];
        int s12 = tmp[i*4+1] + tmp[i*4+2];
        int d03 = tmp[i*4+0] - tmp[i*4+3];
        int d12 = tmp[i*4+1] - tmp[i*4+2];

        dct[i*4+0] =   s03 +   s12;
        dct[i*4+1] = 2*d03 +   d12;
        dct[i*4+2] =   s03 -   s12;
        dct[i*4+3] =   d03 - 2*d12;
    }
}

vfdff commented 1 year ago

simplified case: https://godbolt.org/z/q59hczjWG

#include <stdint.h>

void sub4x4_dct_simple( int16_t *__restrict d,
                        int16_t *__restrict dct,
                        uint8_t *pix1, uint8_t *pix2 )
{
    for( int i = 0; i < 4; i++ )
    {
        int s03 = d[i*4+0] + d[i*4+3];
        int s12 = d[i*4+1] + d[i*4+2];
        int d03 = d[i*4+0] - d[i*4+3];
        int d12 = d[i*4+1] - d[i*4+2];

        dct[0*4+i] =   s03 +   s12;
        dct[1*4+i] = 2*d03 +   d12;
        dct[2*4+i] =   s03 -   s12;
        dct[3*4+i] =   d03 - 2*d12;
    }
}
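
One way to read the simplified case is lane by lane: after unrolling, each output row `dct[k*4 + 0..3]` is a single 4-wide add or subtract over values loaded with stride 4 from `d`. The sketch below spells that out in plain C and checks it against the original loop. The names `dct_scalar`, `dct_lanes`, and `check_dct_forms`, and the sample input, are made up for illustration and are not from the issue; the unused `pix1`/`pix2` parameters are dropped.

```c
#include <assert.h>
#include <stdint.h>

/* The loop from the simplified case, kept in its original shape. */
static void dct_scalar(const int16_t *d, int16_t *dct)
{
    for (int i = 0; i < 4; i++) {
        int s03 = d[i*4+0] + d[i*4+3];
        int s12 = d[i*4+1] + d[i*4+2];
        int d03 = d[i*4+0] - d[i*4+3];
        int d12 = d[i*4+1] - d[i*4+2];

        dct[0*4+i] = (int16_t)(s03 + s12);
        dct[1*4+i] = (int16_t)(2*d03 + d12);
        dct[2*4+i] = (int16_t)(s03 - s12);
        dct[3*4+i] = (int16_t)(d03 - 2*d12);
    }
}

/* Hand-written lane-wise form (hypothetical): each loop over `lane`
 * corresponds to one 4-wide operation a vectorizer could emit. */
static void dct_lanes(const int16_t *d, int16_t *dct)
{
    int s03[4], s12[4], d03[4], d12[4];

    /* Stride-4 loads: lane `l` reads d[l*4 + k]. */
    for (int lane = 0; lane < 4; lane++) {
        s03[lane] = d[lane*4+0] + d[lane*4+3];
        s12[lane] = d[lane*4+1] + d[lane*4+2];
        d03[lane] = d[lane*4+0] - d[lane*4+3];
        d12[lane] = d[lane*4+1] - d[lane*4+2];
    }

    /* Contiguous stores: one output row per loop. */
    for (int lane = 0; lane < 4; lane++) dct[0*4+lane] = (int16_t)(s03[lane] + s12[lane]);
    for (int lane = 0; lane < 4; lane++) dct[1*4+lane] = (int16_t)(2*d03[lane] + d12[lane]);
    for (int lane = 0; lane < 4; lane++) dct[2*4+lane] = (int16_t)(s03[lane] - s12[lane]);
    for (int lane = 0; lane < 4; lane++) dct[3*4+lane] = (int16_t)(d03[lane] - 2*d12[lane]);
}

/* Asserts that both forms agree on an arbitrary sample input. */
static void check_dct_forms(void)
{
    int16_t in[16], a[16], b[16];
    for (int i = 0; i < 16; i++)
        in[i] = (int16_t)(3*i - 7);   /* arbitrary sample values */

    dct_scalar(in, a);
    dct_lanes(in, b);
    for (int i = 0; i < 16; i++)
        assert(a[i] == b[i]);
}
```

Calling `check_dct_forms()` from a small `main` is enough to exercise both forms; the same comparison could be used to sanity-check a build in which the vectorizer did fire.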

vfdff commented 9 months ago

With PR76461, x86 can apply SLP vectorization to this case, e.g. https://godbolt.org/z/daoqr7Mnq
- x86: opt -passes=slp-vectorizer,verify -mtriple=x86_64-unknown-linux -S test.ll
- arm64: opt -passes=slp-vectorizer,verify -mtriple=aarch64-unknown-linux -S test.ll

llvmbot commented 9 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Allen (vfdff)

* test: https://godbolt.org/z/11TbEx119

```
void sub4x4_dct(int16_t d[16], int16_t dct[16], uint8_t *pix1, uint8_t *pix2 )
{
    int16_t tmp[16];

    for( int i = 0; i < 4; i++ )
    {
        int s03 = d[i*4+0] + d[i*4+3];
        int s12 = d[i*4+1] + d[i*4+2];
        int d03 = d[i*4+0] - d[i*4+3];
        int d12 = d[i*4+1] - d[i*4+2];

        tmp[0*4+i] = s03 + s12;
        tmp[1*4+i] = 2*d03 + d12;
        tmp[2*4+i] = s03 - s12;
        tmp[3*4+i] = d03 - 2*d12;
    }

    for( int i = 0; i < 4; i++ )
    {
        int s03 = tmp[i*4+0] + tmp[i*4+3];
        int s12 = tmp[i*4+1] + tmp[i*4+2];
        int d03 = tmp[i*4+0] - tmp[i*4+3];
        int d12 = tmp[i*4+1] - tmp[i*4+2];

        dct[i*4+0] = s03 + s12;
        dct[i*4+1] = 2*d03 + d12;
        dct[i*4+2] = s03 - s12;
        dct[i*4+3] = d03 - 2*d12;
    }
}
```

vfdff commented 9 months ago

This seems to be a cost-model issue (commit d827865e9). SLP vectorization is generated when we increase the cost of fadd (x86 currently sets a cost of 2 for double fadd):

+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -2900,7 +2900,7 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
         (Ty->getScalarType()->isBFloatTy() && !ST->hasBF16()))
       return 2 * LT.first;
     if (!Ty->getScalarType()->isFP128Ty())
-      return LT.first;
+      return 2 * LT.first;

vfdff commented 9 months ago

Alternatively, add -aarch64-insert-extract-base-cost=1 on AArch64: https://godbolt.org/z/eras8WG91

On x86 the cost is 1; the related code is in X86TTIImpl::getScalarizationOverhead:

  // Get the smaller of the legalized or original pow2-extended number of
  // vector elements, which represents the number of unpacks we'll end up
  // performing.
  unsigned NumElts = LT.second.getVectorNumElements();
  unsigned Pow2Elts =
      PowerOf2Ceil(cast<FixedVectorType>(Ty)->getNumElements());
  Cost += (std::min<unsigned>(NumElts, Pow2Elts) - 1) * LT.first;

AArch64 defaults to 2:

unsigned AArch64Subtarget::getVectorInsertExtractBaseCost() const {
  if (OverrideVectorInsertExtractBaseCost.getNumOccurrences() > 0)
    return OverrideVectorInsertExtractBaseCost;
  return VectorInsertExtractBaseCost;
}
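
To make the two workarounds above easier to compare, here is a toy cost comparison in C. It is only a sketch and is not LLVM's actual formula: the struct, the function names, and every number are invented for illustration. It assumes vectorization is considered profitable when the vector cost (arithmetic plus insert/extract moves) is lower than the scalar cost.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-target costs; NOT taken from LLVM's cost tables. */
struct toy_costs {
    int arith_cost;      /* cost of one arithmetic op, scalar or vector */
    int lane_move_cost;  /* cost of one vector insert/extract */
};

/* Cost of keeping everything scalar. */
static int toy_scalar_cost(struct toy_costs c, int scalar_ops)
{
    assert(scalar_ops >= 0);
    return scalar_ops * c.arith_cost;
}

/* Cost of the vector form: vector arithmetic plus lane moves. */
static int toy_vector_cost(struct toy_costs c, int vector_ops, int lane_moves)
{
    assert(vector_ops >= 0 && lane_moves >= 0);
    return vector_ops * c.arith_cost + lane_moves * c.lane_move_cost;
}

/* Assumed decision rule: vectorize only if it is strictly cheaper. */
static bool toy_slp_profitable(struct toy_costs c, int scalar_ops,
                               int vector_ops, int lane_moves)
{
    return toy_vector_cost(c, vector_ops, lane_moves) <
           toy_scalar_cost(c, scalar_ops);
}
```

With invented counts of 8 scalar ops, 2 vector ops, and 4 lane moves, costs of `{1, 2}` give 10 against 8, so the toy model rejects vectorization. Lowering the lane-move cost to 1 gives 6 against 8, and raising the arithmetic cost to 2 gives 12 against 16, so either change flips the decision in this model. That mirrors the direction of the two experiments in this thread, though the real cost model involves many more terms.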

vfdff commented 4 months ago

GCC has a new improvement idea for this, recorded at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138.