llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[NEON] Interleave selection problem #56239

Open erickq opened 2 years ago

erickq commented 2 years ago

While working on software optimization recently, I found that armclang performs better than clang on the following program. Comparing the generated code statically, I found that armclang picks an interleave count of 2 for the innermost loop, while clang picks 1.

#include <cstdlib>
#include <iostream>

using namespace std;

#define Nx 4
#define Ny 200
#define Nz 200

#define INDEX3D_wDim(i, j, k, dimX, dimY, dimZ)                                \
  (i) * (dimY) * (dimZ) + (j) * (dimZ) + (k)

void test(float *Hx, float *Hy, float *Hz, const float *Ex, const float *Ey,
             const float *Ez, const float *cLx, const float *cLy,
             const float *cLz, const float *cRx, const float *cRy,
             const float *cRz) {
  for (int i = 1; i < Nx - 1; ++i) {
    for (int j = 1; j < Ny - 1; ++j) {
      int ij = INDEX3D_wDim(i, j, 0, Nx, Ny, Nz);

      for (int k = 1; k < Nz - 1; ++k) {

        int ijk = ij + k;
        int i_j1k = ijk + Nz;
        int ij_k1 = ijk + Nz + 1;
        float dzy = Ez[i_j1k] - Ez[ijk];
        float dyz = Ey[ij_k1] - Ey[ijk];

        Hx[ijk] = cLx[ijk] * Hx[ijk] + cRx[ijk] * (dyz - dzy);
      }
    }
  }
}

Passing the option -mllvm -small-loop-cost=26 to clang sets the interleave count to 2. However, the default value of SmallLoopCost is 20.

Similarly, when SmallLoopCost is set to 20, armclang does not set the interleave count to 2 either. Note that the performance of this test case deteriorates when the interleave count is 1.
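For experiments like the one described above, the interleave count can also be requested per loop with Clang's `#pragma clang loop interleave_count(N)` hint instead of changing the global threshold. The following is only a minimal sketch: `update_row`, its parameters, and the flattened indexing are hypothetical stand-ins for the innermost loop of `test`, and the pragma is a hint that the vectorizer may still decline.

```cpp
// Hypothetical, simplified stand-in for the innermost k-loop of test().
// `stride` plays the role of Nz; all names here are illustrative only.
void update_row(float *Hx, const float *Ey, const float *Ez,
                const float *cLx, const float *cRx, int n, int stride) {
  // Ask Clang's loop vectorizer for an interleave count of 2 on this loop.
  // Compilers that do not know the pragma ignore it (possibly with a warning).
#pragma clang loop interleave_count(2)
  for (int k = 1; k < n - 1; ++k) {
    float dzy = Ez[k + stride] - Ez[k];
    float dyz = Ey[k + stride + 1] - Ey[k];
    Hx[k] = cLx[k] * Hx[k] + cRx[k] * (dyz - dzy);
  }
}
```

Comparing the assembly with and without the pragma is one way to check whether the interleave count alone explains the performance difference.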

My question is: is the default SmallLoopCost too conservative? Could the default value be raised to 25 or 30?
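To make the effect of the threshold concrete, here is a simplified model of how a small-loop heuristic of this kind could map SmallLoopCost to an interleave count. It assumes the count is the largest power of two not exceeding SmallLoopCost / LoopCost, capped by a target maximum; the function names, the cap, and the loop cost of 13 used in the usage note are assumptions for illustration, not values taken from LLVM or from this test case, and the real vectorizer applies further heuristics.

```cpp
#include <algorithm>

// Largest power of two that is <= x (returns 0 for x == 0).
static unsigned powerOf2Floor(unsigned x) {
  unsigned p = 0;
  for (unsigned b = 1; b != 0 && b <= x; b <<= 1)
    p = b;
  return p;
}

// Hypothetical model of a small-loop interleave heuristic.
// smallLoopCost: the threshold (cf. -small-loop-cost).
// loopCost:      the cost model's estimate for one loop iteration.
// maxIC:         an assumed target-specific upper bound on the count.
unsigned smallLoopInterleaveCount(unsigned smallLoopCost, unsigned loopCost,
                                  unsigned maxIC) {
  // Loops at or above the threshold are not "small"; this model simply
  // returns 1 for them and ignores any other interleaving heuristics.
  if (loopCost == 0 || loopCost >= smallLoopCost)
    return 1;
  unsigned ic = powerOf2Floor(smallLoopCost / loopCost);
  return std::max(1u, std::min(maxIC, ic));
}
```

Under this model, an assumed loop cost of 13 yields a count of 1 with a threshold of 20 (20 / 13 rounds down to 1) and a count of 2 with a threshold of 26, which would match the behaviour reported above if the estimated cost is in that range.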

Please help me.

fhahn commented 2 years ago

cc @sdesmalen-arm @david-arm who may be able to help

david-arm commented 2 years ago

Hi @erickq, I tried compiling the function above using the latest armclang with the command "armclang -O3 -mcpu=neoverse-v1 -S /tmp/foo.cpp", but I didn't see any interleaving happening. I just saw a single st1w and whilelo in the loop. Can you confirm which command you used and which version of armclang?

erickq commented 2 years ago

Thanks @david-arm. The command was: armclang++ -O3 -march=armv8-a -msve-vector-bits=128 test.cpp

armclang version: Arm C/C++/Fortran Compiler version 22.0.1 (build number 1630) (based on LLVM 13.0.0)

david-arm commented 2 years ago

Hi @erickq, so this is actually a NEON issue, because you haven't specified the SVE feature in the -march flag. In order to use SVE you have to build with the command `clang++ -O3 -march=armv8-a+sve -msve-vector-bits=128 test.cpp`.

erickq commented 2 years ago

@david-arm Sorry, you are right, it's a NEON issue. I made a mistake. Were you able to reproduce this problem with armclang?