llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.24k stars 12.07k forks source link

Missed vectorization opportunity #43493

Open davidbolvansky opened 4 years ago

davidbolvansky commented 4 years ago
Bugzilla Link 44148
Version trunk
OS Linux
CC @topperc,@fhahn,@hidekisaito,@RKSimon,@rotateright

Extended Description

#define N 512

int a[N], b[N];

int
foo (int aval)
{
  int res = 0;
  for (int i=0; i<N; i++)
  {
    if (a[i] != 0)
      res = aval;
  }
  return res;
}

Clang -O3 -march=haswell: loop not vectorized [-Rpass-missed=loop-vectorize]

foo(int): # @&#8203;foo(int)
  xor eax, eax
  mov rcx, -2048
.LBB0_1: # =>This Inner Loop Header: Depth=1
  mov edx, dword ptr [rcx + a+2052]
  or edx, dword ptr [rcx + a+2048]
  or edx, dword ptr [rcx + a+2056]
  or edx, dword ptr [rcx + a+2060]
  or edx, dword ptr [rcx + a+2064]
  or edx, dword ptr [rcx + a+2068]
  or edx, dword ptr [rcx + a+2072]
  or edx, dword ptr [rcx + a+2076]
  cmovne eax, edi
  add rcx, 32
  jne .LBB0_1
  ret
a:
  .zero 2048

b:
  .zero 2048

ICC -O3 -march=haswell:

foo(int):
        xor       eax, eax                                      #&#8203;10.11
        vpxor     ymm1, ymm1, ymm1                              #&#8203;13.17
        xor       edx, edx                                      #&#8203;11.3
        vpcmpeqd  ymm0, ymm0, ymm0                              #&#8203;10.11
..B1.2:                         # Preds ..B1.2 ..B1.1
        vpcmpeqd  ymm2, ymm1, YMMWORD PTR [a+rdx*4]             #&#8203;13.17
        add       rdx, 8                                        #&#8203;11.3
        vpxor     ymm3, ymm2, ymm0                              #&#8203;10.11
        vmovmskps ecx, ymm3                                     #&#8203;10.11
        or        eax, ecx                                      #&#8203;10.11
        cmp       rdx, 512                                      #&#8203;11.3
        jb        ..B1.2        # Prob 99%                      #&#8203;11.3
        test      eax, eax                                      #&#8203;16.10
        cmovne    eax, edi                                      #&#8203;16.10
        vzeroupper                                              #&#8203;16.10
        ret                                                     #&#8203;16.10
a:
b:

https://godbolt.org/z/AjRQNY

This loop is similar, also not vectorized:

int
foo (int aval)
{
  int res = 0;
  for (int i=0; i<N; i++)
  {
    if (a[i] != 0)
      res = b[i];
  }
  return res;
}

https://godbolt.org/z/AU_5rf

hidekisaito commented 4 years ago

In OpenMP terminology, this is called conditional lastprivate. It's best to think separately from conditional reduction ----- but I admit that our initial proposal to OpenMP on the conditional lastprivate included "assignment reduction" perspective.

Took us a while to correctly implement this in ICC. Things to watch out for: 1) assignment under condition may not happen at all. 2) assigned value may be used inside the loop (use is still dominated by assignment). 3) need to find out which element in the vector of "privatized" res is last assigned.

Beyond the basic implementation, optimizations are possible for some cases: The first example here, ICC generated ASM code is just checking whether the condition becomes true inside the loop and the actual assignment happens through cmovne after the loop. The same technique cannot be applied when the assigned value is b[i].

davidbolvansky commented 4 years ago

Quite surprised this loop is not vectorized too:

#define BONUS 5
#define T 24

int
foo (int aval)
{
  int res = 0;
  for (int i=0; i<N; i++)
  {
    if (a[i] > T)
      res += BONUS /* bonus */;
  }
  return res;
}

if #define BONUS 1, code is vectorized.. https://godbolt.org/z/SyjmPf

davidbolvansky commented 2 years ago

First example (res = aval) is vectorized now.