llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Missed optimization opportunity in SLP vectorization #18129

Open llvmbot opened 11 years ago

llvmbot commented 11 years ago
Bugzilla Link 17755
Version trunk
OS Linux
Reporter LLVM Bugzilla Contributor

The loop in the following function is manually unrolled by a factor of 4, which gives the SLP vectorizer an opportunity to vectorize the loop body.

#include <cstdint>
#include <iostream>

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c, float * __restrict__ a, float * __restrict__ b)
{
  for ( std::uint64_t i = start ; i < end ; i += 4 ) {
    {
      const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
  }
} 

For the first 4 iterations (assuming start = 0), the value of the loop variable i and the resulting array access indices are as follows:

i = 0:     0  1  2  3
i = 4:     8  9 10 11
i = 8:    16 17 18 19
i = 12:   24 25 26 27
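
The index values listed above can be reproduced with a small stand-alone check. This is only a sketch: the helper names remapped_index, indices_for and is_contiguous are hypothetical and not part of the original test case. It also shows that the four indices are consecutive only when i is a multiple of 4.

```cpp
#include <array>
#include <cstdint>

// Index computation copied from the loop body: logical element j is stored
// at position (j % 4) + 8 * (j / 4).
std::uint64_t remapped_index(std::uint64_t j) {
  return j % 4 + 8 * (j / 4);
}

// The four array indices touched by one trip through the loop for a given i.
// Hypothetical helper, introduced here for illustration only.
std::array<std::uint64_t, 4> indices_for(std::uint64_t i) {
  return {{remapped_index(i + 0), remapped_index(i + 1),
           remapped_index(i + 2), remapped_index(i + 3)}};
}

// True when the four indices are consecutive, i.e. a single 4-wide vector
// access could cover them.
bool is_contiguous(const std::array<std::uint64_t, 4>& idx) {
  return idx[1] == idx[0] + 1 && idx[2] == idx[0] + 2 && idx[3] == idx[0] + 3;
}
```

For i = 0, 4, 8, 12 the groups start at 0, 8, 16, 24 and are contiguous. For an i that is not a multiple of 4 (for example i = 2) the group straddles a gap and is not.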

For example, on an x86 processor with SSE (128-bit SIMD vectors), the loop body could be vectorized into 2 SIMD loads, 1 SIMD add, and 1 SIMD store.
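
As a rough illustration of that target, here is a hand-written SSE version of the loop. It is a sketch of what one would hope the vectorizer produces, not actual compiler output. The name bar_sse is hypothetical, and the code assumes start is a multiple of 4 (otherwise the four indices are not contiguous) and uses unaligned loads and stores because the report says nothing about alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hand-vectorized equivalent of bar(), under the assumption that start is a
// multiple of 4 so that each group of four indices is contiguous.
void bar_sse(std::uint64_t start, std::uint64_t end,
             float* __restrict__ c, const float* __restrict__ a,
             const float* __restrict__ b) {
  assert(start % 4 == 0 && "contiguity only holds when i % 4 == 0");
  for (std::uint64_t i = start; i < end; i += 4) {
    // With i % 4 == 0: (i+k) % 4 == k and (i+k) / 4 == i / 4 for k = 0..3,
    // so the four scalar indices are base+0 .. base+3.
    const std::uint64_t base = 8 * (i / 4);
    const __m128 va = _mm_loadu_ps(a + base);     // SIMD load 1
    const __m128 vb = _mm_loadu_ps(b + base);     // SIMD load 2
    _mm_storeu_ps(c + base, _mm_add_ps(va, vb));  // SIMD add, SIMD store
  }
}
```

Each trip through the loop issues exactly the two loads, one add, and one store mentioned above.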

With current trunk I tried the following on the above example:

clang++ -emit-llvm -S loop_minimal.cc -std=c++11
opt -O3 -vectorize-slp -S loop_minimal.ll
opt -O3 -loop-vectorize -S loop_minimal.ll
opt -O3 -bb-vectorize -S loop_minimal.ll

All three invocations miss the opportunity. It seems that the SCEV (scalar evolution) analysis pass does not understand the modulo arithmetic in the index computation.