Missed optimization opportunity in SLP vectorization


Bugzilla Link	17755
Version	trunk
OS	Linux
Reporter	LLVM Bugzilla Contributor

The following function contains a loop that is manually unrolled by a factor of 4, which provides an opportunity to SLP-vectorize the loop body.

#include <cstdint>
#include <iostream>

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c, float * __restrict__ a, float * __restrict__ b)
{
  for ( std::uint64_t i = start ; i < end ; i += 4 ) {
    {
      const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
    }
  }
}

For start = 0, the values of the loop variable i and the array indices accessed in the first 4 iterations are as follows:

i = 0:      0  1  2  3
i = 4:      8  9 10 11
i = 8:     16 17 18 19
i = 12:    24 25 26 27
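
The index values listed above can be reproduced with a small helper. This is only a sketch for checking the arithmetic; the names lane_index, iteration_indices and is_contiguous are hypothetical and not part of the original test case.

```cpp
#include <array>
#include <cstdint>

// Index expression used in bar() for lane k (0..3) of iteration i.
std::uint64_t lane_index(std::uint64_t i, std::uint64_t k)
{
  return (i + k) % 4 + 8 * ((i + k) / 4);
}

// The four array indices touched by one iteration of the unrolled loop.
std::array<std::uint64_t, 4> iteration_indices(std::uint64_t i)
{
  return {{ lane_index(i, 0), lane_index(i, 1), lane_index(i, 2), lane_index(i, 3) }};
}

// True if the four indices are consecutive, i.e. the accesses cover one
// contiguous block of 4 floats that fits a 128-bit vector load or store.
bool is_contiguous(const std::array<std::uint64_t, 4>& idx)
{
  return idx[1] == idx[0] + 1 && idx[2] == idx[0] + 2 && idx[3] == idx[0] + 3;
}
```

Note that the indices are only contiguous when i is a multiple of 4; for example, i = 1 gives a block that straddles two groups.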

Each iteration therefore accesses 4 consecutive floats in each of a, b and c. For example, on an x86 processor with SSE (128-bit SIMD vectors), the loop body could be vectorized into 2 SIMD loads, 1 SIMD add and 1 SIMD store.
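
For illustration, here is roughly what the vectorized loop body could look like when written by hand. This is a sketch, not compiler output: it uses the GCC/Clang vector_size extension as a portable stand-in for SSE intrinsics, it assumes start is a multiple of 4, and the names bar_vec and v4f are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// GCC/Clang vector extension: four packed floats (128 bits).
typedef float v4f __attribute__((vector_size(16)));

// Hand-vectorized equivalent of bar(). Assumes start is a multiple of 4,
// so that iteration i touches elements 2*i .. 2*i+3 of each array.
void bar_vec(std::uint64_t start, std::uint64_t end,
             float * __restrict__ c, const float * __restrict__ a, const float * __restrict__ b)
{
  for (std::uint64_t i = start; i < end; i += 4) {
    const std::uint64_t base = 2 * i;        // equals i%4 + 8*(i/4) when i % 4 == 0
    v4f va, vb;
    std::memcpy(&va, a + base, sizeof va);   // unaligned vector load from a
    std::memcpy(&vb, b + base, sizeof vb);   // unaligned vector load from b
    const v4f vc = va + vb;                  // one element-wise vector add
    std::memcpy(c + base, &vc, sizeof vc);   // unaligned vector store to c
  }
}
```

The memcpy calls are used to express unaligned loads and stores without making alignment assumptions about the pointers.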

With current trunk I tried the following on the above example:

clang++ -emit-llvm -S loop_minimal.cc -std=c++11
opt -O3 -vectorize-slp -S loop_minimal.ll
opt -O3 -loop-vectorize -S loop_minimal.ll
opt -O3 -bb-vectorize -S loop_minimal.ll

All three vectorization passes miss the opportunity. It seems that the SCEV analysis does not understand the modulo arithmetic in the index computation.
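
To make explicit what the analysis would have to derive: if i is a multiple of 4, then for k in 0..3 we have (i+k)%4 == k and (i+k)/4 == i/4, so the index reduces to 2*i + k. The following modulo-free variant is a sketch under that assumption (start must be a multiple of 4). The name bar_affine is hypothetical, and I have not checked whether trunk vectorizes this form.

```cpp
#include <cstdint>

// Modulo-free variant of bar(). Assumes start is a multiple of 4; under
// that assumption the index (i+k)%4 + 8*((i+k)/4) equals 2*i + k.
void bar_affine(std::uint64_t start, std::uint64_t end,
                float * __restrict__ c, const float * __restrict__ a, const float * __restrict__ b)
{
  for (std::uint64_t i = start; i < end; i += 4) {
    const std::uint64_t base = 2 * i;   // affine in i, no urem or udiv
    c[base + 0] = a[base + 0] + b[base + 0];
    c[base + 1] = a[base + 1] + b[base + 1];
    c[base + 2] = a[base + 2] + b[base + 2];
    c[base + 3] = a[base + 3] + b[base + 3];
  }
}
```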

llvm/llvm-project issue #18129