Open llvmbot opened 11 years ago
Bugzilla Link | 17755 |
Version | trunk |
OS | Linux |
Reporter | LLVM Bugzilla Contributor |
The loop in the following function is manually unrolled by a factor of 4, which should give the SLP vectorizer the opportunity to vectorize the loop body.
#include <cstdint>
#include <iostream>

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
  for ( std::uint64_t i = start ; i < end ; i += 4 ) {
    {
      const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
      c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
      c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
      c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
    }
    {
      const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
      c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
    }
  }
}
For the first four loop iterations (with start = 0), the loop variable i and the resulting array access indices are as follows:
iter 0: 0 1 2 3
iter 4: 8 9 10 11
iter 8: 16 17 18 19
iter 12: 24 25 26 27
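The table can be reproduced with a small helper. This is only a sketch: the names remap and indices_for are hypothetical and do not appear in the report. It also makes the underlying pattern explicit: assuming i is a multiple of 4, (i+j)%4 equals j and (i+j)/4 equals i/4 for j = 0..3, so the index reduces to 2*i + j, i.e. four contiguous elements per iteration.

```cpp
#include <array>
#include <cstdint>

// Index expression from the report: position within a group of 4,
// plus a stride of 8 elements between consecutive groups.
// (The name "remap" is hypothetical, chosen for this sketch.)
constexpr std::uint64_t remap(std::uint64_t k) {
    return k % 4 + 8 * (k / 4);
}

// The four indices touched by one iteration of the unrolled loop.
inline std::array<std::uint64_t, 4> indices_for(std::uint64_t i) {
    return {{remap(i + 0), remap(i + 1), remap(i + 2), remap(i + 3)}};
}

// For i a multiple of 4 the expression collapses to 2*i + j.
static_assert(remap(4 + 0) == 2 * 4 + 0, "first lane of iteration i = 4");
static_assert(remap(12 + 3) == 2 * 12 + 3, "last lane of iteration i = 12");
```

Calling indices_for(4) yields the second row of the table (8, 9, 10, 11). If start is not a multiple of 4, the four indices of an iteration are no longer contiguous, so the simplification above does not hold in general.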
For example, on an x86 processor with SSE (128-bit SIMD vectors), the loop body could be vectorized into 2 SIMD loads, 1 SIMD add and 1 SIMD store.
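To make the expected result concrete, here is a hand-written sketch of that vectorized loop body using SSE intrinsics. It is not compiler output, the name bar_sse is hypothetical, and it assumes start is a multiple of 4 so that each iteration touches the contiguous elements 2*i .. 2*i+3. Unaligned loads and stores are used because the report says nothing about alignment.

```cpp
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps)

// Hand-vectorized equivalent of bar(), valid only under the assumption
// that start is a multiple of 4 (hypothetical helper, not from the report).
void bar_sse(std::uint64_t start, std::uint64_t end,
             float * __restrict__ c, const float * __restrict__ a, const float * __restrict__ b)
{
    for (std::uint64_t i = start; i < end; i += 4) {
        // (i+j)%4 + 8*((i+j)/4) == 2*i + j for j = 0..3 when i % 4 == 0
        const std::uint64_t base = 2 * i;
        const __m128 va = _mm_loadu_ps(a + base);      // SIMD load 1
        const __m128 vb = _mm_loadu_ps(b + base);      // SIMD load 2
        _mm_storeu_ps(c + base, _mm_add_ps(va, vb));   // SIMD add + SIMD store
    }
}
```

Under that assumption, each iteration performs exactly the 2 loads, 1 add and 1 store described above, and writes the same elements as the scalar version.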
With current trunk I tried the following on the above example:
clang++ -emit-llvm -S loop_minimal.cc -std=c++11
opt -O3 -vectorize-slp -S loop_minimal.ll
opt -O3 -loop-vectorize -S loop_minimal.ll
opt -O3 -bb-vectorize -S loop_minimal.ll
None of these passes vectorizes the loop body. It seems that the SCEV (scalar evolution) analysis does not understand modulo arithmetic.