llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.41k stars 12.15k forks source link

[SLPVectorizer] Vectorization of non-consecutive loads #43189

Open LebedevRI opened 5 years ago

LebedevRI commented 5 years ago
Bugzilla Link 43844
Version trunk
OS Linux
CC @alexey-bataev,@dtemirbulatov,@fhahn,@RKSimon,@vporpo

Extended Description

include

std::array<int, 16> evenodd(const std::array<const int, 16> in) { std::array<int, 16> tmp; for(int i = 0, j = 0; i < 16 && j < 16; i++) { tmp[j] = in[i];

    // First fill all even values, then all odd
    j += 2;
    if(j == 16)
        j = 1;
}

return tmp;

} results in scalar code

while it could be something like

define <16 x i32> @​evenodd(<16 x i32> %in) { %t = shufflevector <16 x i32> %in, <16 x i32> undef,

<16 x i32> ret <16 x i32> %t } https://godbolt.org/z/9Mjrqa
LebedevRI commented 3 years ago

mentioned in issue llvm/llvm-bugzilla-archive#43882

alexey-bataev commented 5 years ago

It just isn't 100% clear to me how all that will work together in the end, it should still result in a few vector loads there, with some shuffling afterwards.

Yes, this is exactly what I meant. At first, the accesses must be sorted, then the cost model shall also check if the loads can be vectorized somehow (shuffled or masked loads for interleaved/non-consecutive memory loads and, possibly, stores) and then vectorization with shuffles if required.

LebedevRI commented 5 years ago

And one more: https://godbolt.org/z/oojLYZ

So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474

This might be something i can fix.

Most probably, can be fixed with https://reviews.llvm.org/D57059

Looking forward to that. Not sure what effect that will have.

Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.

Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.

Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.

  1. Sorting of the instructions when trying to find consecutive stores only.
  2. The vectorization of loads also requires additional work for non-consecutive loads.

Can we simply vectorize consecutive loads first, before doing everything else and ignoring cost model? (for X86 it says the cost is +1 for such approach). Then we'd only need to fix the insertvector handling (#43882 ) I already looked in that direction and have a patch. I guess not?

No, I don't think this the right approach. Cool, thanks for confirming my understanding.

The somewhat more smart approach would be to replace the entirety of the buildtree_rec code that handles load vectorization and gives up if not consequtive with such "look at all loads, sort them, vectorize, see what vectorized loads contribute to the result, then produce shuffles to get the needed lanes from appropriate loads"

No, we just need to add the additional processing for possibly non-consecutive loads. Sorting, cost model and vectorization must be aware of such possible combinations.

It just isn't 100% clear to me how all that will work together in the end, it should still result in a few vector loads there, with some shuffling afterwards.

alexey-bataev commented 5 years ago

And one more: https://godbolt.org/z/oojLYZ

So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474

This might be something i can fix.

Most probably, can be fixed with https://reviews.llvm.org/D57059

Looking forward to that. Not sure what effect that will have.

Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.

Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.

Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.

  1. Sorting of the instructions when trying to find consecutive stores only.
  2. The vectorization of loads also requires additional work for non-consecutive loads.

Can we simply vectorize consecutive loads first, before doing everything else and ignoring cost model? (for X86 it says the cost is +1 for such approach). Then we'd only need to fix the insertvector handling (#43882 ) I already looked in that direction and have a patch. I guess not?

No, I don't think this the right approach.

The somewhat more smart approach would be to replace the entirety of the buildtree_rec code that handles load vectorization and gives up if not consequtive with such "look at all loads, sort them, vectorize, see what vectorized loads contribute to the result, then produce shuffles to get the needed lanes from appropriate loads"

No, we just need to add the additional processing for possibly non-consecutive loads. Sorting, cost model and vectorization must be aware of such possible combinations.

LebedevRI commented 5 years ago

And one more: https://godbolt.org/z/oojLYZ

So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474

This might be something i can fix.

Most probably, can be fixed with https://reviews.llvm.org/D57059

Looking forward to that. Not sure what effect that will have.

Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.

Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.

Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.

  1. Sorting of the instructions when trying to find consecutive stores only.
  2. The vectorization of loads also requires additional work for non-consecutive loads.

Can we simply vectorize consecutive loads first, before doing everything else and ignoring cost model? (for X86 it says the cost is +1 for such approach). Then we'd only need to fix the insertvector handling (#43882 ) I already looked in that direction and have a patch. I guess not?

The somewhat more smart approach would be to replace the entirety of the buildtree_rec code that handles load vectorization and gives up if not consequtive with such "look at all loads, sort them, vectorize, see what vectorized loads contribute to the result, then produce shuffles to get the needed lanes from appropriate loads"

alexey-bataev commented 5 years ago

And one more: https://godbolt.org/z/oojLYZ

So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474

This might be something i can fix.

Most probably, can be fixed with https://reviews.llvm.org/D57059

Looking forward to that. Not sure what effect that will have.

Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.

Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.

Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.

  1. Sorting of the instructions when trying to find consecutive stores only.
  2. The vectorization of loads also requires additional work for non-consecutive loads.
LebedevRI commented 5 years ago

And one more: https://godbolt.org/z/oojLYZ

So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474

This might be something i can fix.

Most probably, can be fixed with https://reviews.llvm.org/D57059

Looking forward to that. Not sure what effect that will have.

Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.

Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.

alexey-bataev commented 5 years ago

So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474

This might be something i can fix.

Most probably, can be fixed with https://reviews.llvm.org/D57059

LebedevRI commented 5 years ago

So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L2474

This might be something i can fix.

LebedevRI commented 5 years ago

For 128-bit registers, then something like https://godbolt.org/z/e90LEC should be produced then?

LebedevRI commented 5 years ago

I see, this is due to the

static cl::opt MaxVectorRegSizeOption("slp-max-reg-size", cl::init(128), cl::Hidden, cl::desc("Attempt to vectorize for this register size in bits"));

LebedevRI commented 5 years ago

For reference, it vectorizes if the numElts=8 https://godbolt.org/z/QshWic But not for 16 https://godbolt.org/z/v4MviU