Open LebedevRI opened 5 years ago
mentioned in issue llvm/llvm-bugzilla-archive#43882
It just isn't 100% clear to me how all that will work together in the end, it should still result in a few vector loads there, with some shuffling afterwards.
Yes, this is exactly what I meant. At first, the accesses must be sorted, then the cost model shall also check if the loads can be vectorized somehow (shuffled or masked loads for interleaved/non-consecutive memory loads and, possibly, stores) and then vectorization with shuffles if required.
And one more: https://godbolt.org/z/oojLYZ
So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474
This might be something i can fix.
Most probably, can be fixed with https://reviews.llvm.org/D57059
Looking forward to that. Not sure what effect that will have.
Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.
Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.
Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.
- Sorting of the instructions when trying to find consecutive stores only.
- The vectorization of loads also requires additional work for non-consecutive loads.
Can we simply vectorize consecutive loads first, before doing everything else and ignoring cost model? (for X86 it says the cost is +1 for such approach). Then we'd only need to fix the insertvector handling (#43882 ) I already looked in that direction and have a patch. I guess not?
No, I don't think this the right approach. Cool, thanks for confirming my understanding.
The somewhat more smart approach would be to replace the entirety of the buildtree_rec code that handles load vectorization and gives up if not consequtive with such "look at all loads, sort them, vectorize, see what vectorized loads contribute to the result, then produce shuffles to get the needed lanes from appropriate loads"
No, we just need to add the additional processing for possibly non-consecutive loads. Sorting, cost model and vectorization must be aware of such possible combinations.
It just isn't 100% clear to me how all that will work together in the end, it should still result in a few vector loads there, with some shuffling afterwards.
And one more: https://godbolt.org/z/oojLYZ
So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474
This might be something i can fix.
Most probably, can be fixed with https://reviews.llvm.org/D57059
Looking forward to that. Not sure what effect that will have.
Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.
Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.
Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.
- Sorting of the instructions when trying to find consecutive stores only.
- The vectorization of loads also requires additional work for non-consecutive loads.
Can we simply vectorize consecutive loads first, before doing everything else and ignoring cost model? (for X86 it says the cost is +1 for such approach). Then we'd only need to fix the insertvector handling (#43882 ) I already looked in that direction and have a patch. I guess not?
No, I don't think this the right approach.
The somewhat more smart approach would be to replace the entirety of the buildtree_rec code that handles load vectorization and gives up if not consequtive with such "look at all loads, sort them, vectorize, see what vectorized loads contribute to the result, then produce shuffles to get the needed lanes from appropriate loads"
No, we just need to add the additional processing for possibly non-consecutive loads. Sorting, cost model and vectorization must be aware of such possible combinations.
And one more: https://godbolt.org/z/oojLYZ
So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474
This might be something i can fix.
Most probably, can be fixed with https://reviews.llvm.org/D57059
Looking forward to that. Not sure what effect that will have.
Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.
Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.
Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.
- Sorting of the instructions when trying to find consecutive stores only.
- The vectorization of loads also requires additional work for non-consecutive loads.
Can we simply vectorize consecutive loads first, before doing everything else and ignoring cost model? (for X86 it says the cost is +1 for such approach). Then we'd only need to fix the insertvector handling (#43882 ) I already looked in that direction and have a patch. I guess not?
The somewhat more smart approach would be to replace the entirety of the buildtree_rec code that handles load vectorization and gives up if not consequtive with such "look at all loads, sort them, vectorize, see what vectorized loads contribute to the result, then produce shuffles to get the needed lanes from appropriate loads"
And one more: https://godbolt.org/z/oojLYZ
So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474
This might be something i can fix.
Most probably, can be fixed with https://reviews.llvm.org/D57059
Looking forward to that. Not sure what effect that will have.
Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.
Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.
Tried the patched version, still not vectorize. There are 2 problems, actually, with this (and other) test.
And one more: https://godbolt.org/z/oojLYZ
So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474
This might be something i can fix.
Most probably, can be fixed with https://reviews.llvm.org/D57059
Looking forward to that. Not sure what effect that will have.
Even regardless, i wonder if the load vectorization should be more smart. Instead of immediately giving up (hard-bailing up, it does not even attempt to actually form gather! :/), why not try to look at all loads, see if any of those vectorizable can be shuffled to produce the vector in question.
Also, what about load "widening" - if we load 0'th and 2th element, but not 1 and 3, we'll gather. But there may be deferenceability info present (not nessesairly the existing attribute, it's limited) that would allow to vector-load 4 elements.
So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/ 0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/ SLPVectorizer.cpp#L2474
This might be something i can fix.
Most probably, can be fixed with https://reviews.llvm.org/D57059
So far the roadblock appears to be: LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n"); https://github.com/llvm/llvm-project/blob/0de262d7189c68897e8328d891d3daaf3aab3156/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L2474
This might be something i can fix.
For 128-bit registers, then something like https://godbolt.org/z/e90LEC should be produced then?
I see, this is due to the
static cl::opt
For reference, it vectorizes if the numElts=8 https://godbolt.org/z/QshWic But not for 16 https://godbolt.org/z/v4MviU
Extended Description
include
std::array<int, 16> evenodd(const std::array<const int, 16> in) { std::array<int, 16> tmp; for(int i = 0, j = 0; i < 16 && j < 16; i++) { tmp[j] = in[i];
} results in scalar code
while it could be something like
define <16 x i32> @evenodd(<16 x i32> %in) { %t = shufflevector <16 x i32> %in, <16 x i32> undef,
<16 x i32>