Open dcaballe opened 1 year ago
Thanks for this repro!
This is not a 1-D contiguous load, and such loads are currently marked as gather loads; see getTensorExtractMemoryAccessPattern. This shouldn't be too difficult to support.
-Andrzej
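For intuition only, here is a toy model of the kind of classification mentioned above. This is NOT the actual getTensorExtractMemoryAccessPattern implementation, just a sketch of the underlying idea: an extract is a contiguous 1-D load when its innermost index is the innermost loop index and all outer indices are loop-invariant.

```python
# Toy classifier for a tensor.extract access pattern (illustrative only,
# not the real MLIR analysis). An extract reads contiguously when the
# innermost index is the innermost loop induction variable (stride 1)
# and every outer index is invariant w.r.t. the vectorised loops.

def classify_access(indices, innermost_loop_var, loop_invariant):
    """indices: per-dimension index expressions (strings), innermost last.
    Returns "contiguous" or "gather"."""
    *outer, inner = indices
    if inner == innermost_loop_var and all(i in loop_invariant for i in outer):
        return "contiguous"
    return "gather"

# extract %src[%i, %j] with j the innermost loop: a contiguous row read.
print(classify_access(["i", "j"], "j", {"i"}))         # contiguous
# extract %src[%i // 16, %j]: the outer index is derived from a loop
# variable, so this toy model (like the real analysis) says gather.
print(classify_access(["i // 16", "j"], "j", {"i"}))   # gather
```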
Is there something that needs to be done here?
Apologies, I dropped the ball there :( Thanks for the reminder - I should have some spare cycles in the next week or two to properly triage this.
@banach-space, I was having a look at the memory access pattern code you attached above, shouldn't this be vector<1x2x4xi32> or vector<2x1x4xi32>, or am I missing out on something crucial?
@banach-space, If you haven't started working on this already, I'd like to help by creating a patch for it during my spare time.
I am thinking of starting with generating vector.transfer_read for the following minimal example:
func.func @example_check(%arg0: tensor<80x16xf32>, %extracted_slice : tensor<4x4xf32>) -> tensor<4x4xf32> {
  %1 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]
  } outs(%extracted_slice : tensor<4x4xf32>) {
  ^bb0(%out: f32):
    %2 = linalg.index 0 : index
    %3 = linalg.index 1 : index
    %extracted = tensor.extract %arg0[%2, %3] : tensor<80x16xf32>
    linalg.yield %extracted : f32
  } -> tensor<4x4xf32>
  return %1 : tensor<4x4xf32>
}
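For comparison, the lowering I would hope to get looks roughly like the following. This is a hand-written sketch, not actual vectoriser output; the exact permutation map and in_bounds attributes are assumptions:

```mlir
func.func @example_check(%arg0: tensor<80x16xf32>, %arg1: tensor<4x4xf32>) -> tensor<4x4xf32> {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  // A single 2-D contiguous read instead of materialising a vector of
  // linearised indices and issuing a vector.gather.
  %0 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]}
      : tensor<80x16xf32>, vector<4x4xf32>
  %1 = vector.transfer_write %0, %arg1[%c0, %c0] {in_bounds = [true, true]}
      : vector<4x4xf32>, tensor<4x4xf32>
  return %1 : tensor<4x4xf32>
}
```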
which currently gets lowered to
func.func @example_check(%arg0: tensor<80x16xf32>, %arg1: tensor<4x4xf32>) -> tensor<4x4xf32> {
  %cst = arith.constant dense<[0, 1, 2, 3]> : vector<4xindex>
  %cst_0 = arith.constant dense<true> : vector<4x4xi1>
  %cst_1 = arith.constant dense<0.000000e+00> : vector<4x4xf32>
  %c0 = arith.constant 0 : index
  %cst_2 = arith.constant dense<16> : vector<4x4xindex>
  %0 = vector.broadcast %cst : vector<4xindex> to vector<4x4xindex>
  %1 = arith.muli %0, %cst_2 : vector<4x4xindex>
  %2 = vector.transpose %1, [1, 0] : vector<4x4xindex> to vector<4x4xindex>
  %3 = vector.broadcast %cst : vector<4xindex> to vector<4x4xindex>
  %4 = arith.addi %3, %2 : vector<4x4xindex>
  %5 = vector.gather %arg0[%c0, %c0] [%4], %cst_0, %cst_1 : tensor<80x16xf32>, vector<4x4xindex>, vector<4x4xi1>, vector<4x4xf32> into vector<4x4xf32>
  %6 = vector.transfer_write %5, %arg1[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, tensor<4x4xf32>
  return %6 : tensor<4x4xf32>
}
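To see why this gather is unnecessary, one can emulate the index arithmetic feeding it (broadcast, multiply by the row stride 16, transpose, add). This is a small Python model of that arithmetic only, not of the MLIR semantics in general:

```python
# Emulate the index computation that feeds the vector.gather above.
iota = [0, 1, 2, 3]
rows = [[v * 16 for v in iota] for _ in range(4)]   # %1 = muli(broadcast %cst, 16)
rows_t = [list(col) for col in zip(*rows)]          # %2 = transpose [1, 0]
idx = [[rows_t[i][j] + iota[j] for j in range(4)]   # %4 = addi(broadcast %cst, %2)
       for i in range(4)]

print(idx)
# [[0, 1, 2, 3], [16, 17, 18, 19], [32, 33, 34, 35], [48, 49, 50, 51]]
# Every entry is i*16 + j: row-major contiguous offsets into
# tensor<80x16xf32>, i.e. exactly what one vector.transfer_read would load.
```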
Edit: I have created a first draft pull request for this and would love to get some feedback. Thanks!
Hi @meshtag, thanks for looking into this and apologies for not responding earlier - I was a bit overwhelmed with other activities.
@banach-space, I was having a look at the memory access pattern code you attached above, shouldn't this be vector<1x2x4xi32> or vector<2x1x4xi32>, or am I missing out on something crucial?
Note that that method analyses tensor.extract ops - at that level there are no vectors. There are also no n-D tensors, hence "vector" rather than "tensor" in:
// * an n-D vector, like tensor<1x2x4xi32> or tensor<2x1x4xi32>
Naming is hard :/ Here's very small update that will hopefully disambiguate things - https://github.com/llvm/llvm-project/pull/76797.
As for the original issue, this is the root cause leading to gather loads:
%11 = linalg.index 0 : index
%12 = arith.divui %11, %c16 : index
%13 = linalg.index 2 : index
%extracted = tensor.extract %6[%12, %c0, %c0, %13] : tensor<64x1x1x384xf32>
Specifically, %12 = arith.divui %11, %c16 : index. That's a constant for 16 iterations in the leading dim, but then it goes up by 1. That's not really a contiguous access :( It's not clear to me that the vectoriser would ever be able to do anything clever here.
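To make the non-contiguity concrete, here is the value of that divui across the first iterations of the leading dim (plain Python, just modelling the arithmetic):

```python
# %12 = arith.divui %11, %c16: the row index into tensor<64x1x1x384xf32>
# stays flat for 16 consecutive iterations of the leading dim, then jumps.
rows = [i // 16 for i in range(48)]
print(rows[:20])  # [0, 0, ..., 0, 1, 1, 1, 1] - sixteen 0s, then 1s
# The index is neither constant nor stride-1 across the whole loop, so the
# current analysis (reasonably) classifies the load as a gather.
```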
Instead of trying to fix this in the vectoriser, we should tile the following operation before vectorisation:
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%9 : tensor<1024x384x384xf32>) outs(%7 : tensor<1024x384x384xf32>) {
^bb0(%in: f32, %out: f32):
  %11 = linalg.index 0 : index
  %12 = arith.divui %11, %c16 : index
  %13 = linalg.index 2 : index
  %extracted = tensor.extract %6[%12, %c0, %c0, %13] : tensor<64x1x1x384xf32>
  %14 = arith.subf %cst_0, %extracted : f32
  %15 = arith.mulf %14, %cst_1 : f32
  %16 = arith.divf %in, %cst_2 : f32
  %17 = arith.addf %16, %15 : f32
  linalg.yield %17 : f32
} -> tensor<1024x384x384xf32>
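Why tiling helps: with a tile size of 16 on the leading dim (an assumption suggested by the divisor above), i // 16 becomes constant within each tile, so the extract's outer index is loop-invariant inside a tile and the innermost read is contiguous. A small Python sketch of that reasoning:

```python
# With the leading dim tiled by 16, i = tile_base + ii where 0 <= ii < 16
# and tile_base is a multiple of 16. Then (tile_base + ii) // 16 is the
# same value for every ii in the tile: the row index is tile-invariant.
TILE = 16
for tile_base in range(0, 64, TILE):
    rows_in_tile = {(tile_base + ii) // TILE for ii in range(TILE)}
    assert len(rows_in_tile) == 1  # one row per tile -> loop-invariant index
print("row index is constant within every tile")
```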
@dcaballe WDYT?
@banach-space, If you haven't started working on this already, I'd like to help by creating a patch for it during my spare time.
I am thinking of starting with generating vector.transfer_read for the following minimal example
You are proposing to extend the vectoriser so that it can generate contiguous loads for reading n-D vectors - 2D in your example. TBH, it's not obvious to me whether that would help for this specific issue. But I might be missing something. Do you have any other example where this would be helpful?
@dcaballe If you agree that the access pattern in your example is indeed a gather load, then I suggest closing this and to continue the discussion on contiguous loads of n-D vectors elsewhere (perhaps https://github.com/llvm/llvm-project/pull/76436).
Naming is hard :/ Here's very small update that will hopefully disambiguate things - https://github.com/llvm/llvm-project/pull/76797.
Thanks!
You are proposing to extend the vectoriser so that it can generate contiguous loads for reading n-D vectors - 2D in your example. TBH, it's not obvious to me whether that would help for this specific issue. But I might be missing something. Do you have any other example where this would be helpful?
I did not intend to solve this issue completely with that PR (and I agree that the change would not really benefit this case atm). To me, it seemed like the next simplest case (in continuation of what exists already) where we can try to get vector.transfer_read instead of vector.gather. There is a scenario in one of my downstream projects where we would benefit from that change directly. It's still a work in progress, though.
The tensor.extract below has a contiguous memory access pattern along the adjacent dimension. We should improve the analysis to generate a contiguous vector load.
Repro:
FYI: @banach-space