Open newling opened 2 months ago
I think the main concern is whether the generated date layout can be vectorized.
And the linalg.generic's operands seem to be problematic to me.
ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)
I think the main concern is whether the generated date layout can be vectorized. ? And the linalg.generic's operands seem to be problematic to me.
ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)
It looks ok to me, can you explain your concern? I think it's just a matmul with m=n=4 k=8 on contiguous slices of L1 allocations.
I didn't see the indexing maps for the linalg.generic operands, so not sure what has been done. Is there an implicit collapse of dimension for tensor<1x1x1x1x8x4xi32>
?
Oops, here are the maps:
#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1 + d4, d6, d2 + d5, d8)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d4, d5, d6, d3, d8, d7)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d2, d3, d7)>
As an aside, note that %extracted_slice_11
and extra %extracted_slice_12
are contiguous slices, which is exactly what our motivation for packing was in the first place. So that's good.
Now let's look more closely at the linalg.generic (I'm going to try and convince you that it's just a matmul...).
Copying the inner loop from before in a more readable way:
%extracted_slice_11 = tensor.extract_slice
%pack[0, %arg9, %arg13, %arg11, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] :
tensor<1x3x4x6x8xi32> to tensor<1x1x1x4x8xi32>
%extracted_slice_12 = tensor.extract_slice
%pack_9[%arg9, %arg11, %arg13, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] :
tensor<3x3x4x1x8x4xi32> to tensor<1x1x1x1x8x4xi32>
%24 = linalg.generic {indexing_maps = [#map, #map1, #map2],
iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]}
ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
outs(%arg14 : tensor<1x1x4x1x4xi32>)
attrs = {lowering_config = #config, packing_config = #packingconfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
%25 = arith.muli %in, %in_13 : i32
%26 = arith.addi %out, %25 : i32
linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>
What are loop counts for each of the dimension d0
through d8
? Matching the dimensions to the tensor we see:
d0: 1 (first dimension of %arg14 is size 1)
d1: 1 (second dimension of %arg14 is size 1)
d2: 4 (third dimension of %arg14 is size 4)
d3: 1 (fourth dimension of %arg14 is size 1)
d4: 1 (first dimension of %extracted_slice_12)
d5: 1 (second dimension of %extracted_slice_12)
d6: 1 (third dimension of %extracted_slice_12)
d7: 4 (final dimension of %arg14)
d8: 8 (final dimension of %extracted_slice_11)
What's interesting here is that d4
and d5
have loop count one. So #map
is actually a trivial map because it is effectively just
#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d6, d2, d8)>
So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.
So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.
Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...
I think I understand what is happening here. Can you post the method you are using to drop the unit-dimensions. There is an upstream method that allows you to drop unit dimensions and also control what dimensions are being dropped. If you are using this op
%24 = linalg.generic {indexing_maps = [#map, #map1, #map2],
iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]}
ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
outs(%arg14 : tensor<1x1x4x1x4xi32>)
attrs = {lowering_config = #config, packing_config = #packingconfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
%25 = arith.muli %in, %in_13 : i32
%26 = arith.addi %out, %25 : i32
linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>
dropping the inner unit-dimension of tensor<1x1x4x1x4xi32>
is probably causing the issue. You should be able to control the dimensions that you drop. But before that, Why is the result not tensor<1x1x1x4x4xi32>
Or if you post the IR post vectorization, that will give some clues
So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.
Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...
I think previously it had compilation issue when lowered to vector.broadcast, but it's good to check if aievec can handle vector.broadcast now.
Just noticed your comment @MaheshRavishankar and linalg-fold-unit-extent-dims
works perfectly, thank you!
The pass I've written is basically the same as linalg-fold-unit-extent-dims
but uses tensor.extract_slice instead of tensor.expand_shape, and for some reason comprehensive bufferization fails with the extract_slice approach.
Removing all unit dimensions is exactly what I want. It isn't enough to just remove the reduction dimensions, because then the broadcasts doesn't get eliminated (vector.contract verifies that all dimensions to appear in either LHS or RHS).
Use packing between L2 and L1 for convolution.
Using upstream MLIR packing I get the following.
So I currently think we don't need any modification to upstream MLIR.
I'll post a PR with my packing config later. It currently results in other issues cropping up (in air: out of tile memory. in objectFifo: some other crash).