On tiling `linalg` ops using the tiling interface, we create `extract_slice` ops that retrieve a slice of the input tensors. These correspond to the footprint of one tile of iterations on the tensor. These slices are defined by an offset, a size and strides. I think the computation of the offset for the slice is incorrect in some cases where the `linalg` op access maps contain non-monotonic `mod` ops.
When tiling the following op with `scf::tileConsumerAndFuseProducersUsingSCF` and a tile size of 1x16:
```mlir
#map1 = affine_map<(d0, d1) -> ((d0 - 1) mod 3, d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @test(%arg1: tensor<3x128xf32>) -> tensor<4x128xf32> {
    %1 = tensor.empty() : tensor<4x128xf32>
    %2 = linalg.generic {indexing_maps = [#map1, #map2], iterator_types = ["parallel", "parallel"]}
        ins(%arg1 : tensor<3x128xf32>) outs(%1 : tensor<4x128xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<4x128xf32>
    return %2 : tensor<4x128xf32>
  }
}
```
we obtain this:
```mlir
#map = affine_map<(d0) -> (d0 - ((d0 - 1) floordiv 3) * 3 - 1)>
#map1 = affine_map<(d0, d1) -> ((d0 - 1) mod 3, d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @test(%arg0: tensor<3x128xf32>) -> tensor<4x128xf32> {
    %c16 = arith.constant 16 : index
    %c1 = arith.constant 1 : index
    %c128 = arith.constant 128 : index
    %c4 = arith.constant 4 : index
    %c0 = arith.constant 0 : index
    %0 = tensor.empty() : tensor<4x128xf32>
    %1 = scf.for %arg1 = %c0 to %c4 step %c1 iter_args(%arg2 = %0) -> (tensor<4x128xf32>) {
      %2 = scf.for %arg3 = %c0 to %c128 step %c16 iter_args(%arg4 = %arg2) -> (tensor<4x128xf32>) {
        %3 = affine.apply #map(%arg1)
        %extracted_slice = tensor.extract_slice %arg0[%3, %arg3] [3, 16] [1, 1] : tensor<3x128xf32> to tensor<3x16xf32>
        %extracted_slice_0 = tensor.extract_slice %arg4[%arg1, %arg3] [1, 16] [1, 1] : tensor<4x128xf32> to tensor<1x16xf32>
        %4 = linalg.generic {indexing_maps = [#map1, #map2], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<3x16xf32>) outs(%extracted_slice_0 : tensor<1x16xf32>) {
        ^bb0(%in: f32, %out: f32):
          linalg.yield %in : f32
        } -> tensor<1x16xf32>
        %inserted_slice = tensor.insert_slice %4 into %arg4[%arg1, %arg3] [1, 16] [1, 1] : tensor<1x16xf32> into tensor<4x128xf32>
        scf.yield %inserted_slice : tensor<4x128xf32>
      }
      scf.yield %2 : tensor<4x128xf32>
    }
    return %1 : tensor<4x128xf32>
  }
}
```
Notably, the offset for `%extracted_slice` is computed as such:
```mlir
#map = affine_map<(d0) -> (d0 - ((d0 - 1) floordiv 3) * 3 - 1)> // (d0 - 1) mod 3
%3 = affine.apply #map(%arg1)
%extracted_slice = tensor.extract_slice %arg0[%3, %arg3] [3, 16] [1, 1] : tensor<3x128xf32> to tensor<3x16xf32>
```
which, when `%arg1 = 0`, gives `%3 = 2`; combined with a slice of size 3 taken from a tensor of size 3 in that dimension, this turns out to be out of bounds.
The problem lies in `mlir/lib/Dialect/Linalg/Utils/Utils.cpp:633`: we apply the access map to the lower bound of the iteration domain of the tile to get the offset. However, this assumes that the lower bound is the argmin of the access map for that dimension. For quasi-affine maps containing `mod` operations, like the one in this example, this is not true.
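To make this concrete, here is a small numeric sketch (plain Python standing in for the affine arithmetic, not MLIR code): evaluating `(d0 - 1) mod 3` over a few consecutive iteration values shows the minimum is not attained at the lower bound.

```python
# Evaluate the quasi-affine expression (d0 - 1) mod 3 over the first few
# values of the iteration dimension d0. Python's % returns a non-negative
# result for a positive modulus, matching affine `mod` semantics.
values = [(d0 - 1) % 3 for d0 in range(4)]  # d0 = 0, 1, 2, 3
print(values)  # [2, 0, 1, 2]

# The lower bound d0 = 0 maps to 2, but the minimum over the range is 0,
# attained at d0 = 1: the lower bound is not the argmin of the map.
```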
One way to solve this would be to compute the lexmin of the image of one tile under the affine map; the size could then be computed from the lexmax (provided the difference with the lexmin is constant, or at least bounded). Note that although this would hopefully make the generated code correct, it would also over-approximate the amount of data actually needed from the tensor, and so reduce the locality benefit of tiling.
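As a rough sketch of that idea (again plain Python, enumerating the tile's iteration points instead of a symbolic lexmin/lexmax; the function name and shape of the API are hypothetical):

```python
from itertools import product

def tile_footprint(access, tile_offsets, tile_sizes):
    """Compute a conservative (offsets, sizes) per result dimension by
    enumerating the tile's iteration points and taking the min/max of
    the access function's image -- a stand-in for lexmin/lexmax."""
    points = product(*(range(o, o + s) for o, s in zip(tile_offsets, tile_sizes)))
    images = [access(*p) for p in points]
    lo = [min(img[d] for img in images) for d in range(len(images[0]))]
    hi = [max(img[d] for img in images) for d in range(len(images[0]))]
    # size = max - min + 1 in each dimension (unit strides assumed).
    return lo, [h - l + 1 for l, h in zip(lo, hi)]

# Access map from the example: (d0, d1) -> ((d0 - 1) mod 3, d1).
access = lambda d0, d1: ((d0 - 1) % 3, d1)

# Tile at (0, 0) of size 1x16: a single d0 value, so the footprint in
# dim 0 is the single point 2 with size 1 -- in bounds, unlike the
# offset-2 / size-3 slice generated today.
print(tile_footprint(access, (0, 0), (1, 16)))  # ([2, 0], [1, 16])
```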