llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.43k stars 11.75k forks source link

[mlir][linalg] Tiling: wrong offset computed when modulo operations are present in access maps #111830

Open cferry-AMD opened 1 day ago

cferry-AMD commented 1 day ago

On tiling linalg ops using the tiling interface, we create extract_slice ops that retrieve a slice of the input tensors. These correspond to the footprint of one tile of iterations on the tensor. These slices are defined by an offset, a side and strides. I think the computation of the offset for the slice is incorrect in some cases where the linalg op access maps contain non-monotonic mod ops.

When tiling the following op with scf::tileConsumerAndFuseProducersUsingSCF and a tile size of 1x16:

#map1 = affine_map<(d0, d1) -> ((d0 - 1) mod 3, d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @test(%arg1: tensor<3x128xf32>) -> tensor<4x128xf32> {
    %1 = tensor.empty() : tensor<4x128xf32>
    %2 = linalg.generic {indexing_maps = [#map1, #map2], iterator_types = ["parallel", "parallel"]} 
        ins(%arg1 : tensor<3x128xf32>) outs(%1 : tensor<4x128xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<4x128xf32>
    return %2 : tensor<4x128xf32>
  }
}

we obtain this:

#map = affine_map<(d0) -> (d0 - ((d0 - 1) floordiv 3) * 3 - 1)>
#map1 = affine_map<(d0, d1) -> ((d0 - 1) mod 3, d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @test(%arg0: tensor<3x128xf32>) -> tensor<4x128xf32> {
    %c16 = arith.constant 16 : index
    %c1 = arith.constant 1 : index
    %c128 = arith.constant 128 : index
    %c4 = arith.constant 4 : index
    %c0 = arith.constant 0 : index
    %0 = tensor.empty() : tensor<4x128xf32>
    %1 = scf.for %arg1 = %c0 to %c4 step %c1 iter_args(%arg2 = %0) -> (tensor<4x128xf32>) {
      %2 = scf.for %arg3 = %c0 to %c128 step %c16 iter_args(%arg4 = %arg2) -> (tensor<4x128xf32>) {
        %3 = affine.apply #map(%arg1)
        %extracted_slice = tensor.extract_slice %arg0[%3, %arg3] [3, 16] [1, 1] : tensor<3x128xf32> to tensor<3x16xf32>
        %extracted_slice_0 = tensor.extract_slice %arg4[%arg1, %arg3] [1, 16] [1, 1] : tensor<4x128xf32> to tensor<1x16xf32>
        %4 = linalg.generic {indexing_maps = [#map1, #map2], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<3x16xf32>) outs(%extracted_slice_0 : tensor<1x16xf32>) {
        ^bb0(%in: f32, %out: f32):
          linalg.yield %in : f32
        } -> tensor<1x16xf32>
        %inserted_slice = tensor.insert_slice %4 into %arg4[%arg1, %arg3] [1, 16] [1, 1] : tensor<1x16xf32> into tensor<4x128xf32>
        scf.yield %inserted_slice : tensor<4x128xf32>
      }
      scf.yield %2 : tensor<4x128xf32>
    }
    return %1 : tensor<4x128xf32>
  }
}

Notably, the offset for %extract_slice is computed as such:

#map = affine_map<(d0) -> (d0 - ((d0 - 1) floordiv 3) * 3 - 1)> // (d0 - 1) mod 3
%5 = affine.apply #map(%arg5)
%extracted_slice = tensor.extract_slice %arg0[%3, %arg3] [3, 16] [1, 1] : tensor<3x128xf32> to tensor<3x16xf32>

which, when %arg5 = 0, gives %5 = 2, and with a slice of size 3 of a tensor of slice 3, turns out to be out of bounds.

The problem lies in mlir/lib/Dialect/Linalg/Utils/Utils.cpp:633: we apply the access map to the lower bound of the iteration domain of the tile to get the offset, however this assumes that lower bound is the argmin of the access map for that dimension. For quasi-affine maps containing mod operations like the one in this example, this is not true.

One way to solve this would be to compute the lexmin of the image of one tile by the affine map, and the size could be computed with the lexmax (provided the difference with the lexmin is a constant or is at least bounded). Note that although this would hopefully make the code generated correct, it would also over-approximate the amount of data actually needed from a tensor and so reduce the locality benefit of tiling.

llvmbot commented 1 day ago

@llvm/issue-subscribers-mlir-linalg

Author: Corentin Ferry (cferry-AMD)

On tiling `linalg` ops using the tiling interface, we create `extract_slice` ops that retrieve a slice of the input tensors. These correspond to the footprint of one tile of iterations on the tensor. These slices are defined by an offset, a side and strides. I think the computation of the offset for the slice is incorrect in some cases where the `linalg` op access maps contain non-monotonic `mod` ops. When tiling the following op with `scf::tileConsumerAndFuseProducersUsingSCF` and a tile size of 1x16: ```mlir #map1 = affine_map<(d0, d1) -> ((d0 - 1) mod 3, d1)> #map2 = affine_map<(d0, d1) -> (d0, d1)> module { func.func @test(%arg1: tensor<3x128xf32>) -> tensor<4x128xf32> { %1 = tensor.empty() : tensor<4x128xf32> %2 = linalg.generic {indexing_maps = [#map1, #map2], iterator_types = ["parallel", "parallel"]} ins(%arg1 : tensor<3x128xf32>) outs(%1 : tensor<4x128xf32>) { ^bb0(%in: f32, %out: f32): linalg.yield %in : f32 } -> tensor<4x128xf32> return %2 : tensor<4x128xf32> } } ``` we obtain this: ```mlir #map = affine_map<(d0) -> (d0 - ((d0 - 1) floordiv 3) * 3 - 1)> #map1 = affine_map<(d0, d1) -> ((d0 - 1) mod 3, d1)> #map2 = affine_map<(d0, d1) -> (d0, d1)> module { func.func @test(%arg0: tensor<3x128xf32>) -> tensor<4x128xf32> { %c16 = arith.constant 16 : index %c1 = arith.constant 1 : index %c128 = arith.constant 128 : index %c4 = arith.constant 4 : index %c0 = arith.constant 0 : index %0 = tensor.empty() : tensor<4x128xf32> %1 = scf.for %arg1 = %c0 to %c4 step %c1 iter_args(%arg2 = %0) -> (tensor<4x128xf32>) { %2 = scf.for %arg3 = %c0 to %c128 step %c16 iter_args(%arg4 = %arg2) -> (tensor<4x128xf32>) { %3 = affine.apply #map(%arg1) %extracted_slice = tensor.extract_slice %arg0[%3, %arg3] [3, 16] [1, 1] : tensor<3x128xf32> to tensor<3x16xf32> %extracted_slice_0 = tensor.extract_slice %arg4[%arg1, %arg3] [1, 16] [1, 1] : tensor<4x128xf32> to tensor<1x16xf32> %4 = linalg.generic {indexing_maps = [#map1, #map2], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<3x16xf32>) outs(%extracted_slice_0 : tensor<1x16xf32>) { ^bb0(%in: f32, %out: f32): linalg.yield %in : f32 } -> tensor<1x16xf32> %inserted_slice = tensor.insert_slice %4 into %arg4[%arg1, %arg3] [1, 16] [1, 1] : tensor<1x16xf32> into tensor<4x128xf32> scf.yield %inserted_slice : tensor<4x128xf32> } scf.yield %2 : tensor<4x128xf32> } return %1 : tensor<4x128xf32> } } ``` Notably, the offset for `%extract_slice` is computed as such: ```mlir #map = affine_map<(d0) -> (d0 - ((d0 - 1) floordiv 3) * 3 - 1)> // (d0 - 1) mod 3 %5 = affine.apply #map(%arg5) %extracted_slice = tensor.extract_slice %arg0[%3, %arg3] [3, 16] [1, 1] : tensor<3x128xf32> to tensor<3x16xf32> ``` which, when `%arg5 = 0`, gives `%5 = 2`, and with a slice of size 3 of a tensor of slice 3, turns out to be out of bounds. The problem lies in mlir/lib/Dialect/Linalg/Utils/Utils.cpp:633: we apply the access map to the lower bound of the iteration domain of the tile to get the offset, however this assumes that lower bound is the argmin of the access map for that dimension. For quasi-affine maps containing `mod` operations like the one in this example, this is not true. One way to solve this would be to compute the lexmin of the image of one tile by the affine map, and the size could be computed with the lexmax (provided the difference with the lexmin is a constant or is at least bounded). Note that although this would hopefully make the code generated correct, it would also over-approximate the amount of data actually needed from a tensor and so reduce the locality benefit of tiling.