iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Missing patterns to canonicalize the vectorized result of tensor.unpack #17593

Open hanhanW opened 1 month ago

hanhanW commented 1 month ago
func.func @unpack(%arg0: tensor<1x5x2x64xf32>) -> tensor<2x320xf32> {
  %0 = tensor.empty() : tensor<2x320xf32>
  %unpack = tensor.unpack %arg0 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [2, 64] into %0 : tensor<1x5x2x64xf32> -> tensor<2x320xf32>
  return %unpack : tensor<2x320xf32>
}
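For reference, the semantics of this unpack can be sketched in NumPy (a hypothetical illustration, not IREE code): the packed tensor<1x5x2x64xf32> holds inner tiles of size [2, 64], and unpacking interleaves them back into the 2x320 result via a transpose followed by a reshape.

```python
import numpy as np

def unpack(packed):
    # Sketch of tensor.unpack for inner_dims_pos = [0, 1], inner_tiles = [2, 64]:
    # [o0, o1, t0, t1] -> [o0, t0, o1, t1], then collapse each tile into its dim.
    o0, o1, t0, t1 = packed.shape
    return packed.transpose(0, 2, 1, 3).reshape(o0 * t0, o1 * t1)

packed = np.arange(1 * 5 * 2 * 64).reshape(1, 5, 2, 64)
unpacked = unpack(packed)
assert unpacked.shape == (2, 320)
# Element mapping: unpacked[o0*2 + t0, o1*64 + t1] == packed[o0, o1, t0, t1]
assert unpacked[1, 4 * 64 + 63] == packed[0, 4, 1, 63]
```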

In the direct vectorization path, the unpack op is lowered to transfer_read + transpose + shape_cast after tiling, producing the snippet below. The transpose + shape_cast pair is a no-op because

  1. The transpose op itself is a no-op: it only permutes unit dimensions.
  2. The shape_cast then casts the vector back to the original data layout.
func.func @unpack_dispatch_0_unpack_f32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
  %cst = arith.constant 0.000000e+00 : f32
  %c320 = arith.constant 320 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x5x2x64xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<2x320xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %2 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_x]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_x]
  scf.for %arg0 = %2 to %c320 step %3 {
    %4 = affine.apply affine_map<(d0) -> (d0 floordiv 64)>(%arg0)
    %5 = flow.dispatch.tensor.load %0, offsets = [0, %4, 0, 0], sizes = [1, 1, 2, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x5x2x64xf32>> -> tensor<1x1x2x64xf32>
    %6 = vector.transfer_read %5[%c0, %c0, %c0, %c0], %cst {in_bounds = [true, true, true, true]} : tensor<1x1x2x64xf32>, vector<1x1x2x64xf32>
    %7 = vector.transpose %6, [0, 2, 1, 3] : vector<1x1x2x64xf32> to vector<1x2x1x64xf32>
    %8 = vector.shape_cast %7 : vector<1x2x1x64xf32> to vector<2x64xf32>
    %9 = tensor.empty() : tensor<2x64xf32>
    %10 = vector.transfer_write %8, %9[%c0, %c0] {in_bounds = [true, true]} : vector<2x64xf32>, tensor<2x64xf32>
    flow.dispatch.tensor.store %10, %1, offsets = [0, %arg0], sizes = [2, 64], strides = [1, 1] : tensor<2x64xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x320xf32>>
  }
  return
}
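The no-op condition above can be sketched as a small predicate (a hypothetical helper, not existing MLIR API): a transpose moves no data when the permutation keeps all non-unit dimensions in their original relative order, so the result is a pure reshape (i.e., a vector.shape_cast).

```python
def is_nop_transpose(shape, perm):
    """True if transposing `shape` by `perm` leaves the underlying data
    untouched, i.e. only unit dimensions change position."""
    non_unit = [p for p in perm if shape[p] != 1]
    return non_unit == sorted(non_unit)

# The transpose in the snippet above: vector<1x1x2x64xf32>, perm [0, 2, 1, 3].
assert is_nop_transpose([1, 1, 2, 64], [0, 2, 1, 3])      # only a unit dim moves
assert not is_nop_transpose([1, 2, 2, 64], [0, 2, 1, 3])  # swaps two non-unit dims
```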
hanhanW commented 1 month ago

@pashu123 can you help fix it? What we want here is to convert a no-op transpose op into a shape_cast op. You can add the pattern to https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp; please create a new method (e.g., populateTransposeFoldingPatterns), because we want to use it in vector shape optimization, not in transpose lowering.

(We can move it to the vector.transpose canonicalization patterns if it proves useful. People have different opinions about the canonical form, so I suggest putting it in LowerVectorTranspose.cpp for now.)
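For the IR above, the fold is easy to check numerically with a NumPy sketch (an illustration only): transposing by [0, 2, 1, 3] and then shape-casting to 2x64 gives exactly the same data as shape-casting the original vector directly, so the pair can be replaced by a single shape_cast.

```python
import numpy as np

v = np.arange(1 * 1 * 2 * 64, dtype=np.float32).reshape(1, 1, 2, 64)
# vector.transpose [0, 2, 1, 3] followed by vector.shape_cast to 2x64 ...
lowered = v.transpose(0, 2, 1, 3).reshape(2, 64)
# ... matches casting the original vector directly (the folded form).
folded = v.reshape(2, 64)
assert np.array_equal(lowered, folded)
```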

pashu123 commented 1 month ago

Sounds good! Thanks for the info.