iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Missing patterns to canonicalize the vectorized result of tensor.unpack #17593

Open hanhanW opened 1 month ago

hanhanW commented 1 month ago
func.func @unpack(%arg0: tensor<1x5x2x64xf32>) -> tensor<2x320xf32> {
  %0 = tensor.empty() : tensor<2x320xf32>
  %unpack = tensor.unpack %arg0 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [2, 64] into %0 : tensor<1x5x2x64xf32> -> tensor<2x320xf32>
  return %unpack : tensor<2x320xf32>
}
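For reference, the semantics of this unpack can be sketched in NumPy (a hypothetical illustration, not IREE code): the packed tensor<1x5x2x64xf32> holds inner tiles of size [2, 64], and unpacking interleaves them back into the 2x320 result via a transpose followed by a reshape.

```python
import numpy as np

def unpack(packed):
    # Sketch of tensor.unpack for inner_dims_pos = [0, 1], inner_tiles = [2, 64]:
    # [o0, o1, t0, t1] -> [o0, t0, o1, t1], then collapse each tile into its dim.
    o0, o1, t0, t1 = packed.shape
    return packed.transpose(0, 2, 1, 3).reshape(o0 * t0, o1 * t1)

packed = np.arange(1 * 5 * 2 * 64).reshape(1, 5, 2, 64)
unpacked = unpack(packed)
assert unpacked.shape == (2, 320)
# Element mapping: unpacked[o0*2 + t0, o1*64 + t1] == packed[o0, o1, t0, t1]
assert unpacked[1, 4 * 64 + 63] == packed[0, 4, 1, 63]
```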

In the direct vectorization path, the unpack op is lowered to transfer_read + transpose + shape_cast after tiling, producing the snippet below. The transpose + shape_cast pair is a no-op because

  1. The transpose op itself is a no-op: it only permutes unit dimensions.
  2. The shape_cast then casts the vector back to the original data layout.
func.func @unpack_dispatch_0_unpack_f32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
  %cst = arith.constant 0.000000e+00 : f32
  %c320 = arith.constant 320 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x5x2x64xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<2x320xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %2 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_x]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_x]
  scf.for %arg0 = %2 to %c320 step %3 {
    %4 = affine.apply affine_map<(d0) -> (d0 floordiv 64)>(%arg0)
    %5 = flow.dispatch.tensor.load %0, offsets = [0, %4, 0, 0], sizes = [1, 1, 2, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x5x2x64xf32>> -> tensor<1x1x2x64xf32>
    %6 = vector.transfer_read %5[%c0, %c0, %c0, %c0], %cst {in_bounds = [true, true, true, true]} : tensor<1x1x2x64xf32>, vector<1x1x2x64xf32>
    %7 = vector.transpose %6, [0, 2, 1, 3] : vector<1x1x2x64xf32> to vector<1x2x1x64xf32>
    %8 = vector.shape_cast %7 : vector<1x2x1x64xf32> to vector<2x64xf32>
    %9 = tensor.empty() : tensor<2x64xf32>
    %10 = vector.transfer_write %8, %9[%c0, %c0] {in_bounds = [true, true]} : vector<2x64xf32>, tensor<2x64xf32>
    flow.dispatch.tensor.store %10, %1, offsets = [0, %arg0], sizes = [2, 64], strides = [1, 1] : tensor<2x64xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x320xf32>>
  }
  return
}
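The no-op condition above can be sketched as a small predicate (a hypothetical helper, not existing MLIR API): a transpose moves no data when the permutation keeps all non-unit dimensions in their original relative order, so the result is a pure reshape (i.e., a vector.shape_cast).

```python
def is_nop_transpose(shape, perm):
    """True if transposing `shape` by `perm` leaves the underlying data
    untouched, i.e. only unit dimensions change position."""
    non_unit = [p for p in perm if shape[p] != 1]
    return non_unit == sorted(non_unit)

# The transpose in the snippet above: vector<1x1x2x64xf32>, perm [0, 2, 1, 3].
assert is_nop_transpose([1, 1, 2, 64], [0, 2, 1, 3])      # only a unit dim moves
assert not is_nop_transpose([1, 2, 2, 64], [0, 2, 1, 3])  # swaps two non-unit dims
```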
hanhanW commented 1 month ago

@pashu123 can you help fix it? What we want here is to convert a no-op transpose op into a shape_cast op. You can add the pattern to https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/Vector/Transforms/LowerVectorTranspose.cpp; please create a new method (e.g., populateTransposeFoldingPatterns), because we want to use it in vector shape optimization, not in transpose lowering.

(We can move it to the vector.transpose canonicalization patterns if it proves useful. People have different opinions about the canonical form, so I suggest putting it in LowerVectorTranspose.cpp for now.)
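For the IR above, the fold is easy to check numerically with a NumPy sketch (an illustration only): transposing by [0, 2, 1, 3] and then shape-casting to 2x64 gives exactly the same data as shape-casting the original vector directly, so the pair can be replaced by a single shape_cast.

```python
import numpy as np

v = np.arange(1 * 1 * 2 * 64, dtype=np.float32).reshape(1, 1, 2, 64)
# vector.transpose [0, 2, 1, 3] followed by vector.shape_cast to 2x64 ...
lowered = v.transpose(0, 2, 1, 3).reshape(2, 64)
# ... matches casting the original vector directly (the folded form).
folded = v.reshape(2, 64)
assert np.array_equal(lowered, folded)
```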

pashu123 commented 1 month ago

Sounds good! Thanks for the info.