iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[CPU][DT] Enable vectorization of pack/unpack dispatches for targets without masking support #16406

Open · dcaballe opened this issue 7 months ago

dcaballe commented 7 months ago

We ran into a regression in https://github.com/openxla/iree/pull/16286#issuecomment-1925567807 that impacted DeepLabV3 on Pixel 6. The regression went away when we reverted a PR that drops some unit dims from vector operations: https://github.com/llvm/llvm-project/pull/79752

Investigating this further, the only IR differences I found were in the following dispatch (and similar ones), which contains a tensor.unpack + a generic op with dynamic shapes + a tensor.pack:

```mlir
hal.executable public @main_dispatch_84 {
  hal.executable.variant public @system_elf_arm_64 target(<"llvm-cpu", "system-elf-arm_64", {cpu = "generic", cpu_features = "+neon,+dotprod,+reserve-x18", data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128", link_embedded = false, native_vector_size = 16 : index, target_triple = "aarch64-none-linux-android34", ukernels = "all"}>) {
    hal.executable.export public @main_dispatch_84_unpack_generic_1089x32_f32_pack ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @main_dispatch_84_unpack_generic_1089x32_f32_pack() {
        %cst = arith.constant dense_resource<__elided__> : tensor<32xf32>
        %cst_0 = arith.constant 0.000000e+00 : f32
        %c0 = arith.constant 0 : index
        %c140288 = arith.constant 140288 : index
        %c279680 = arith.constant 279680 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<137x4x8x8xf32>>
        %1 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c140288) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1089x32xf32>>
        %2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c279680) : !flow.dispatch.tensor<writeonly:tensor<137x32x8x1xf32>>
        %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [137, 4, 8, 8], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<137x4x8x8xf32>> -> tensor<137x4x8x8xf32>
        %4 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [1089, 32], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<1089x32xf32>> -> tensor<1089x32xf32>
        %5 = tensor.empty() : tensor<137x32x8x1xf32>
        %6 = tensor.empty() : tensor<1089x32xf32>
        %unpack = tensor.unpack %3 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %6 : tensor<137x4x8x8xf32> -> tensor<1089x32xf32>
        %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%unpack, %cst, %4 : tensor<1089x32xf32>, tensor<32xf32>, tensor<1089x32xf32>) outs(%6 : tensor<1089x32xf32>) {
        ^bb0(%in: f32, %in_1: f32, %in_2: f32, %out: f32):
          %8 = arith.addf %in, %in_1 : f32
          %9 = arith.addf %8, %in_2 : f32
          linalg.yield %9 : f32
        } -> tensor<1089x32xf32>
        %pack = tensor.pack %7 padding_value(%cst_0 : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %5 : tensor<1089x32xf32> -> tensor<137x32x8x1xf32>
        flow.dispatch.tensor.store %pack, %2, offsets = [0, 0, 0, 0], sizes = [137, 32, 8, 1], strides = [1, 1, 1, 1] : tensor<137x32x8x1xf32> -> !flow.dispatch.tensor<writeonly:tensor<137x32x8x1xf32>>
        return
      }
    }
  }
}
```

Without the PR, we generate scalar loads (single-element "vector" loads) for the tensor.unpack because unit dims are not removed. With the PR, the vector loads have a proper size.
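To make the difference concrete, here is a hand-written sketch of the kind of IR I mean; the indices, padding value, and exact shapes are illustrative and not taken from the actual compiler output:

```mlir
// Before the unit-dim-dropping PR: the unpack lowers to single-element
// transfers, i.e. scalar loads wrapped in 1-element vectors.
%v0 = vector.transfer_read %src[%i, %j, %k, %l], %pad
    : tensor<137x4x8x8xf32>, vector<1x1x1x1xf32>

// After https://github.com/llvm/llvm-project/pull/79752: the unit dims are
// dropped and each read covers a full inner tile.
%v1 = vector.transfer_read %src[%i, %j, %k, %c0], %pad
    : tensor<137x4x8x8xf32>, vector<8xf32>
```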

To my surprise, the generic op with dynamic shapes is not vectorized either before or after the PR, because Pixel 6 doesn't support masking. I would expect peeling to be used in these cases, but it looks like, even though the peeling strategy is selected, that decision is overridden at some point in the pipeline.
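For reference, this is the loop structure I would expect peeling to produce for the 1089-sized dimension, sketched by hand (constants and loop bounds are illustrative, not actual compiler output):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c8 = arith.constant 8 : index
%c1088 = arith.constant 1088 : index  // 1089 rounded down to a multiple of 8
%c1089 = arith.constant 1089 : index

// Main loop: full 8-wide vector iterations, no masks needed.
scf.for %i = %c0 to %c1088 step %c8 {
  // vectorized body
}
// Peeled remainder: the leftover iteration runs as scalar code.
scf.for %i = %c1088 to %c1089 step %c1 {
  // scalar body
}
```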

Given that we don't vectorize the generic op, generating scalar loads for the tensor.unpack helps forward the loaded elements into the scalar operations generated for the generic op. When we generate proper vector loads for the tensor.unpack with the PR, that forwarding doesn't happen, and the end result is worse even though we vectorize the tensor.unpack op more efficiently.
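As an illustration of what I mean by forwarding (hand-written, and assuming the usual store-to-load forwarding through the temporary buffer after bufferization; names like %tmp are placeholders):

```mlir
// Element-by-element unpack: each store is immediately reloaded by the scalar
// generic op, so the store/load pair folds away.
memref.store %e, %tmp[%i, %j] : memref<1089x32xf32>
%r = memref.load %tmp[%i, %j] : memref<1089x32xf32>   // forwards to %e

// 8-wide unpack: the vector write does not forward into the scalar loads of
// the generic op, so the round trip through memory stays.
vector.transfer_write %v, %tmp[%i, %j] : vector<8xf32>, memref<1089x32xf32>
%r0 = memref.load %tmp[%i, %j] : memref<1089x32xf32>
```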

Moving forward, I think we should:

- [x] Land the drop-unit-dims PR and deal with the regression later, given that it only impacts Pixel 6.
- [ ] Remove the peeling strategy [override](https://github.com/dcaballe/iree/blob/cfe77c31cbb27c91ac0acb2422f5349e2c88d17f/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L2131-L2135). That override is really surprising for the user. If the peeling strategy is not the appropriate one, we should make sure it's not selected at any point in the pipeline.
- [ ] Come up with a solution to vectorize the dispatch above for targets without masking support. This means having an alternative lowering for pack/unpack ops that doesn't rely on masking (see the sketch below).
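To illustrate the last point: full vectorization of the partial boundary tiles currently relies on masked transfers like the first snippet below, which is exactly what targets without masking support can't lower well; a mask-free alternative would have to express the same partial tile differently, e.g. with peeling as sketched above or by copying through a padded full tile. Again, this is a hand-written sketch, not actual compiler output:

```mlir
// Masked form: the last, partial tile is read under a mask.
%m = vector.create_mask %rem : vector<8xi1>
%t = vector.transfer_read %src[%i], %pad, %m : tensor<?xf32>, vector<8xf32>

// Mask-free form: only full, in-bounds tiles are read as vectors; the
// remainder would be handled by a peeled scalar loop instead.
%t2 = vector.transfer_read %src[%i], %pad {in_bounds = [true]}
    : tensor<?xf32>, vector<8xf32>
```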

cc: @hanhanW, @Max191, @MaheshRavishankar

dcaballe commented 7 months ago

I caught up with Hanhan about this and we agreed to move forward with the re-land and deal with the Pixel 6 regression later, using this issue for tracking. Working on the re-land now.