iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[CPU][DT] Enable vectorization of pack/unpack dispatches for targets without masking support #16406

Open · dcaballe opened this issue 7 months ago

dcaballe commented 7 months ago

We ran into a regression in https://github.com/openxla/iree/pull/16286#issuecomment-1925567807 that impacted DeepLabV3 on Pixel 6. The regression went away when we reverted a PR that drops some unit dims from vector operations: https://github.com/llvm/llvm-project/pull/79752

Investigating this further, the only IR differences I found were in the following dispatch (and similar ones), which contains a tensor.unpack + a generic op with dynamic shapes + a tensor.pack:

```mlir
hal.executable public @main_dispatch_84 {
  hal.executable.variant public @system_elf_arm_64 target(<"llvm-cpu", "system-elf-arm_64", {cpu = "generic", cpu_features = "+neon,+dotprod,+reserve-x18", data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128", link_embedded = false, native_vector_size = 16 : index, target_triple = "aarch64-none-linux-android34", ukernels = "all"}>) {
    hal.executable.export public @main_dispatch_84_unpack_generic_1089x32_f32_pack ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @main_dispatch_84_unpack_generic_1089x32_f32_pack() {
        %cst = arith.constant dense_resource<__elided__> : tensor<32xf32>
        %cst_0 = arith.constant 0.000000e+00 : f32
        %c0 = arith.constant 0 : index
        %c140288 = arith.constant 140288 : index
        %c279680 = arith.constant 279680 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<137x4x8x8xf32>>
        %1 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c140288) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1089x32xf32>>
        %2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c279680) : !flow.dispatch.tensor<writeonly:tensor<137x32x8x1xf32>>
        %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [137, 4, 8, 8], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<137x4x8x8xf32>> -> tensor<137x4x8x8xf32>
        %4 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [1089, 32], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<1089x32xf32>> -> tensor<1089x32xf32>
        %5 = tensor.empty() : tensor<137x32x8x1xf32>
        %6 = tensor.empty() : tensor<1089x32xf32>
        %unpack = tensor.unpack %3 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %6 : tensor<137x4x8x8xf32> -> tensor<1089x32xf32>
        %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%unpack, %cst, %4 : tensor<1089x32xf32>, tensor<32xf32>, tensor<1089x32xf32>) outs(%6 : tensor<1089x32xf32>) {
        ^bb0(%in: f32, %in_1: f32, %in_2: f32, %out: f32):
          %8 = arith.addf %in, %in_1 : f32
          %9 = arith.addf %8, %in_2 : f32
          linalg.yield %9 : f32
        } -> tensor<1089x32xf32>
        %pack = tensor.pack %7 padding_value(%cst_0 : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %5 : tensor<1089x32xf32> -> tensor<137x32x8x1xf32>
        flow.dispatch.tensor.store %pack, %2, offsets = [0, 0, 0, 0], sizes = [137, 32, 8, 1], strides = [1, 1, 1, 1] : tensor<137x32x8x1xf32> -> !flow.dispatch.tensor<writeonly:tensor<137x32x8x1xf32>>
        return
      }
    }
  }
}
```

Without the PR, we generate scalar loads (single-element "vector" loads) for the tensor.unpack because unit dims are not removed. With the PR, the vector loads have a proper size.
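To make the difference concrete, here is a hand-written sketch of the kind of IR I mean; the indices, padding value, and exact shapes are illustrative and not taken from the actual compiler output:

```mlir
// Before the unit-dim-dropping PR: the unpack lowers to single-element
// transfers, i.e. scalar loads wrapped in 1-element vectors.
%v0 = vector.transfer_read %src[%i, %j, %k, %l], %pad
    : tensor<137x4x8x8xf32>, vector<1x1x1x1xf32>

// After https://github.com/llvm/llvm-project/pull/79752: the unit dims are
// dropped and each read covers a full inner tile.
%v1 = vector.transfer_read %src[%i, %j, %k, %c0], %pad
    : tensor<137x4x8x8xf32>, vector<8xf32>
```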

To my surprise, the generic op with dynamic shapes is not vectorized either before or after the PR, because Pixel 6 doesn't support masking. I would expect peeling to be used in these cases, but it looks like, even though the peeling strategy is selected, that decision is overridden at some point in the pipeline.
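For reference, this is the loop structure I would expect peeling to produce for the 1089-sized dimension, sketched by hand (constants and loop bounds are illustrative, not actual compiler output):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c8 = arith.constant 8 : index
%c1088 = arith.constant 1088 : index  // 1089 rounded down to a multiple of 8
%c1089 = arith.constant 1089 : index

// Main loop: full 8-wide vector iterations, no masks needed.
scf.for %i = %c0 to %c1088 step %c8 {
  // vectorized body
}
// Peeled remainder: the leftover iteration runs as scalar code.
scf.for %i = %c1088 to %c1089 step %c1 {
  // scalar body
}
```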

Given that we don't vectorize the generic op, generating scalar loads for the tensor.unpack helps forward the loaded elements into the scalar operations generated for the generic op. When we generate proper vector loads for the tensor.unpack with the PR, that forwarding doesn't happen, and the end result is worse even though we vectorize the tensor.unpack op more efficiently.
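As an illustration of what I mean by forwarding (hand-written, and assuming the usual store-to-load forwarding through the temporary buffer after bufferization; names like %tmp are placeholders):

```mlir
// Element-by-element unpack: each store is immediately reloaded by the scalar
// generic op, so the store/load pair folds away.
memref.store %e, %tmp[%i, %j] : memref<1089x32xf32>
%r = memref.load %tmp[%i, %j] : memref<1089x32xf32>   // forwards to %e

// 8-wide unpack: the vector write does not forward into the scalar loads of
// the generic op, so the round trip through memory stays.
vector.transfer_write %v, %tmp[%i, %j] : vector<8xf32>, memref<1089x32xf32>
%r0 = memref.load %tmp[%i, %j] : memref<1089x32xf32>
```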

Moving forward, I think we should:

- [x] Land the drop-unit-dims PR and deal with the regression later, given that it only impacts Pixel 6.
- [ ] Remove the peeling strategy [override](https://github.com/dcaballe/iree/blob/cfe77c31cbb27c91ac0acb2422f5349e2c88d17f/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L2131-L2135). That override is really surprising for the user. If the peeling strategy is not the appropriate one, we should make sure it's not selected at any point in the pipeline.
- [ ] Come up with a solution to vectorize the dispatch above for targets without masking support. This means having an alternative lowering for pack/unpack ops that doesn't rely on masking (see the sketch below).
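To illustrate the last point: full vectorization of the partial boundary tiles currently relies on masked transfers like the first snippet below, which is exactly what targets without masking support can't lower well; a mask-free alternative would have to express the same partial tile differently, e.g. with peeling as sketched above or by copying through a padded full tile. Again, this is a hand-written sketch, not actual compiler output:

```mlir
// Masked form: the last, partial tile is read under a mask.
%m = vector.create_mask %rem : vector<8xi1>
%t = vector.transfer_read %src[%i], %pad, %m : tensor<?xf32>, vector<8xf32>

// Mask-free form: only full, in-bounds tiles are read as vectors; the
// remainder would be handled by a peeled scalar loop instead.
%t2 = vector.transfer_read %src[%i], %pad {in_bounds = [true]}
    : tensor<?xf32>, vector<8xf32>
```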

cc: @hanhanW, @Max191, @MaheshRavishankar

dcaballe commented 7 months ago

I caught up with Hanhan about this and we agreed to move forward with the re-land and deal with the Pixel 6 regression later, using this issue for tracking. Working on the re-land now.