iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[CPU] Lots of ops generated in conv + generic + pack dispatch #16775

Open pzread opened 6 months ago

pzread commented 6 months ago

When compiling the example below with:

iree-compile conv_generic_pack.mlir -o /dev/null --iree-hal-target-backends=llvm-cpu --iree-input-type=none --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu --iree-llvmcpu-target-cpu=cascadelake --iree-opt-data-tiling=true --iree-llvmcpu-enable-ukernels=all --mlir-print-ir-after-all 2> conv_generic_pack.dump.mlir
#map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d3)>
func.func @forward_dispatch_12_conv_2d_nhwc_hwcf_1x96x96x192x3x3x48_f32_pack(%arg0: tensor<1x98x98x48xf32>, %arg1: tensor<3x3x48x192xf32>, %arg2: tensor<192xf32>) -> tensor<1x96x6x192x16x1xf32> {
  %cst = arith.constant 1.000000e+00 : f32
  %cst_0 = arith.constant 0.000000e+00 : f32
  %0 = tensor.empty() : tensor<1x96x6x192x16x1xf32>
  %1 = tensor.empty() : tensor<1x96x96x192xf32>
  %2 = linalg.fill ins(%cst_0 : f32) outs(%1 : tensor<1x96x96x192xf32>) -> tensor<1x96x96x192xf32>
  %3 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%arg0, %arg1 : tensor<1x98x98x48xf32>, tensor<3x3x48x192xf32>) outs(%2 : tensor<1x96x96x192xf32>) -> tensor<1x96x96x192xf32>
  %4 = linalg.generic {indexing_maps = [#map, #map1, #map], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%3, %arg2 : tensor<1x96x96x192xf32>, tensor<192xf32>) outs(%1 : tensor<1x96x96x192xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %5 = arith.addf %in, %in_1 : f32
    linalg.yield %5 : f32
  } -> tensor<1x96x96x192xf32>
  %pack = tensor.pack %4 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [16, 1] into %0 : tensor<1x96x96x192xf32> -> tensor<1x96x6x192x16x1xf32>
  return %pack : tensor<1x96x6x192x16x1xf32>
}

Compiling this generates a very large number of ops during LLVMCPUVirtualVectorLowering, which results in bad performance.

I suspect it is due to the wrong tile sizes being used on the convolution op. In the dump below we can see that [1, 1, 8, 16, 0, 0, 0] is set as the parallel-dim tile sizes for the convolution op, but after tile-and-fuse the final tile of the convolution op is tensor<1x1x96x16xf32> (presumably the pack op's outer tile of 6 on that dimension multiplied by its inner tile size of 16).

// -----// IR Dump After LLVMCPUSelectLoweringStrategy (iree-llvmcpu-select-lowering-strategy) //----- //
hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "cascadelake", cpu_features = "+cmov,+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vnni,+adx,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fsgsbase,+crc32,+invpcid,+sahf,+lzcnt,+movbe,+x87,+pku,+prfchw,+rdrnd,+rdseed,+xsave,+xsavec,+xsaveopt,+xsaves,+fxsr,+evex512", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 64 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf", ukernels = "all"}>) {
  hal.executable.export public @forward_dispatch_12_conv_2d_nhwc_hwcf_1x96x96x192x3x3x48_f32_pack_dispatch_0_conv_2d_nhwc_hwcf_1x96x96x192x3x3x48_f32_pack ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer, ReadOnly>, <3, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>, #hal.interface.binding<0, 3>], translation_info = #iree_codegen.translation_info<CPUConvTileAndDecomposeExpert>} {
  ^bb0(%arg0: !hal.device):
    %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
    hal.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @forward_dispatch_12_conv_2d_nhwc_hwcf_1x96x96x192x3x3x48_f32_pack_dispatch_0_conv_2d_nhwc_hwcf_1x96x96x192x3x3x48_f32_pack() {
      %cst = arith.constant 0.000000e+00 : f32
      %c0 = arith.constant 0 : index
      %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x98x98x48xf32>>
      %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<3x3x48x192xf32>>
      %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<192xf32>>
      %3 = hal.interface.binding.subspan set(0) binding(3) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<1x96x6x192x16x1xf32>>
      %4 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [1, 98, 98, 48], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x98x98x48xf32>> -> tensor<1x98x98x48xf32>
      %5 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [3, 3, 48, 192], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<3x3x48x192xf32>> -> tensor<3x3x48x192xf32>
      %6 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [192], strides = [1] : !flow.dispatch.tensor<readonly:tensor<192xf32>> -> tensor<192xf32>
      %7 = tensor.empty() : tensor<1x96x6x192x16x1xf32>
      %8 = tensor.empty() : tensor<1x96x96x192xf32>
      %9 = linalg.fill ins(%cst : f32) outs(%8 : tensor<1x96x96x192xf32>) -> tensor<1x96x96x192xf32>
      %10 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 48, 48, 64, 0, 0, 0], [1, 1, 8, 16, 0, 0, 0], [0, 0, 0, 0, 1, 1, 8], [0, 0, 0, 0, 0, 0, 0]]>, strides = dense<1> : tensor<2xi64>} ins(%4, %5 : tensor<1x98x98x48xf32>, tensor<3x3x48x192xf32>) outs(%9 : tensor<1x96x96x192xf32>) -> tensor<1x96x96x192xf32>
      %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%10, %6 : tensor<1x96x96x192xf32>, tensor<192xf32>) outs(%8 : tensor<1x96x96x192xf32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %12 = arith.addf %in, %in_0 : f32
        linalg.yield %12 : f32
      } -> tensor<1x96x96x192xf32>
      %pack = tensor.pack %11 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [16, 1] into %7 : tensor<1x96x96x192xf32> -> tensor<1x96x6x192x16x1xf32>
      flow.dispatch.tensor.store %pack, %3, offsets = [0, 0, 0, 0, 0, 0], sizes = [1, 96, 6, 192, 16, 1], strides = [1, 1, 1, 1, 1, 1] : tensor<1x96x6x192x16x1xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x96x6x192x16x1xf32>>
      return
    }
  }
}

...

// -----// IR Dump After LLVMCPUTileAndFuse (iree-llvmcpu-tile-and-fuse) //----- //
func.func @forward_dispatch_12_conv_2d_nhwc_hwcf_1x96x96x192x3x3x48_f32_pack_dispatch_0_conv_2d_nhwc_hwcf_1x96x96x192x3x3x48_f32_pack() {
  %c8 = arith.constant 8 : index
  %c3 = arith.constant 3 : index
  %c16 = arith.constant 16 : index
  %c64 = arith.constant 64 : index
  %c48 = arith.constant 48 : index
  %c1 = arith.constant 1 : index
  %c192 = arith.constant 192 : index
  %c6 = arith.constant 6 : index
  %c96 = arith.constant 96 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x98x98x48xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<3x3x48x192xf32>>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<192xf32>>
  %3 = hal.interface.binding.subspan set(0) binding(3) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<1x96x6x192x16x1xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %workgroup_id_z = hal.interface.workgroup.id[2] : index
  %workgroup_count_z = hal.interface.workgroup.count[2] : index
  %4 = affine.apply affine_map<()[s0] -> (s0 * 48)>()[%workgroup_id_z]
  %5 = affine.apply affine_map<()[s0] -> (s0 * 48)>()[%workgroup_count_z]
  scf.for %arg0 = %4 to %c96 step %5 {
    %6 = affine.apply affine_map<()[s0] -> (s0 * 48)>()[%workgroup_id_y]
    %7 = affine.apply affine_map<()[s0] -> (s0 * 48)>()[%workgroup_count_y]
    scf.for %arg1 = %6 to %c6 step %7 {
      %8 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_x]
      %9 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_x]
      scf.for %arg2 = %8 to %c192 step %9 {
        %10 = flow.dispatch.tensor.load %3, offsets = [0, %arg0, %arg1, %arg2, 0, 0], sizes = [1, 48, 6, 64, 16, 1], strides = [1, 1, 1, 1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<1x96x6x192x16x1xf32>> -> tensor<1x48x6x64x16x1xf32>
        %11 = affine.apply affine_map<(d0) -> (d0 * 16)>(%arg1)
        %12 = flow.dispatch.tensor.load %0, offsets = [0, %arg0, %11, 0], sizes = [1, 50, 98, 48], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x98x98x48xf32>> -> tensor<1x50x98x48xf32>
        %13 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, %arg2], sizes = [3, 3, 48, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<3x3x48x192xf32>> -> tensor<3x3x48x64xf32>
        %14 = flow.dispatch.tensor.load %2, offsets = [%arg2], sizes = [64], strides = [1] : !flow.dispatch.tensor<readonly:tensor<192xf32>> -> tensor<64xf32>
        %15 = scf.for %arg3 = %c0 to %c48 step %c1 iter_args(%arg4 = %10) -> (tensor<1x48x6x64x16x1xf32>) {
          %16 = scf.for %arg5 = %c0 to %c64 step %c16 iter_args(%arg6 = %arg4) -> (tensor<1x48x6x64x16x1xf32>) {
            %extracted_slice = tensor.extract_slice %12[0, %arg3, 0, 0] [1, 3, 98, 48] [1, 1, 1, 1] : tensor<1x50x98x48xf32> to tensor<1x3x98x48xf32>
            %extracted_slice_0 = tensor.extract_slice %13[0, 0, 0, %arg5] [3, 3, 48, 16] [1, 1, 1, 1] : tensor<3x3x48x64xf32> to tensor<3x3x48x16xf32>
            %17 = tensor.empty() : tensor<1x1x96x16xf32>
            %18 = linalg.fill ins(%cst : f32) outs(%17 : tensor<1x1x96x16xf32>) -> tensor<1x1x96x16xf32>
            %19 = scf.for %arg7 = %c0 to %c3 step %c1 iter_args(%arg8 = %18) -> (tensor<1x1x96x16xf32>) {
              %22 = scf.for %arg9 = %c0 to %c3 step %c1 iter_args(%arg10 = %arg8) -> (tensor<1x1x96x16xf32>) {
                %23 = scf.for %arg11 = %c0 to %c48 step %c8 iter_args(%arg12 = %arg10) -> (tensor<1x1x96x16xf32>) {
                  %extracted_slice_3 = tensor.extract_slice %extracted_slice[0, %arg7, %arg9, %arg11] [1, 1, 96, 8] [1, 1, 1, 1] : tensor<1x3x98x48xf32> to tensor<1x1x96x8xf32>
                  %extracted_slice_4 = tensor.extract_slice %extracted_slice_0[%arg7, %arg9, %arg11, 0] [1, 1, 8, 16] [1, 1, 1, 1] : tensor<3x3x48x16xf32> to tensor<1x1x8x16xf32>
                  %24 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 48, 48, 64, 0, 0, 0], [1, 1, 8, 16, 0, 0, 0], [0, 0, 0, 0, 1, 1, 8], [0, 0, 0, 0, 0, 0, 0]]>, strides = dense<1> : tensor<2xi64>} ins(%extracted_slice_3, %extracted_slice_4 : tensor<1x1x96x8xf32>, tensor<1x1x8x16xf32>) outs(%arg12 : tensor<1x1x96x16xf32>) -> tensor<1x1x96x16xf32>
                  scf.yield %24 : tensor<1x1x96x16xf32>
                }
                scf.yield %23 : tensor<1x1x96x16xf32>
              }
              scf.yield %22 : tensor<1x1x96x16xf32>
            }
            %extracted_slice_1 = tensor.extract_slice %14[%arg5] [16] [1] : tensor<64xf32> to tensor<16xf32>
            %20 = tensor.empty() : tensor<1x1x96x16xf32>
            %21 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%19, %extracted_slice_1 : tensor<1x1x96x16xf32>, tensor<16xf32>) outs(%20 : tensor<1x1x96x16xf32>) {
            ^bb0(%in: f32, %in_3: f32, %out: f32):
              %22 = arith.addf %in, %in_3 : f32
              linalg.yield %22 : f32
            } -> tensor<1x1x96x16xf32>
            %extracted_slice_2 = tensor.extract_slice %arg6[0, %arg3, 0, %arg5, 0, 0] [1, 1, 6, 16, 16, 1] [1, 1, 1, 1, 1, 1] : tensor<1x48x6x64x16x1xf32> to tensor<1x1x6x16x16x1xf32>
            %pack = tensor.pack %21 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [16, 1] into %extracted_slice_2 : tensor<1x1x96x16xf32> -> tensor<1x1x6x16x16x1xf32>
            %inserted_slice = tensor.insert_slice %pack into %arg6[0, %arg3, 0, %arg5, 0, 0] [1, 1, 6, 16, 16, 1] [1, 1, 1, 1, 1, 1] : tensor<1x1x6x16x16x1xf32> into tensor<1x48x6x64x16x1xf32>
            scf.yield %inserted_slice : tensor<1x48x6x64x16x1xf32>
          }
          scf.yield %16 : tensor<1x48x6x64x16x1xf32>
        }
        flow.dispatch.tensor.store %15, %3, offsets = [0, %arg0, %arg1, %arg2, 0, 0], sizes = [1, 48, 6, 64, 16, 1], strides = [1, 1, 1, 1, 1, 1] : tensor<1x48x6x64x16x1xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x96x6x192x16x1xf32>>
      }
    }
  }
  return
}

I think it is due to the TODO below, which doesn't propagate and scale the tile sizes from the convolution op to the pack op. During tile-and-fuse, the last compute op in the dispatch (the tensor.pack) is used as the starting point, and if it doesn't have a lowering config, the lowering config from the root op (the convolution) is borrowed directly. However, the outer-dim tile sizes of the pack op need to be scaled down by its inner tile sizes, so borrowing directly from the convolution op results in tile sizes that are too large.

https://github.com/openxla/iree/blob/cdff01fcf74f8799a10dddcd5d279f6bbba9ebcc/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L2302-L2309
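For illustration, here is a minimal sketch of the missing scaling, assuming the conv's lowering config is propagated to the trailing tensor.pack. The helper name and signature are made up for this example and are not part of IREE; only the arithmetic (dividing the packed dims' tile sizes by the pack's inner tiles) matters:

```cpp
// Minimal sketch, not the actual IREE API: rescale the root (conv) op's tile
// sizes so they are valid for the consumer tensor.pack's outer dimensions.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<int64_t> scaleTileSizesForPack(
    const std::vector<int64_t> &convTileSizes,  // e.g. [0, 48, 48, 64]
    const std::vector<int64_t> &innerDimsPos,   // e.g. [2, 3]
    const std::vector<int64_t> &innerTiles) {   // e.g. [16, 1]
  std::vector<int64_t> packTileSizes = convTileSizes;
  for (std::size_t i = 0; i < innerDimsPos.size(); ++i) {
    // Each packed dimension's tile size must be divided by its inner tile;
    // borrowing the conv's sizes unscaled makes the pack tile (and therefore
    // the fused conv tile) innerTile times too large on that dimension.
    packTileSizes[innerDimsPos[i]] =
        convTileSizes[innerDimsPos[i]] / innerTiles[i];
  }
  return packTileSizes;
}

int main() {
  // Distribution-level parallel tile sizes from the dump above.
  auto scaled = scaleTileSizesForPack({0, 48, 48, 64}, {2, 3}, {16, 1});
  assert((scaled == std::vector<int64_t>{0, 48, 3, 64}));
  return 0;
}
```

With the conv's distribution sizes [0, 48, 48, 64] and the pack's inner_tiles [16, 1] on dims [2, 3], the pack would get [0, 48, 3, 64], which fits the actual extents of its outer dims (1x96x6x192).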

This regresses EfficientNetV2 latency when working on https://github.com/openxla/iree/issues/16682 to create more pack/unpack fusions.

hanhanW commented 5 months ago

I thought multiple lowering_configs should work for convolution... The issue is probably from scalable vectorization... Can you try disabling it only when scalable vectors are involved?

pzread commented 5 months ago

> I thought multiple lowering_configs should work for convolution... The issue is probably from scalable vectorization... Can you try disabling it only when scalable vectors are involved?

Not sure if I follow this. My understanding is that, due to the TODO I mentioned in the issue, we don't set multiple lowering_configs when the root op is a convolution op. In the case where the convolution is followed by a pack op, tile-and-fuse starts from the pack op but directly uses the tiling config from the convolution op, which results in incorrect tile sizes (the tile sizes should be scaled by the pack's inner tile sizes).

https://github.com/openxla/iree/blob/529826f1ab9cbe9f47f244557b0cc52cc5a83f07/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUTileAndFuse.cpp#L243-L261
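Roughly, the behavior looks like this (a simplified sketch with made-up types, not the actual code at the link):

```cpp
// Simplified sketch, not the actual IREE code: tile-and-fuse starts from the
// last compute op in the dispatch (the tensor.pack) and, if that op carries no
// lowering_config of its own, borrows the root op's (the conv's) config as-is.
#include <cstdint>
#include <vector>

struct OpInfo {
  bool hasLoweringConfig;
  std::vector<int64_t> tileSizes;  // valid only when hasLoweringConfig is true
};

std::vector<int64_t> pickTileSizes(const OpInfo &lastComputeOp,  // tensor.pack
                                   const OpInfo &rootOp) {       // linalg.conv
  if (lastComputeOp.hasLoweringConfig)
    return lastComputeOp.tileSizes;
  // Borrowed verbatim: no rescaling by the pack's inner tiles happens here,
  // which is what yields the oversized conv tile shown in the dump above.
  return rootOp.tileSizes;
}

int main() {
  OpInfo pack{false, {}};            // tensor.pack has no lowering_config
  OpInfo conv{true, {1, 1, 8, 16}};  // conv's vector-level parallel tile sizes
  // Returns {1, 1, 8, 16} unscaled, even though the pack's dims 2 and 3 are
  // packed by inner_tiles [16, 1].
  auto sizes = pickTileSizes(pack, conv);
  return sizes.size() == 4 ? 0 : 1;
}
```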

hanhanW commented 5 months ago

I meant, does it work on x86 CPUs if we comment out the lines below?

https://github.com/openxla/iree/blob/cdff01fcf74f8799a10dddcd5d279f6bbba9ebcc/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L2305-L2309