iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[Multi-Use Fusion] Fail to compile ClipTextSeqLen64PT on x86_64 with multi-use fusion #12882


pzread commented 1 year ago

When compiling the model ClipTextSeqLen64PT on x86_64 with --iree-flow-fuse-multi-use, memref.alloca ops with dynamic sizes are left in the middle of the IR, which causes LLVMCPUCheckIRBeforeLLVMConversion to fail.
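For illustration, here is a minimal, hypothetical reduction (not taken from the model; the function and argument names are made up) of the pattern the check pass rejects: a dynamically sized memref.alloca that does not sit in the function's entry block, so it cannot become a fixed-size stack reservation:

func.func @dynamic_alloca_in_loop(%n: index, %lb: index, %ub: index, %step: index) {
  %cst = arith.constant 0.000000e+00 : f32
  scf.for %i = %lb to %ub step %step {
    // The size %n is only known at runtime and the alloca lives inside the
    // loop body, so LLVMCPUCheckIRBeforeLLVMConversion rejects it.
    %buf = memref.alloca(%n) {alignment = 64 : i64} : memref<?xf32>
    linalg.fill ins(%cst : f32) outs(%buf : memref<?xf32>)
  }
  return
}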

Reproduce

Input MLIR: https://storage.googleapis.com/iree-jerry-test/model_9a9515c7-cb68-4c34-b1d2-0e8c0a3620b8_ClipTextSeqLen64PT.mlir

iree-compile \
  --output-format=vm-bytecode \
  --iree-hal-target-backends=llvm-cpu \
  --iree-input-type=none \
  --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
  --iree-llvmcpu-target-cpu=cascadelake \
  --iree-flow-fuse-multi-use \
  ./model_9a9515c7-cb68-4c34-b1d2-0e8c0a3620b8_ClipTextSeqLen64PT.mlir \
  -o test.vmfb

Error:

<eval_with_key>.2:1108:18: error: 'memref.alloca' op all stack allocations need to be hoisted to the entry block of the function
<eval_with_key>.2:1108:18: note: see current operation: %37 = "memref.alloca"(%33) {alignment = 64 : i64, operand_segment_sizes = array<i32: 1, 0>} : (index) -> memref<?xf32>
<eval_with_key>.2:1118:13: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "cascadelake", cpu_features = "+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vnni,+adx,+clflushopt,+clwb,+cx16,+cx8,+crc32,+f16c,+fsgsbase,+fxsr,+invpcid,+lzcnt,+movbe,+pku,+prfchw,+rdrnd,+rdseed,+sahf,+x87,+xsave,+xsavec,+xsaveopt,+xsaves", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", native_vector_size = 32 : index, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
<eval_with_key>.2:1118:13: note: see current operation:
"hal.executable.variant"() ({
  "hal.executable.export"() ({
  ^bb0(%arg0: !hal.device, %arg1: index, %arg2: index):
    %0 = "arith.constant"() {value = 7 : index} : () -> index
    %1 = "arith.constant"() {value = 1 : index} : () -> index
    "hal.return"(%0, %1, %1) : (index, index, index) -> ()
  }) {layout = #hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>, ordinal = 0 : index, sym_name = "forward_dispatch_170_generic_77x768", translation_info = #iree_codegen.translation_info<CPUDoubleTilingExpert>} : () -> ()
  "builtin.module"() ({                                                                                                                                                     
    "func.func"() ({                                                                                                                                                        
      %0 = "arith.constant"() {value = dense<0.000000e+00> : vector<8x8xf32>} : () -> vector<8x8xf32>                                                                       
      %1 = "arith.constant"() {value = 1 : index} : () -> index                                                                                                             
      %2 = "arith.constant"() {value = 2 : index} : () -> index                                                                                                             
      %3 = "arith.constant"() {value = 3 : index} : () -> index                                                                                                             
      %4 = "arith.constant"() {value = 4 : index} : () -> index                                                                                                             
      %5 = "arith.constant"() {value = 5 : index} : () -> index                                                                                                             
      %6 = "arith.constant"() {value = 6 : index} : () -> index                                                                                                             
      %7 = "arith.constant"() {value = 7 : index} : () -> index                                                                                                             
      %8 = "arith.constant"() {value = dense<0.000000e+00> : vector<8xf32>} : () -> vector<8xf32>                                                                           
      %9 = "arith.constant"() {value = 9.99999974E-6 : f32} : () -> f32                                                                                                     
      %10 = "arith.constant"() {value = 7.680000e+02 : f32} : () -> f32                                                                                                     
      %11 = "arith.constant"() {value = 0.000000e+00 : f32} : () -> f32                                                                                                     
      %12 = "arith.constant"() {value = 492239488 : index} : () -> index
pzread commented 1 year ago

It looks like we tile a generic op of shape 11x768 with tile sizes (8, 8), which creates indivisible boundary tiles with dynamic sizes. Because there is an intermediate alloca for that generic op, the alloca size becomes dynamic as well and can't be hoisted to the entry block.
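Concretely, the boundary arithmetic from the dump below boils down to this (illustrative function, name mine): min(11 - iv, 8) evaluates to 8 for the first tile at iv = 0 but to 3 for the remainder tile at iv = 8, so the tile size, and any buffer sized from it, is dynamic:

func.func @boundary_tile_size(%iv: index) -> index {
  // min(11 - iv, 8): 8 at iv = 0 and 3 at iv = 8, the peeled remainder of
  // an 11-sized dimension tiled by 8.
  %tile = affine.min affine_map<(d0) -> (-d0 + 11, 8)>(%iv)
  return %tile : index
}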

----- CSE (After LLVMCPUTileAndFuse and LLVMCPUTile) 429811 -----
func.func @forward_dispatch_170_generic_77x768() {
  %c768 = arith.constant 768 : index
  %c11 = arith.constant 11 : index
  %c8 = arith.constant 8 : index
  %c0 = arith.constant 0 : index
  %c77 = arith.constant 77 : index
  %c709632 = arith.constant 709632 : index
  %c492233344 = arith.constant 492233344 : index
  %c492236416 = arith.constant 492236416 : index
  %c492239488 = arith.constant 492239488 : index
  %cst = arith.constant 0.000000e+00 : f32
  %cst_0 = arith.constant 7.680000e+02 : f32
  %cst_1 = arith.constant 9.99999974E-6 : f32
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c709632) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<77x768xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c492233344) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<768xf32>>
  %2 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<77x768xf32>>
  %3 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c492236416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<768xf32>>
  %4 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c492239488) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<768xf32>>
  %5 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<77x768xf32>>
  %6 = flow.dispatch.tensor.load %1, offsets = [0], sizes = [768], strides = [1] : !flow.dispatch.tensor<readonly:tensor<768xf32>> -> tensor<768xf32>
  %7 = flow.dispatch.tensor.load %3, offsets = [0], sizes = [768], strides = [1] : !flow.dispatch.tensor<readonly:tensor<768xf32>> -> tensor<768xf32>
  %8 = flow.dispatch.tensor.load %4, offsets = [0], sizes = [768], strides = [1] : !flow.dispatch.tensor<readonly:tensor<768xf32>> -> tensor<768xf32>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %9 = affine.apply affine_map<()[s0] -> (s0 * 11)>()[%workgroup_id_x]
  %10 = affine.apply affine_map<()[s0] -> (s0 * 11)>()[%workgroup_count_x]
  scf.for %arg0 = %9 to %c77 step %10 {
    %11 = flow.dispatch.tensor.load %5, offsets = [%arg0, 0], sizes = [11, 768], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<77x768xf32>> -> tensor<11x768xf32>
    %12 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0], sizes = [11, 768], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<77x768xf32>> -> tensor<11x768xf32>
    %13 = flow.dispatch.tensor.load %2, offsets = [%arg0, 0], sizes = [11, 768], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<77x768xf32>> -> tensor<11x768xf32>
    %14 = scf.for %arg1 = %c0 to %c11 step %c8 iter_args(%arg2 = %11) -> (tensor<11x768xf32>) {
      %15 = affine.min affine_map<(d0) -> (-d0 + 11, 8)>(%arg1) // Indivisible tiling size
      %extracted_slice = tensor.extract_slice %12[%arg1, 0] [%15, 768] [1, 1] : tensor<11x768xf32> to tensor<?x768xf32>
      %extracted_slice_2 = tensor.extract_slice %13[%arg1, 0] [%15, 768] [1, 1] : tensor<11x768xf32> to tensor<?x768xf32>
      %16 = tensor.empty(%15) : tensor<?xf32> // Dynamic allocation
      %17 = linalg.fill ins(%cst : f32) outs(%16 : tensor<?xf32>) -> tensor<?xf32>
      %18 = scf.for %arg3 = %c0 to %c768 step %c8 iter_args(%arg4 = %17) -> (tensor<?xf32>) {
        %extracted_slice_4 = tensor.extract_slice %extracted_slice[0, %arg3] [%15, 8] [1, 1] : tensor<?x768xf32> to tensor<?x8xf32>
        %extracted_slice_5 = tensor.extract_slice %6[%arg3] [8] [1] : tensor<768xf32> to tensor<8xf32>
        %extracted_slice_6 = tensor.extract_slice %extracted_slice_2[0, %arg3] [%15, 8] [1, 1] : tensor<?x768xf32> to tensor<?x8xf32>
        %extracted_slice_7 = tensor.extract_slice %arg4[0] [%15] [1] : tensor<?xf32> to tensor<?xf32>
        %21 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%extracted_slice_4, %extracted_slice_5, %extracted_slice_6 : tensor<?x8xf32>, tensor<8xf32>, tensor<?x8xf32>) outs(%extracted_slice_7 : tensor<?xf32>) {
        ^bb0(%in: f32, %in_9: f32, %in_10: f32, %out: f32):
          %22 = arith.addf %in_9, %in_10 : f32
          %23 = arith.addf %in, %22 : f32
          %24 = arith.addf %23, %out : f32
          linalg.yield %24 : f32
        } -> tensor<?xf32>
        %inserted_slice_8 = tensor.insert_slice %21 into %arg4[0] [%15] [1] : tensor<?xf32> into tensor<?xf32>
        scf.yield %inserted_slice_8 : tensor<?xf32>
      }
      %19 = scf.for %arg3 = %c0 to %c768 step %c8 iter_args(%arg4 = %17) -> (tensor<?xf32>) {
        %extracted_slice_4 = tensor.extract_slice %extracted_slice[0, %arg3] [%15, 8] [1, 1] : tensor<?x768xf32> to tensor<?x8xf32>
        %extracted_slice_5 = tensor.extract_slice %6[%arg3] [8] [1] : tensor<768xf32> to tensor<8xf32>
        %extracted_slice_6 = tensor.extract_slice %extracted_slice_2[0, %arg3] [%15, 8] [1, 1] : tensor<?x768xf32> to tensor<?x8xf32>
        %extracted_slice_7 = tensor.extract_slice %18[0] [%15] [1] : tensor<?xf32> to tensor<?xf32>
        %extracted_slice_8 = tensor.extract_slice %arg4[0] [%15] [1] : tensor<?xf32> to tensor<?xf32>
        %21 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%extracted_slice_4, %extracted_slice_5, %extracted_slice_6, %extracted_slice_7 : tensor<?x8xf32>, tensor<8xf32>, tensor<?x8xf32>, tensor<?xf32>) outs(%extracted_slice_8 : tensor<?xf32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[11, 0], [8, 0], [0, 8]]>} {
        ^bb0(%in: f32, %in_10: f32, %in_11: f32, %in_12: f32, %out: f32):
          %22 = arith.addf %in_10, %in_11 : f32
          %23 = arith.addf %in, %22 : f32
          %24 = arith.divf %in_12, %cst_0 : f32
          %25 = arith.subf %23, %24 : f32
          %26 = arith.mulf %25, %25 : f32
          %27 = arith.addf %26, %out : f32
          linalg.yield %27 : f32
        } -> tensor<?xf32>
        %inserted_slice_9 = tensor.insert_slice %21 into %arg4[0] [%15] [1] : tensor<?xf32> into tensor<?xf32>
        scf.yield %inserted_slice_9 : tensor<?xf32>
      }
      %extracted_slice_3 = tensor.extract_slice %arg2[%arg1, 0] [%15, 768] [1, 1] : tensor<11x768xf32> to tensor<?x768xf32>
      %20 = scf.for %arg3 = %c0 to %c768 step %c8 iter_args(%arg4 = %extracted_slice_3) -> (tensor<?x768xf32>) {
        %extracted_slice_4 = tensor.extract_slice %extracted_slice[0, %arg3] [%15, 8] [1, 1] : tensor<?x768xf32> to tensor<?x8xf32>
        %extracted_slice_5 = tensor.extract_slice %6[%arg3] [8] [1] : tensor<768xf32> to tensor<8xf32>
        %extracted_slice_6 = tensor.extract_slice %extracted_slice_2[0, %arg3] [%15, 8] [1, 1] : tensor<?x768xf32> to tensor<?x8xf32>
        %extracted_slice_7 = tensor.extract_slice %18[0] [%15] [1] : tensor<?xf32> to tensor<?xf32>
        %extracted_slice_8 = tensor.extract_slice %19[0] [%15] [1] : tensor<?xf32> to tensor<?xf32>
        %extracted_slice_9 = tensor.extract_slice %7[%arg3] [8] [1] : tensor<768xf32> to tensor<8xf32>
        %extracted_slice_10 = tensor.extract_slice %8[%arg3] [8] [1] : tensor<768xf32> to tensor<8xf32>
        %extracted_slice_11 = tensor.extract_slice %arg4[0, %arg3] [%15, 8] [1, 1] : tensor<?x768xf32> to tensor<?x8xf32>
        %21 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice_4, %extracted_slice_5, %extracted_slice_6, %extracted_slice_7, %extracted_slice_8, %extracted_slice_9, %extracted_slice_10 : tensor<?x8xf32>, tensor<8xf32>, tensor<?x8xf32>, tensor<?xf32>, tensor<?xf32>, tensor<8xf32>, tensor<8xf32>) outs(%extracted_slice_11 : tensor<?x8xf32>) {
        ^bb0(%in: f32, %in_13: f32, %in_14: f32, %in_15: f32, %in_16: f32, %in_17: f32, %in_18: f32, %out: f32):
          %22 = arith.addf %in_13, %in_14 : f32
          %23 = arith.addf %in, %22 : f32
          %24 = arith.divf %in_15, %cst_0 : f32
          %25 = arith.subf %23, %24 : f32
          %26 = arith.divf %in_16, %cst_0 : f32
          %27 = arith.addf %26, %cst_1 : f32
          %28 = math.rsqrt %27 : f32
          %29 = arith.mulf %25, %28 : f32
          %30 = arith.mulf %29, %in_17 : f32
          %31 = arith.addf %30, %in_18 : f32
          linalg.yield %31 : f32
        } -> tensor<?x8xf32>
        %inserted_slice_12 = tensor.insert_slice %21 into %arg4[0, %arg3] [%15, 8] [1, 1] : tensor<?x8xf32> into tensor<?x768xf32>
        scf.yield %inserted_slice_12 : tensor<?x768xf32>
      }
      %inserted_slice = tensor.insert_slice %20 into %arg2[%arg1, 0] [%15, 768] [1, 1] : tensor<?x768xf32> into tensor<11x768xf32>
      scf.yield %inserted_slice : tensor<11x768xf32>
    }
    flow.dispatch.tensor.store %14, %5, offsets = [%arg0, 0], sizes = [11, 768], strides = [1, 1] : tensor<11x768xf32> -> !flow.dispatch.tensor<writeonly:tensor<77x768xf32>>
  }
  return
}
dcaballe commented 1 year ago

I think this is a limitation of the logic that computes the upper bound (UB) for the memref size. It should be fixed once all the ValueBounds patches land?
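As a hedged sketch of the idea (not the actual fix): once a constant upper bound of 8 is derived for %15 = affine.min(11 - iv, 8), the allocation can be made static in the entry block and sliced per iteration, which is exactly the shape the check pass accepts:

func.func @hoisted_upper_bound_alloca(%iv: index) {
  %cst = arith.constant 0.000000e+00 : f32
  // Static allocation sized by the derived upper bound (8), legal in the
  // entry block.
  %buf = memref.alloca() {alignment = 64 : i64} : memref<8xf32>
  %tile = affine.min affine_map<(d0) -> (-d0 + 11, 8)>(%iv)
  // Only the first %tile elements are used for the boundary tile.
  %view = memref.subview %buf[0] [%tile] [1] : memref<8xf32> to memref<?xf32, strided<[1]>>
  linalg.fill ins(%cst : f32) outs(%view : memref<?xf32, strided<[1]>>)
  return
}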

(Shouldn't we be fusing all of the scf.for %arg3 = %c0 to %c768 step %c8 iter_args(%arg4 = %17) -> (tensor<?xf32>) loops?)