iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

compilation fails because of vector size verification error #19005

Open ziereis opened 1 week ago

ziereis commented 1 week ago

What happened?

Compilation to llvm-cpu fails with error: One or more operations with large vector sizes (8192 bytes) were found

Input IR:

#map = affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>
#map2 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>
#map3 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map4 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map5 = affine_map<(d0, d1, d2) -> (d1, d2)>
#map6 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  util.func public @main(%arg0: tensor<128x256x32xi8>, %arg1: tensor<10x256x32xi8>, %arg2: tensor<128x256xf32>, %arg3: tensor<10x256xf32>) -> tensor<10x128xf32> {
    %0 = tensor.empty() : tensor<10x128xf32>
    %2 = tensor.empty() : tensor<10x128x256xi32>
    %4 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%arg1, %arg0 : tensor<10x256x32xi8>, tensor<128x256x32xi8>) outs(%2 : tensor<10x128x256xi32>) {
    ^bb0(%in: i8, %in_1: i8, %out: i32):
      %6 = arith.extsi %in : i8 to i32
      %7 = arith.extsi %in_1 : i8 to i32
      %8 = arith.muli %6, %7 : i32
      %9 = arith.addi %8, %out : i32
      linalg.yield %9 : i32
    } -> tensor<10x128x256xi32>
    %5 = linalg.generic {indexing_maps = [#map3, #map4, #map5, #map6], iterator_types = ["parallel", "parallel", "reduction"]} ins(%4, %arg3, %arg2 : tensor<10x128x256xi32>, tensor<10x256xf32>, tensor<128x256xf32>) outs(%0 : tensor<10x128xf32>) {
    ^bb0(%in: i32, %in_1: f32, %in_2: f32, %out: f32):
      %6 = arith.sitofp %in : i32 to f32
      %7 = arith.mulf %in_2, %in_1 : f32
      %8 = arith.mulf %7, %6 : f32
      %9 = arith.addf %8, %out : f32
      linalg.yield %9 : f32
    } -> tensor<10x128xf32>
    util.return %5 : tensor<10x128xf32>
  }
}

This fails to compile. By changing the second dimension of the tensors (256 in this case) you can get it to compile; for example, 32 works.
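For reference, a sketch of a variant that compiles (hypothetical shapes, assuming only the second dimension is shrunk from 256 to 32 in every operand; the linalg.generic bodies and indexing maps stay the same):

  // intermediate accumulator correspondingly becomes tensor<10x128x32xi32>
  util.func public @main(%arg0: tensor<128x32x32xi8>, %arg1: tensor<10x32x32xi8>,
                         %arg2: tensor<128x32xf32>, %arg3: tensor<10x32xf32>) -> tensor<10x128xf32>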

Example error:

error: One or more operations with large vector sizes (8192 bytes) were found:

%24 = vector.transfer_read %13[%c0, %17, %23, %c0, %c0], %c0_i32 {in_bounds = [true, true, true, true, true]} : tensor<256x1x8x8x4xi32>, vector<256x1x1x8x4xi32>

Steps to reproduce your issue

iree-compile --iree-hal-target-device=llvm-cpu input.mlir

What component(s) does this issue relate to?

Compiler

Version information

9c85e30

Additional context

No response

kuhar commented 5 days ago

@hanhanW seems like something with the vectorizer potentially?

MaheshRavishankar commented 5 days ago

Cc @pashu123 as well. Seems like a tile size issue.

pashu123 commented 5 days ago

@ziereis Could you replace https://github.com/iree-org/iree/blob/f42b90d23c332bee6dedd1c8f44e07b9b1a52f74/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L408 with funcPassManager.addPass(createLLVMCPUTileRootAndFuseInputOperands(i)); and try?
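For clarity, the suggested local edit would look roughly like this (a sketch; the exact line currently at that location may differ, and `i` is whatever tiling level is already in use at that point in the pipeline):

  // compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp, around line 408:
  // swap the pass added here for the tile-root-and-fuse-input-operands variant.
  funcPassManager.addPass(createLLVMCPUTileRootAndFuseInputOperands(i));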

ziereis commented 5 days ago

@pashu123 I tested it with this example and a couple of other ones that previously failed, and they all compile with this fix.

pashu123 commented 5 days ago

@ziereis For context, this was introduced in https://github.com/iree-org/iree/pull/18114, but we only enabled it for the convExpert pipeline.

hanhanW commented 4 days ago

I cannot reproduce the issue because the target CPU is not specified. Can you provide the log with --mlir-print-ir-after-all --mlir-disable-threading?

Btw, I think the issue is not related to LLVMCPUTileRootAndFuseInputOperands: there are two reductions, and they are formed into different dispatches. The issue is that we get large tile sizes in the lowering_config.
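For example, something along these lines should capture the per-pass dump (a sketch combining the reproduce command with the requested flags; <your-cpu> is a placeholder, and the dump is written to stderr):

  iree-compile --iree-hal-target-device=llvm-cpu --iree-llvmcpu-target-cpu=<your-cpu> \
    --mlir-print-ir-after-all --mlir-disable-threading \
    input.mlir -o out.vmfb 2> ir_after_all.txt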

pashu123 commented 4 days ago

> I cannot reproduce the issue because the target CPU is not specified. Can you provide the log with --mlir-print-ir-after-all --mlir-disable-threading?
>
> Btw, I think the issue is not related to LLVMCPUTileRootAndFuseInputOperands: there are two reductions, and they are formed into different dispatches. The issue is that we get large tile sizes in the lowering_config.

I was also surprised, but when I looked at the dispatches, the last one was a fused unpack + reduction.

hanhanW commented 4 days ago

I see. I think they are batch_matmul ops in generic form, so data-tiling kicks in. And I don't have cpu_features because the CPU target is not specified, so those encodings are dropped; thus I'm not able to reproduce it. It would be easier if @ziereis could provide the IR dumps.

ziereis commented 4 days ago

Sorry for not providing the flags. Here is the full command:

./build/tools/iree-compile --iree-hal-target-device=llvm-cpu --iree-llvmcpu-target-cpu=znver4 reproducer.mlir -o out.vmfb        

The IR dump is also attached:

ir_after_all.txt