iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

IREE can't handle the dynamic shape for `torch.aten.bmm` op in tinyllama 1.1B model with `unpack` ukernel. #18898

Open JerryShih opened 2 days ago

JerryShih commented 2 days ago

What happened?

With --iree-llvmcpu-enable-ukernels=all or --iree-llvmcpu-enable-ukernels=unpack, IREE reports the following error for the tinyllama model in the LLVMCPUCheckIRBeforeLLVMConversionPass pass:

error: 'memref.alloca' op expected no unbounded stack allocations
%2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
       ^

The model uses dynamically shaped tensors as the bmm inputs.

Steps to reproduce your issue

Here is a simplified version of the tinyllama model with dynamic shapes: simple_tinyllama.mlir

func.func @test_torch_bmm(%A:tensor<32x?x64xf32>, %B:tensor<32x64x?xf32> ) -> (tensor<32x?x?xf32>) {
  %0 = torch_c.from_builtin_tensor %A : tensor<32x?x64xf32> -> !torch.vtensor<[32,?,64],f32>
  %1 = torch_c.from_builtin_tensor %B : tensor<32x64x?xf32> -> !torch.vtensor<[32,64,?],f32>
  %2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
  %3 = torch_c.to_builtin_tensor %2 : !torch.vtensor<[32,?,?],f32> -> tensor<32x?x?xf32>
  return %3: tensor<32x?x?xf32>
}
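For reference, a minimal sketch (plain Python, no torch or IREE) of the batched matrix multiply that `torch.aten.bmm` performs on the shapes above: the batch dim (32) and reduction dim (64) are static, while the row count of the first operand and the column count of the second are dynamic, so they can differ per call.

```python
# Illustrative batched matmul over nested lists.
# a: [B][M][K], b: [B][K][N] -> result: [B][M][N].
def bmm(a, b):
    B, M, K = len(a), len(a[0]), len(a[0][0])
    N = len(b[0][0])
    assert len(b) == B and len(b[0]) == K  # batch and reduction dims must match
    return [[[sum(a[i][m][k] * b[i][k][n] for k in range(K))
              for n in range(N)]
             for m in range(M)]
            for i in range(B)]

# M and N are only known at call time -- this is the "dynamic shape" case.
a = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]   # [2, 3, 4]
b = [[[2.0] * 5 for _ in range(4)] for _ in range(2)]   # [2, 4, 5]
out = bmm(a, b)                                         # [2, 3, 5]
```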
  1. Build the IREE compiler with default options.
  2. Compile with iree-compile using the following command:
    ./tools/iree-compile \
    --iree-hal-target-backends=llvm-cpu \
    --iree-opt-data-tiling \
    --iree-llvmcpu-enable-ukernels=all \
    --output-format=vm-bytecode \
    --iree-llvmcpu-number-of-threads=1 \
    simple_tinyllama.mlir \
    -o simple_tinyllama.mlir.vmfb
  3. Observe the error messages:
    simple_tinyllama.mlir:4:8: error: 'memref.alloca' op expected no unbounded stack allocations
    %2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
       ^
    simple_tinyllama.mlir:4:8: note: see current operation: %53 = "memref.alloca"(%50, %52) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 2, 0>}> : (index, index) -> memref<1x?x?xf32>
    simple_tinyllama.mlir:4:8: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf", ukernels = "all"}>
    %2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
       ^
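The error above is about the `memref.alloca` of type `memref<1x?x?xf32>`: both `?` dims come from runtime values, so the stack allocation has no compile-time size bound. A hedged sketch of that kind of check (illustrative names only, not IREE's actual implementation):

```python
# Stand-in for '?' in a memref type such as memref<1x?x?xf32>.
DYNAMIC = None

def static_byte_size(shape, elem_bytes=4):
    """Return the allocation size in bytes, or None if it is unbounded
    (i.e. depends on runtime values and is unsafe on the stack)."""
    size = elem_bytes
    for dim in shape:
        if dim is DYNAMIC:
            return None  # no static bound -> "unbounded stack allocation"
        size *= dim
    return size

# memref<1x8x8xf32>: bounded, fine on the stack.
bounded = static_byte_size([1, 8, 8])
# memref<1x?x?xf32>: unbounded, which is what the pass rejects here.
unbounded = static_byte_size([1, DYNAMIC, DYNAMIC])
```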

What component(s) does this issue relate to?

Compiler

Version information

IREE: https://github.com/iree-org/iree/commit/3b751a4d2797d29422e08327b1a53933448a26fd

Additional context

No response

MaheshRavishankar commented 2 days ago

Couple of things here.

1) I would be very careful about using --iree-llvmcpu-number-of-threads=1. See https://github.com/iree-org/iree/blob/0c2c627747586ed39ce7b1f6bfc9d8b83c4a4e69/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L44 .

Without those flags, if I compile using

iree-compile  --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-enable-ukernels=all

This compiles fine, and looking at the generated code it actually does what you expect, but I also get this warning, which is relevant:

This can be done in two ways:
1. With command-line flags:
    --iree-llvmcpu-target-cpu=...
    --iree-llvmcpu-target-cpu-features=...
2. Within the IR:
    #hal.executable.target< ... , cpu="...", cpu_features="...">

In the rest of this message, these fields are referred to as just `cpu` and `cpu_features`.

Examples:

    cpu=generic
        Target a generic CPU of the target architecture. The generated code will have poor performance, but will run on any CPU.

    cpu=host
        Target the host CPU. The generated code will have optimal performance on the host CPU but will crash on other CPUs not supporting the same CPU features.

    cpu="name"
        Target a specific CPU. This is mostly used on x86. The accepted values are the same as in Clang command lines.
        List of accepted x86 CPUs: nocona, core2, penryn, bonnell, atom, silvermont, slm, goldmont, goldmont-plus, tremont, nehalem, corei7, westmere, sandybridge, corei7-avx, ivybridge, core-avx-i, haswell, core-avx2, broadwell, skylake, skylake-avx512, skx, cascadelake, cooperlake, cannonlake, icelake-client, rocketlake, icelake-server, tigerlake, sapphirerapids, alderlake, raptorlake, meteorlake, arrowlake, arrowlake-s, lunarlake, gracemont, pantherlake, sierraforest, grandridge, graniterapids, graniterapids-d, emeraldrapids, clearwaterforest, knl, knm, k8, athlon64, athlon-fx, opteron, k8-sse3, athlon64-sse3, opteron-sse3, amdfam10, barcelona, btver1, btver2, bdver1, bdver2, bdver3, bdver4, znver1, znver2, znver3, znver4, znver5, x86-64, x86-64-v2, x86-64-v3, x86-64-v4

    cpu_features="+feature1,..."
        Target a CPU supporting the comma-separated of (+-prefixed) features. The accepted values are the same as in Clang command lines.
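A hedged sketch of the `"+feature1,..."` syntax the warning describes: a comma-separated list where a `+` prefix enables a feature and a `-` prefix disables one, matching Clang's convention. This is illustrative only, not IREE's actual parser.

```python
def parse_cpu_features(spec):
    """Split a Clang-style feature string such as "+avx2,-sse4.2"
    into (enabled, disabled) sets of feature names."""
    enabled, disabled = set(), set()
    for token in filter(None, spec.split(",")):
        if token.startswith("+"):
            enabled.add(token[1:])
        elif token.startswith("-"):
            disabled.add(token[1:])
        else:
            raise ValueError(f"feature must be +/- prefixed: {token!r}")
    return enabled, disabled

features = parse_cpu_features("+avx2,+fma,-sse4.2")
```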