iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

IREE can't handle the dynamic shape for `torch.aten.bmm` op in tinyllama 1.1B model with `unpack` ukernel. #18898

Open JerryShih opened 2 days ago

JerryShih commented 2 days ago

What happened?

With --iree-llvmcpu-enable-ukernels=all or --iree-llvmcpu-enable-ukernels=unpack, IREE reports the following error for the tinyllama model in the LLVMCPUCheckIRBeforeLLVMConversionPass pass:

error: 'memref.alloca' op expected no unbounded stack allocations
%2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
       ^

The model uses dynamically shaped tensors as the bmm inputs.

Steps to reproduce your issue

Here is a simplified version of the tinyllama model with dynamic shapes: simple_tinyllama.mlir

func.func @test_torch_bmm(%A:tensor<32x?x64xf32>, %B:tensor<32x64x?xf32> ) -> (tensor<32x?x?xf32>) {
  %0 = torch_c.from_builtin_tensor %A : tensor<32x?x64xf32> -> !torch.vtensor<[32,?,64],f32>
  %1 = torch_c.from_builtin_tensor %B : tensor<32x64x?xf32> -> !torch.vtensor<[32,64,?],f32>
  %2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
  %3 = torch_c.to_builtin_tensor %2 : !torch.vtensor<[32,?,?],f32> -> tensor<32x?x?xf32>
  return %3: tensor<32x?x?xf32>
}
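For reference, a minimal sketch (plain Python, no torch or IREE) of the batched matrix multiply that `torch.aten.bmm` performs on the shapes above: the batch dim (32) and reduction dim (64) are static, while the row count of the first operand and the column count of the second are dynamic, so they can differ per call.

```python
# Illustrative batched matmul over nested lists.
# a: [B][M][K], b: [B][K][N] -> result: [B][M][N].
def bmm(a, b):
    B, M, K = len(a), len(a[0]), len(a[0][0])
    N = len(b[0][0])
    assert len(b) == B and len(b[0]) == K  # batch and reduction dims must match
    return [[[sum(a[i][m][k] * b[i][k][n] for k in range(K))
              for n in range(N)]
             for m in range(M)]
            for i in range(B)]

# M and N are only known at call time -- this is the "dynamic shape" case.
a = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]   # [2, 3, 4]
b = [[[2.0] * 5 for _ in range(4)] for _ in range(2)]   # [2, 4, 5]
out = bmm(a, b)                                         # [2, 3, 5]
```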
  1. Build the IREE compiler with default options.
  2. Compile with iree-compile using the following command:
    ./tools/iree-compile \
    --iree-hal-target-backends=llvm-cpu \
    --iree-opt-data-tiling \
    --iree-llvmcpu-enable-ukernels=all \
    --output-format=vm-bytecode \
    --iree-llvmcpu-number-of-threads=1 \
    simple_tinyllama.mlir \
    -o simple_tinyllama.mlir.vmfb
  3. Observe the error messages:
    simple_tinyllama.mlir:4:8: error: 'memref.alloca' op expected no unbounded stack allocations
    %2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
       ^
    simple_tinyllama.mlir:4:8: note: see current operation: %53 = "memref.alloca"(%50, %52) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 2, 0>}> : (index, index) -> memref<1x?x?xf32>
    simple_tinyllama.mlir:4:8: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf", ukernels = "all"}>
    %2 = torch.aten.bmm %0, %1 : !torch.vtensor<[32,?,64],f32>, !torch.vtensor<[32,64,?],f32> -> !torch.vtensor<[32,?,?],f32>
       ^
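The error above is about the `memref.alloca` of type `memref<1x?x?xf32>`: both `?` dims come from runtime values, so the stack allocation has no compile-time size bound. A hedged sketch of that kind of check (illustrative names only, not IREE's actual implementation):

```python
# Stand-in for '?' in a memref type such as memref<1x?x?xf32>.
DYNAMIC = None

def static_byte_size(shape, elem_bytes=4):
    """Return the allocation size in bytes, or None if it is unbounded
    (i.e. depends on runtime values and is unsafe on the stack)."""
    size = elem_bytes
    for dim in shape:
        if dim is DYNAMIC:
            return None  # no static bound -> "unbounded stack allocation"
        size *= dim
    return size

# memref<1x8x8xf32>: bounded, fine on the stack.
bounded = static_byte_size([1, 8, 8])
# memref<1x?x?xf32>: unbounded, which is what the pass rejects here.
unbounded = static_byte_size([1, DYNAMIC, DYNAMIC])
```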

What component(s) does this issue relate to?

Compiler

Version information

IREE: https://github.com/iree-org/iree/commit/3b751a4d2797d29422e08327b1a53933448a26fd

Additional context

No response

MaheshRavishankar commented 2 days ago

Couple of things here.

1) I would be very careful about using --iree-llvmcpu-number-of-threads=1. See https://github.com/iree-org/iree/blob/0c2c627747586ed39ce7b1f6bfc9d8b83c4a4e69/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L44 .

Without those flags, if I compile using

iree-compile  --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-enable-ukernels=all

This compiles fine, and looking at the generated code it actually does what you expect, but I also get this warning, which is relevant:

This can be done in two ways:
1. With command-line flags:
    --iree-llvmcpu-target-cpu=...
    --iree-llvmcpu-target-cpu-features=...
2. Within the IR:
    #hal.executable.target< ... , cpu="...", cpu_features="...">

In the rest of this message, these fields are referred to as just `cpu` and `cpu_features`.

Examples:

    cpu=generic
        Target a generic CPU of the target architecture. The generated code will have poor performance, but will run on any CPU.

    cpu=host
        Target the host CPU. The generated code will have optimal performance on the host CPU but will crash on other CPUs not supporting the same CPU features.

    cpu="name"
        Target a specific CPU. This is mostly used on x86. The accepted values are the same as in Clang command lines.
        List of accepted x86 CPUs: nocona, core2, penryn, bonnell, atom, silvermont, slm, goldmont, goldmont-plus, tremont, nehalem, corei7, westmere, sandybridge, corei7-avx, ivybridge, core-avx-i, haswell, core-avx2, broadwell, skylake, skylake-avx512, skx, cascadelake, cooperlake, cannonlake, icelake-client, rocketlake, icelake-server, tigerlake, sapphirerapids, alderlake, raptorlake, meteorlake, arrowlake, arrowlake-s, lunarlake, gracemont, pantherlake, sierraforest, grandridge, graniterapids, graniterapids-d, emeraldrapids, clearwaterforest, knl, knm, k8, athlon64, athlon-fx, opteron, k8-sse3, athlon64-sse3, opteron-sse3, amdfam10, barcelona, btver1, btver2, bdver1, bdver2, bdver3, bdver4, znver1, znver2, znver3, znver4, znver5, x86-64, x86-64-v2, x86-64-v3, x86-64-v4

    cpu_features="+feature1,..."
        Target a CPU supporting the comma-separated of (+-prefixed) features. The accepted values are the same as in Clang command lines.
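A hedged sketch of the `"+feature1,..."` syntax the warning describes: a comma-separated list where a `+` prefix enables a feature and a `-` prefix disables one, matching Clang's convention. This is illustrative only, not IREE's actual parser.

```python
def parse_cpu_features(spec):
    """Split a Clang-style feature string such as "+avx2,-sse4.2"
    into (enabled, disabled) sets of feature names."""
    enabled, disabled = set(), set()
    for token in filter(None, spec.split(",")):
        if token.startswith("+"):
            enabled.add(token[1:])
        elif token.startswith("-"):
            disabled.add(token[1:])
        else:
            raise ValueError(f"feature must be +/- prefixed: {token!r}")
    return enabled, disabled

features = parse_cpu_features("+avx2,+fma,-sse4.2")
```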