LLVM Error when compiling FP16 model for target CPU with IREE

suryajasper commented 1 year ago

Trying to compile an FP16-quantized Vicuna model for an ARM64 CPU target, but iree-compile fails with an LLVM error in one dispatch.

Stack trace:

>>> iree-compile first_vicuna_fp16.mlir --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-embedded-linker-path=/home/nod/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-triple=aarch64-linux-gnu --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-hal-dump-executable-sources-to=ies --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-dump-executable-benchmarks-to=../vicuna_dispatch_dump -o first_vicuna_fp16.vmfb | tee shit.log

LLVM ERROR: Do not know how to scalarize this operator's operand!

Please report issues to https://github.com/openxla/iree/issues and include the crash backtrace.
Stack dump:
0.      Program arguments: /home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-compile first_vicuna_fp16.mlir --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-embedded-linker-path=/home/nod/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-triple=aarch64-linux-gnu --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-hal-dump-executable-sources-to=ies --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-dump-executable-benchmarks-to=../vicuna_dispatch_dump -o first_vicuna_fp16.vmfb
1.      Running pass 'Function Pass Manager' on module 'first_vicuna_fp16_linked_llvm_cpu'.
2.      Running pass 'AArch64 Instruction Selection' on function '@forward_dispatch_18'
 #0 0x00007fd83cb0f398 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x6b36398)
 #1 0x00007fd83cb0d30c (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x6b3430c)
 #2 0x00007fd835fca420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
 #3 0x00007fd835e0700b raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1
 #4 0x00007fd835de6859 abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81:7
 #5 0x00007fd83657dbbd (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x5a4bbd)
 #6 0x00007fd83ca4fcf8 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x6a76cf8)
 #7 0x00007fd83ac5120b (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x4c7820b)
 #8 0x00007fd83abfe347 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x4c25347)
 #9 0x00007fd83abfea90 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x4c25a90)
#10 0x00007fd83ab8bad4 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x4bb2ad4)
#11 0x00007fd83ab8e5ba (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x4bb55ba)
#12 0x00007fd83ab90b16 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x4bb7b16)
#13 0x00007fd83b025264 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x504c264)
#14 0x00007fd83c7b8c18 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x67dfc18)
#15 0x00007fd83c7b8d8c (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x67dfd8c)
#16 0x00007fd83c7b9e6e (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x67e0e6e)
#17 0x00007fd836da0397 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0xdc7397)
#18 0x00007fd836d99dd2 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0xdc0dd2)
#19 0x00007fd83700b741 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x1032741)
#20 0x00007fd83906fdf1 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x3096df1)
#21 0x00007fd839070571 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x3097571)
#22 0x00007fd839071ed3 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x3098ed3)
#23 0x00007fd83700d24a (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x103424a)
#24 0x00007fd83906fdf1 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x3096df1)
#25 0x00007fd839070571 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x3097571)
#26 0x00007fd83906eea7 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x3095ea7)
#27 0x00007fd83906f8bc (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x30968bc)
#28 0x00007fd839070571 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x3097571)
#29 0x00007fd8390716ee (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x30986ee)
#30 0x00007fd83665141b (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x67841b)
#31 0x00007fd836649422 (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x670422)
#32 0x00007fd83664bc6b (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/_mlir_libs/libIREECompiler.so+0x672c6b)
#33 0x00007fd835de8083 __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:342:3
#34 0x000000000040108e _init (/home/nod/Documents/SHARK/shark.venv/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-compile+0x40108e)

Failing dispatch MLIR (forward_dispatch_18)

hal.executable public @forward_dispatch_18 {
  hal.executable.variant public @embedded_elf_arm_64, target = <"llvm-cpu", "embedded-elf-arm_64", {cpu = "generic", cpu_features = "+xsaves,+sse2,-hreset,-avx512cd,-sha,+xsaveopt,-kl,-avxvnni,-mwaitx,-clzero,+sse4.2,+bmi,-cldemote,-widekl,-avx512f,-raoint,+xsavec,+lzcnt,-serialize,-avxvnniint8,+fsgsbase,+aes,+sse,-sse4a,-rdpru,-tbm,-avx512bf16,-rtm,+fma,-waitpkg,-amx-fp16,-avx512ifma,-avx512vp2intersect,+popcnt,-vaes,-prefetchi,+f16c,+avx2,+sahf,+xsave,-uintr,+fxsr,+sgx,-pconfig,-avx512er,-avx512fp16,-gfni,+rdseed,+bmi2,-movdir64b,-avx512vl,-pku,-xop,-avx512bw,-avx512vbmi,+prfchw,-rdpid,+sse3,+cx16,-vpclmulqdq,-avx512vbmi2,-enqcmd,-amx-bf16,+64bit,-amx-int8,-avx512pf,-ptwrite,-amx-tile,-lwp,-avx512vpopcntdq,-avx512dq,-avxneconvert,+mmx,-fma4,-avx512vnni,-avxifma,+avx,+cmov,+sse4.1,+movbe,+invpcid,+adx,-clwb,-prefetchwt1,-cmpccxadd,+ssse3,+cx8,+clflushopt,-tsxldtrk,+pclmul,+crc32,+rdrnd,-avx512bitalg,-shstk,-movdiri,-wbnoinvd,+reserve-x18", data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128", native_vector_size = 16 : index, target_triple = "aarch64-unknown-unknown-eabi-elf", ukernels = false}> {
    hal.executable.export public @forward_dispatch_18 ordinal(0) layout(#hal.pipeline.layout<push_constants = 6, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) {
    ^bb0(%arg0: !hal.device loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3)), %arg1: index loc("first_vicuna_fp16.mlir":26:3)):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg1 loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
      hal.return %x, %y, %z : index, index, index loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
    } loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
    builtin.module {
      func.func @forward_dispatch_18() {
        %c32_i64 = arith.constant 32 : i64 loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %0 = hal.interface.constant.load[0] : i32 loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %1 = hal.interface.constant.load[1] : i32 loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %2 = hal.interface.constant.load[2] : i32 loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %3 = hal.interface.constant.load[3] : i32 loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %4 = hal.interface.constant.load[4] : i32 loc("first_vicuna_fp16.mlir":26:3)
        %5 = hal.interface.constant.load[5] : i32 loc("first_vicuna_fp16.mlir":26:3)
        %6 = arith.extui %1 : i32 to i64 loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %7 = arith.shli %6, %c32_i64 : i64 loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %8 = arith.extui %0 : i32 to i64 loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %9 = arith.ori %8, %7 : i64 loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %10 = arith.index_castui %9 : i64 to index loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %11 = arith.extui %3 : i32 to i64 loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %12 = arith.shli %11, %c32_i64 : i64 loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %13 = arith.extui %2 : i32 to i64 loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %14 = arith.ori %13, %12 : i64 loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %15 = arith.index_castui %14 : i64 to index loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %16 = arith.extui %5 : i32 to i64 loc("first_vicuna_fp16.mlir":26:3)
        %17 = arith.shli %16, %c32_i64 : i64 loc("first_vicuna_fp16.mlir":26:3)
        %18 = arith.extui %4 : i32 to i64 loc("first_vicuna_fp16.mlir":26:3)
        %19 = arith.ori %18, %17 : i64 loc("first_vicuna_fp16.mlir":26:3)
        %20 = arith.index_castui %19 : i64 to index loc("first_vicuna_fp16.mlir":26:3)
        %21 = flow.dispatch.workload.ordinal %20, 0 : index loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %22 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%10) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<32x?x?xf16>>{%21, %21} loc(fused[callsite("first_vicuna_fp16.mlir":677:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":666:11 at "first_vicuna_fp16.mlir":26:3), callsite("first_vicuna_fp16.mlir":446:11 at "first_vicuna_fp16.mlir":26:3)])
        %23 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%15) : !flow.dispatch.tensor<writeonly:tensor<32x?x?xf16>>{%21, %21} loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %24 = flow.dispatch.tensor.load %22, offsets = [0, 0, 0], sizes = [32, %21, %21], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<32x?x?xf16>>{%21, %21} -> tensor<32x?x?xf16> loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        %25 = tensor.empty(%21, %21) : tensor<32x?x?xf16> loc(callsite("first_vicuna_fp16.mlir":661:11 at "first_vicuna_fp16.mlir":26:3))
        %26 = iree_linalg_ext.softmax dimension(2) ins(%24 : tensor<32x?x?xf16>) outs(%25 : tensor<32x?x?xf16>) -> tensor<32x?x?xf16> loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        flow.dispatch.tensor.store %26, %23, offsets = [0, 0, 0], sizes = [32, %21, %21], strides = [1, 1, 1] : tensor<32x?x?xf16> -> !flow.dispatch.tensor<writeonly:tensor<32x?x?xf16>>{%21, %21} loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
        return loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
      } loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
    } loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
  } loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))
} loc(callsite("first_vicuna_fp16.mlir":712:12 at "first_vicuna_fp16.mlir":26:3))

powderluv commented 1 year ago

@bjacob @MaheshRavishankar please let us know if you have any guidance

bjacob commented 1 year ago

Thanks for the report. I've minimized the testcase and filed 2 issues as a result:

#14186, for the aarch64 case as in this original repro
#14187, for the x86-64 case, running into a different error which aarch64 might eventually also run into once the immediate issue #14186 is resolved.

The over-arching theme seems to be that this might be the first time we're compiling a FP16 Softmax, and some work is needed to make that work :-)

Also a note about the original testcase above:

The flag --iree-llvmcpu-target-cpu-features=host should not be used here. It tells iree-compile to target the same CPU as the one it is running on, so it only makes sense when the target triple matches the host architecture. That is not the case here: the target triple on the repro command line says aarch64, but the generated IR attached above has cpu_features enumerating x86-64 features, apparently coming from that --iree-llvmcpu-target-cpu-features=host flag, resulting in a lot of warnings about unknown features for this architecture.
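To make the mismatch concrete, here is a minimal Python sketch of a pre-flight check that drops the =host features flag when the target triple's architecture differs from the host's. The helper names (normalize_arch, drop_mismatched_host_features) and the alias table are hypothetical and not part of IREE; only the two flag spellings are taken from the command line in this thread.

```python
import platform

# Flag spellings copied from the repro command; everything else is illustrative.
TRIPLE_FLAG = "--iree-llvmcpu-target-triple="
HOST_FEATURES_FLAG = "--iree-llvmcpu-target-cpu-features=host"

# A few common spellings of the same architecture (assumption: not exhaustive).
_ARCH_ALIASES = {"arm64": "aarch64", "amd64": "x86_64", "x86-64": "x86_64"}


def normalize_arch(name):
    """Map an architecture name to one canonical lowercase spelling."""
    name = name.lower()
    return _ARCH_ALIASES.get(name, name)


def drop_mismatched_host_features(args, host_machine=None):
    """Return a copy of args without '=host' features when cross-compiling.

    host_machine defaults to platform.machine(); pass it explicitly in tests.
    """
    host = normalize_arch(host_machine or platform.machine())
    target = None
    for arg in args:
        if arg.startswith(TRIPLE_FLAG):
            # The architecture is the first dash-separated field of the triple.
            target = normalize_arch(arg[len(TRIPLE_FLAG):].split("-")[0])
    if target is None or target == host:
        # Native compile (or no triple given): 'host' features are meaningful.
        return list(args)
    return [a for a in args if a != HOST_FEATURES_FLAG]
```

With the repro's flags on an x86-64 machine, the returned list keeps the aarch64 triple and omits the =host flag; on an aarch64 host the arguments come back unchanged.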

bjacob commented 1 year ago

Note you have an easy way out here, which is to convert FP16->FP32 and back around each softmax. You can do that manually in the source IR or you could write a compiler pass for that if it doesn't exist already. @dcaballe might know.
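As a rough numeric illustration of that workaround (not the IR rewrite itself), the sketch below widens half-precision inputs, computes softmax in the wider type, and narrows the result back to FP16. Assumptions: the function names are hypothetical, and Python floats are 64-bit, so they stand in for the FP32 intermediate here; FP16 rounding is emulated with the struct module's "e" format.

```python
import math
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def softmax_widened(row_f16):
    """Softmax over one row of f16-valued numbers: widen, compute, narrow."""
    # "Extend" step: compute on wider floats (stand-in for f16 -> f32).
    wide = [float(v) for v in row_f16]
    # Subtract the row maximum so exp() cannot overflow.
    m = max(wide)
    exps = [math.exp(v - m) for v in wide]
    total = sum(exps)
    # "Truncate" step: round each result back to half precision.
    return [to_f16(e / total) for e in exps]


row = [to_f16(v) for v in (1.0, 2.0, 3.0)]
probs = softmax_widened(row)
```

In the source IR the same shape would be an extension before the softmax and a truncation after it, so the softmax itself only ever sees FP32.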

MaheshRavishankar commented 1 year ago

> #14187 (iree_linalg_ext.softmax on tensor<?xf16> causes linker error: undefined symbol: fmaxf), for the x86-64 case, running into a different error which aarch64 might eventually also run into once the immediate issue #14186 is resolved.

https://github.com/openxla/iree/pull/13808 should fix this issue.

allieculp commented 1 year ago

@bjacob @MaheshRavishankar Please confirm if #13808 fixed this.

hanhanW commented 1 year ago

> @bjacob @MaheshRavishankar Please confirm if #13808 fixed this.

It does not fix the issue. It still triggers the issue described in https://github.com/openxla/iree/issues/14186. We have to fix that one first.

iree-org / iree

LLVM Error when compiling FP16 model for target CPU with IREE #14182
