iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[Gemma-7b] SIGABRT when running with DT or DT+UK on Pixel 8 #16998

Open dcaballe opened 7 months ago

dcaballe commented 7 months ago

It looks like Gemma can only run on Pixel 8 with the non-DT path for now. Using DT or DT+UK leads to a SIGABRT, which is probably due to running out of memory (OOM). This is the info I can see in the tombstone:

pid: 14294, tid: 14294, name: iree-benchmark-  >>> iree-benchmark-module <<<
uid: 0
tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)
pac_enabled_keys: 000000000000000f (PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY, PR_PAC_APDBKEY)
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
    x0  0000000000000000  x1  00000000000037d6  x2  0000000000000006  x3  0000007ff3e25680
    x4  000000000000000a  x5  000000000000000a  x6  000000000000000a  x7  7f7f7f7f7f7f7f7f
    x8  00000000000000f0  x9  0000006fc40ab200  x10 0000000000000001  x11 0000006fc40f5ba0
    x12 0000000000000002  x13 0000007ff3e25540  x14 0000000000000000  x15 0000000000000000
    x16 0000006fc4160fc8  x17 0000006fc413e160  x18 0000006fc6df4000  x19 00000000000037d6
    x20 00000000000037d6  x21 00000000ffffffff  x22 0000000000030002  x23 b400006e74049bb0
    x24 0000005d19dc0cf4  x25 b400006dd404ec90  x26 0000006fc6500000  x27 0000005d19e89488
    x28 0000000000000001  x29 0000007ff3e25700
    lr  0000006fc40e6e48  sp  0000007ff3e25660  pc  0000006fc40e6e74  pst 0000000000001000

...

9 total frames
backtrace:
      #00 pc 000000000001bd28  /data/local/tmp/llms/Gemma/objs/iree_dylib_a3XjLM_mem_.so (_initializer_57_dispatch_0_pack_f32+200)
      #01 pc 00000000000d12c8  /data/local/tmp/llms/iree-benchmark-module (iree_hal_system_executable_issue_call+44) (BuildId: 9c953a3f1be794331153f6453347a38830d9ea87)
      #02 pc 00000000000d12c8  /data/local/tmp/llms/iree-benchmark-module (iree_hal_system_executable_issue_call+44) (BuildId: 9c953a3f1be794331153f6453347a38830d9ea87)
      #03 pc 00000000000aaf48  /data/local/tmp/llms/iree-benchmark-module (iree_hal_cmd_dispatch_tile+216) (BuildId: 9c953a3f1be794331153f6453347a38830d9ea87)
      #04 pc 00000000000afe98  /data/local/tmp/llms/iree-benchmark-module (iree_task_dispatch_shard_execute+248) (BuildId: 9c953a3f1be794331153f6453347a38830d9ea87)
      #05 pc 00000000000b10f0  /data/local/tmp/llms/iree-benchmark-module (iree_task_worker_main+364) (BuildId: 9c953a3f1be794331153f6453347a38830d9ea87)
      #06 pc 00000000000bb004  /data/local/tmp/llms/iree-benchmark-module (iree_thread_start_routine+240) (BuildId: 9c953a3f1be794331153f6453347a38830d9ea87)
      #07 pc 00000000000ca7cc  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+204) (BuildId: 33ad5959e2b38fc822cda3c642e16c94)
      #08 pc 00000000000607b0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 33ad5959e2b38fc822cda3c642e16c94)

The backtrace points to _initializer_57_dispatch_0_pack_f32, which looks like this (%5 is a huge f32 allocation: a 32000x3072x8x1 f32 tensor is ~3.1 GB):

hal.executable public @_initializer_57_dispatch_0 {
  hal.executable.variant public @system_elf_arm_64 target(<"llvm-cpu", "system-elf-arm_64", {cpu = "", cpu_features = "+v9a,+fullfp16,+fp-armv8,+neon,+aes,+sha2,+crc,+lse,+rdm,+complxnum,+rcpc,+sha3,+sm4,+dotprod,+fp16fml,+dit,+flagm,+ssbs,+sb,+sve2-aes,+sve2-bitperm,+sve2-sha3,+sve2-sm4,+altnzcv,+fptoint,+bf16,+i8mm,+bti,+mte,+pauth,+perfmon,+predres,+spe,+ras,+reserve-x18", data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128", debug_symbols = false, link_embedded = false, native_vector_size = 16 : i64, target_triple = "aarch64-none-linux-android34", ukernels = "none"}>) {
    hal.executable.export public @_initializer_57_dispatch_0_pack_f32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 1, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>]} {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @_initializer_57_dispatch_0_pack_f32() {
        %c0 = arith.constant 0 : index
        %0 = hal.interface.constant.load[0] : i32
        %1 = arith.index_castui %0 : i32 to index
        %2 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256000x3072xf32>>
        %3 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%1) : !flow.dispatch.tensor<writeonly:tensor<32000x3072x8x1xf32>>
        %4 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [256000, 3072], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<256000x3072xf32>> -> tensor<256000x3072xf32>
        %5 = tensor.empty() : tensor<32000x3072x8x1xf32>
        %pack = tensor.pack %4 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %5 : tensor<256000x3072xf32> -> tensor<32000x3072x8x1xf32>
        flow.dispatch.tensor.store %pack, %3, offsets = [0, 0, 0, 0], sizes = [32000, 3072, 8, 1], strides = [1, 1, 1, 1] : tensor<32000x3072x8x1xf32> -> !flow.dispatch.tensor<writeonly:tensor<32000x3072x8x1xf32>>
        return
      }
    }
  }
}
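
For intuition, this pack tiles dim 0 by 8 and dim 1 by 1 (outer_dims_perm is the identity), so packed[i, j, ii, jj] = source[i*8 + ii, j + jj]. Below is a minimal NumPy sketch of the same relayout, using hypothetical scaled-down shapes in place of the real 256000x3072 tensor:

    import numpy as np

    # Scaled-down stand-ins for tensor<256000x3072xf32>; the tile size matches
    # the dispatch's inner_tiles = [8, 1].
    ROWS, COLS, TILE = 16, 4, 8

    src = np.arange(ROWS * COLS, dtype=np.float32).reshape(ROWS, COLS)

    # tensor.pack with inner_dims_pos = [0, 1], inner_tiles = [8, 1], and identity
    # outer_dims_perm: split dim 0 into (ROWS // TILE, TILE) blocks and move both
    # tile dims innermost, giving shape (ROWS // TILE, COLS, TILE, 1).
    packed = src.reshape(ROWS // TILE, TILE, COLS, 1).transpose(0, 2, 1, 3)

    # Indexing relation: packed[i, j, ii, jj] == src[i*TILE + ii, j + jj].
    assert packed[1, 2, 3, 0] == src[1 * TILE + 3, 2]

The relayout only reorders elements, so the packed result is exactly as large as the source.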

Just a few ideas: we may want to look at what we are packing here and whether we end up duplicating the same tensors with different layouts. The fact that the dispatch is dispatch_0 suggests that we might be packing constants/inputs, so perhaps there is a missing hoisting opportunity or some input preprocessing we can do to reduce the memory footprint.

To repro:

  1. Download Gemma from https://discord.com/channels/689900678990135345/1146173056537079919/1212949110718730260.
  2. Compile it with:
    iree-compile --iree-hal-target-backends=llvm-cpu --iree-input-type=auto --iree-llvmcpu-target-cpu-features=+v9a,+fullfp16,+fp-armv8,+neon,+aes,+sha2,+crc,+lse,+rdm,+complxnum,+rcpc,+sha3,+sm4,+dotprod,+fp16fml,+dit,+flagm,+ssbs,+sb,+sve2-aes,+sve2-bitperm,+sve2-sha3,+sve2-sm4,+altnzcv,+fptoint,+bf16,+i8mm,+bti,+mte,+pauth,+perfmon,+predres,+spe,+ras --iree-llvmcpu-target-triple=aarch64-none-linux-android34 --iree-llvmcpu-link-embedded=false --iree-input-demote-f64-to-f32=false --iree-input-demote-i64-to-i32=false --iree-opt-data-tiling=true --iree-llvmcpu-enable-ukernels=all --iree-llvmcpu-use-fast-min-max-ops gemma_7b.mlir -o gemma_7b_uk.vmfb
  3. Run it on a Pixel 8 with:
    iree-benchmark-module '--device=local-task' '--task_topology_cpu_ids=4,5,6,7,8' '--module=gemma_7b_uk.vmfb' '--function=run_forward' '--input=1x1xi64=0' '--parameters=model=./gf32.safetensors'
pzread commented 7 months ago

A small note: _initializer_57_dispatch_0 packs tensor<256000x3072xf32> into tensor<32000x3072x8x1xf32>. This already takes ~6 GB of RAM by itself during packing.
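
That estimate checks out: the pack preserves the element count, so the source and destination each hold 256000 * 3072 f32 elements, and both buffers are live at once during the dispatch. A quick check of the arithmetic:

    # Back-of-the-envelope footprint for _initializer_57_dispatch_0_pack_f32.
    f32_bytes = 4
    src_bytes = 256_000 * 3_072 * f32_bytes         # tensor<256000x3072xf32>
    dst_bytes = 32_000 * 3_072 * 8 * 1 * f32_bytes  # tensor<32000x3072x8x1xf32>
    print(src_bytes / 2**30)                        # ~2.93 GiB
    print((src_bytes + dst_bytes) / 1e9)            # ~6.29 GB while both are live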

benvanik commented 7 months ago

yeah, it's critical that we get either compile-time or deploy-time packing implemented - we may be able to skate by with small models (inefficiently), but it doesn't work on big ones.
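
To illustrate what deploy-time packing could look like (a hypothetical sketch, not existing IREE tooling): relayout the weights into the tiled format once on the host, store the packed tensors as the parameters, and the initializer dispatch no longer needs to materialize both layouts on the device. Assuming the 8x1 tiling from the dispatch above:

    import numpy as np

    def pack_8x1(weight: np.ndarray) -> np.ndarray:
        # Relayout (rows, cols) into (rows // 8, cols, 8, 1) tiles, matching
        # inner_dims_pos = [0, 1], inner_tiles = [8, 1] from the dispatch.
        rows, cols = weight.shape
        assert rows % 8 == 0  # the real tensor.pack would pad; omitted here
        return weight.reshape(rows // 8, 8, cols, 1).transpose(0, 2, 1, 3)

    # Run once at deploy time, e.g. when producing the parameter file, so the
    # device only ever sees the packed layout.
    w = np.zeros((16, 4), dtype=np.float32)  # stand-in for the real weight
    assert pack_8x1(w).shape == (2, 4, 8, 1)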