iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[llm perf] Slow kernel turbine_llm_mmtfp_3d_8640_3200_f32f16 #17022

Open · stellaraccident opened 5 months ago

stellaraccident commented 5 months ago
// iree-compile --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host -o turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb turbine_llm_mmtfp_3d_8640_3200_f32f16.mlir
// iree-benchmark-module --module=turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb --function=turbine_llm_mmtfp_3d_8640_3200_f32f16 --input=4x128x3200xf32 --input=8640x3200xf16

#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
module {
  util.func public @turbine_llm_mmtfp_3d_8640_3200_f32f16(%arg0: tensor<?x?x3200xf32>, %arg1: tensor<8640x3200xf16>) -> tensor<?x?x8640xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?x3200xf32>
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?x3200xf32>
    %0 = tensor.empty(%dim) : tensor<?x8640x3200xf16>
    %1 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg1 : tensor<8640x3200xf16>) outs(%0 : tensor<?x8640x3200xf16>) {
    ^bb0(%in: f16, %out: f16):
      linalg.yield %in : f16
    } -> tensor<?x8640x3200xf16>
    %2 = tensor.empty(%dim, %dim_0) : tensor<?x?x8640xf32>
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    %4 = linalg.batch_matmul_transpose_b ins(%arg0, %1 : tensor<?x?x3200xf32>, tensor<?x8640x3200xf16>) outs(%3 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    util.return %4 : tensor<?x?x8640xf32>
  }
}

Tested on CPU. Performance is at least an order of magnitude below expectations. Needs to be fast on all supported backends.

bjacob commented 5 months ago

What CPU are you measuring on? Here on AMD 7950X3D, setting 1 thread (to be able to make sense of single-thread performance on this CPU) I get items_per_second=2.66772/s, which amounts to 75 Gflop/s (counting each multiply-add as two ops as usual).

https://www.google.com/search?q=4*128*3200*8640*2*2.6677*1e-9
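
(Same arithmetic as that search, runnable locally: total flops per invocation, counting two ops per multiply-add, times items/second, in Gflop/s.)

python3 -c "print(4*128*3200*8640 * 2 * 2.66772 * 1e-9)"
# prints ~75.5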

On this CPU, a f32 x f32 matmul kernel microbenchmark does 175 Gflop/s:

./runtime/src/iree/builtins/ukernel/tools/mmt4d_benchmark --benchmark_filter=f32f32f32_tile_16x16x1_avx512

So I'd say I can reproduce up to ~2x slowness here, not exactly "an order of magnitude".

The codegen isn't bad at all: compiling with --iree-hal-dump-executable-intermediates-to=/tmp --x86-asm-syntax=intel I get this inner loop:

.LBB2_11:
    .loc    1 0 3
    vcvtph2ps   zmm16, ymmword ptr [r11 + rax]
    vfmadd231ps zmm0, zmm16, dword ptr [r12 + 2*rax - 60]{1to16}
    vfmadd231ps zmm2, zmm16, dword ptr [r12 + 2*rax - 56]{1to16}
    vfmadd231ps zmm1, zmm16, dword ptr [r12 + 2*rax - 52]{1to16}
    vfmadd231ps zmm4, zmm16, dword ptr [r12 + 2*rax - 48]{1to16}
    vfmadd231ps zmm3, zmm16, dword ptr [r12 + 2*rax - 44]{1to16}
    vfmadd231ps zmm6, zmm16, dword ptr [r12 + 2*rax - 40]{1to16}
    vfmadd231ps zmm5, zmm16, dword ptr [r12 + 2*rax - 36]{1to16}
    vfmadd231ps zmm8, zmm16, dword ptr [r12 + 2*rax - 32]{1to16}
    vfmadd231ps zmm7, zmm16, dword ptr [r12 + 2*rax - 28]{1to16}
    vfmadd231ps zmm10, zmm16, dword ptr [r12 + 2*rax - 24]{1to16}
    vfmadd231ps zmm9, zmm16, dword ptr [r12 + 2*rax - 20]{1to16}
    vfmadd231ps zmm12, zmm16, dword ptr [r12 + 2*rax - 16]{1to16}
    vfmadd231ps zmm11, zmm16, dword ptr [r12 + 2*rax - 12]{1to16}
    vfmadd231ps zmm14, zmm16, dword ptr [r12 + 2*rax - 8]{1to16}
    vfmadd231ps zmm13, zmm16, dword ptr [r12 + 2*rax - 4]{1to16}
    vfmadd231ps zmm15, zmm16, dword ptr [r12 + 2*rax]{1to16}
    .loc    1 4 3
    add rax, 32
    cmp rax, 102400
    jne .LBB2_11

This is good -- I wouldn't do better in a ukernel. The f16 side is converted to f32 in the vcvtph2ps instruction at the start of the loop body, and then the rest is a normal AVX-512 f32 kernel. I don't think there's anything better to do on this target.

So the fact that we don't have ukernels for f32f16 cases like this doesn't matter --- in this instance, pure codegen is doing great.

Still, the 2x gap between the observed e2e performance and the microbenchmark above (which benchmarks a kernel that looks similar to this one, except for the single vcvtph2ps instruction) means that something outside of mmt4d is slow.

Anyone picking up the investigation from here: the next step is to grab a Tracy profile and see where the time outside of the mmt4d dispatch goes.
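
For reference, a rough sketch of capturing such a profile; the cmake option and capture tool below are from memory rather than from this thread, so treat them as assumptions and double-check against IREE's profiling docs:

# Assumes the runtime tools were built with tracing enabled, e.g. with
# -DIREE_ENABLE_RUNTIME_TRACING=ON at cmake configure time (assumption).
iree-tracy-capture -o turbine_llm_mmtfp.tracy &
# TRACY_NO_EXIT keeps the instrumented process alive until the capture has
# drained, so a short benchmark does not exit before the profile is saved.
TRACY_NO_EXIT=1 iree-benchmark-module \
  --module=turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb \
  --function=turbine_llm_mmtfp_3d_8640_3200_f32f16 \
  --input=4x128x3200xf32 --input=8640x3200xf16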

hanhanW commented 5 months ago

> What CPU are you measuring on? Here on AMD 7950X3D, setting 1 thread (to be able to make sense of single-thread performance on this CPU) I get items_per_second=2.66772/s, which amounts to 75 Gflop/s (counting each multiply-add as two ops as usual).

+1, I wonder what the target CPU is as well.

Thanks @bjacob for the great analysis! A potential performance bug could be in packing for f16 types. I have been working on pack codegen on and off, but the work scoped in https://github.com/openxla/iree/issues/16314 is not finished yet. So +1 on what Benoit suggested: we need to Tracy-profile this.

@MaheshRavishankar is this one of the tasks that you mentioned @pashu123 could pick up? If so, he can start with what Benoit suggested.

stellaraccident commented 5 months ago

Will you accept "order of magnitude, where magnitude is a power of 2"? :)

Really, this was a test case for the methodology. This is a common kernel in a popular dataset, and I'd be interested to know why it isn't getting near where it should be.

Here's a full example trace of a different case, run on a live model: https://drive.google.com/file/d/1lRXO0Eb9aIF3lZmG6-gN8zN0OGr6zZKm/view?usp=drive_link (this includes Q4_1 dequantization)

iree-cpuinfo 
sse3                 1
ssse3                1
sse4.1               1
sse4.2               1
sse4a                1
avx                  1
fma                  1
fma4                 0
xop                  0
f16c                 1
avx2                 1
avx512f              0
avx512cd             0
avx512vl             0
avx512dq             0
avx512bw             0
avx512ifma           0
avx512vbmi           0
avx512vpopcntdq      0
avx512vnni           0
avx512vbmi2          0
avx512bitalg         0
avx512bf16           0
avx512fp16           0
amx-tile             0
amx-int8             0
amx-bf16             0
stellaraccident commented 5 months ago

And here is a full run with q8_0 kernels on a 3B openllama: https://drive.google.com/file/d/1S-Dm7nlt2jyNfIBJkFhEEAybdd-mEuiD/view?usp=drive_link

stellaraccident commented 5 months ago

(when I say "trace" in this context, I mean an execution log dump: the MLIR, the VMFB, and a log file that tells you what shapes it was invoked with and the observed execution time)

MaheshRavishankar commented 5 months ago

> > What CPU are you measuring on? Here on AMD 7950X3D, setting 1 thread (to be able to make sense of single-thread performance on this CPU) I get items_per_second=2.66772/s, which amounts to 75 Gflop/s (counting each multiply-add as two ops as usual).
>
> +1, I wonder what the target CPU is as well.
>
> Thanks @bjacob for the great analysis! A potential performance bug could be in packing for f16 types. I have been working on pack codegen on and off, but the work scoped in #16314 is not finished yet. So +1 on what Benoit suggested: we need to Tracy-profile this.
>
> @MaheshRavishankar is this one of the tasks that you mentioned @pashu123 could pick up? If so, he can start with what Benoit suggested.

Yes. Already spoke to @pashu123 about this. He is going to start looking into it.

pashu123 commented 5 months ago

Here's the benchmark dump:

Running iree-benchmark-module
Run on (16 X 5015.12 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.13, 0.32, 0.10
***WARNING*** Library was built as DEBUG. Timings may be affected.
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
BM_turbine_llm_mmtfp_3d_8640_3200_f32f16/process_time/real_time       77.3 ms          484 ms            9 items_per_second=12.9406/s 

Here's the Tracy profile

[Tracy profile screenshot]

@hanhanW The potential perf bug remains in the pack f16 dispatch. Although it is vectorized, the vector type is vector<1xf16> in the LLVM IR dump.

Also, the unpack_f32 dispatch is not vectorized at all.

pashu123 commented 5 months ago

> --iree-hal-dump-executable-intermediates-to=/tmp --x86-asm-syntax=intel

I will start working on enabling the right vectorization for pack f16. @hanhanW @MaheshRavishankar Sounds good?

hanhanW commented 5 months ago

The issue is not just vectorization. Can you attach the dispatches to the issue? I wonder whether this is LHS packing or RHS packing; I need to see the actual pack op. It could be a fair amount of work (e.g., implementing features, benchmarking, and analyzing IRs/perf) or low-hanging fruit (e.g., setting proper tile sizes).

To dump the dispatches, you can add --iree-hal-dump-executable-sources-to=$HOME/dump to iree-compile.
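
For example, reusing the compile command from the issue description (only the dump flag is new here):

iree-compile --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  --iree-hal-dump-executable-sources-to=$HOME/dump \
  -o turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb \
  turbine_llm_mmtfp_3d_8640_3200_f32f16.mlir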

pashu123 commented 5 months ago

> --iree-hal-dump-executable-sources-to=$HOME/dump

Here are the dispatches https://gist.github.com/pashu123/a291c7cfc6d2d47930234bf3257d46f8

hanhanW commented 5 months ago

So this is a broadcast + rhs packing dispatch. It is fine that the batch dimension is dynamic, because we will always tile it with size=1 in this case. There are a few known issues:

  1. We need to set larger tile sizes.
  2. We need to enable flattening and vector linearization for the CPU backends.

(2) may be tricky. @dcaballe and I have been looking at it, but we have not landed it yet. There are issues in the draft PR, and we need someone to help triage: https://github.com/openxla/iree/pull/16456

With (1) and (2), vectorization generates vector<2x16xf16> vectors, which are then flattened to vector<1x32xf16> during vector lowering.
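
A minimal MLIR sketch of what that flattening looks like (illustrative only; in the compiler this is done by the linearization patterns from the draft PR above, not by hand-written IR):

func.func @flatten_example(%v: vector<2x16xf16>) -> vector<1x32xf16> {
  // Same element count (2*16 == 1*32), so the cast only reshapes; the payoff
  // is that the subsequent store becomes one contiguous 512-bit write.
  %flat = vector.shape_cast %v : vector<2x16xf16> to vector<1x32xf16>
  return %flat : vector<1x32xf16>
}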

hanhanW commented 5 months ago

@pashu123 can you attach the IR dumps for the dispatch?

hanhanW commented 5 months ago

@pashu123 and I looked at the IR dump together today, and we found that the vector-level tile sizes are all set to 1 in the lowering_config. My intuition is that the logic in the elementwise-op strategy selection is outdated: we used to tile dims with size=1 when there were dynamic shapes, because at the time vectorization only worked with static shapes. Today we have peeling, masking, and other tricks, so we need to revisit it. Here are two action items from the discussion:

  1. Try different lowering_configs and look at the IR dumps and final code (maybe use [1, 1, 16] or [1, 2, 16] as vector-level tile sizes).
  2. Teach KernelDispatch.cpp to produce such a config.

To iterate quickly on (1), we can preset translation_info and lowering_config on the op. E.g., see the example below and run iree-opt --pass-pipeline='builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target))' repro.mlir

https://github.com/openxla/iree/blob/bd1b10626cb02d3d6c05f67977d1800020203b40/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_tests.mlir#L300-L322

side note: please remember to update hal.executable.target in your experiments.

pashu123 commented 5 months ago

> @pashu123 can you attach the IR dumps for the dispatch?

I have already shared https://gist.github.com/pashu123/bb3a999d6b3a60ab94f7d80c863b67fb, but I am attaching it here for completeness.

bjacob commented 5 months ago

> please remember to update hal.executable.target in your experiments.

Indeed, this could be important. The above example has cpu_features = "+fma,+avx512f" and that is missing a number of potentially relevant CPU features that could be used in the pack codegen.

The exact type f16 is not relevant in itself; all that should matter is that it's a 2-byte type. So f16-specific CPU features such as +f16c should not be relevant. But at least +avx512bw should be important here, and potentially any of the following as well: +avx,+avx2,+avx512vl,+avx512dq.
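
For instance, a variant of the compile command from the issue description with those features spelled out (the exact list depends on the actual target CPU; this one just mirrors the features named above):

iree-compile --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=+fma,+avx,+avx2,+avx512f,+avx512bw,+avx512vl,+avx512dq \
  -o turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb \
  turbine_llm_mmtfp_3d_8640_3200_f32f16.mlir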

hanhanW commented 4 months ago

Thanks @pashu123 for the deep investigation over the past few days. Prashant and I looked at the issue closely, and we are making some progress. The numbers below are generated from Prashant's artifacts. They were run on my VM, but that should be fine because the zen4-specific features are not used in the matmul. There are two major issues as I see it:

  1. We don't select reasonable tile sizes for the tensor.pack op, so it generates inefficient code.
  2. The unpack op is not properly vectorized.

With a proper lowering_config (which kicks in the 16x16 transpose optimization), the single-threaded performance breakdown is: with (1), we can improve broadcast + pack_on_lhs from 136 ms to 55.55 ms; with (2), we can improve unpack from 5.82 ms to 2.29 ms.

We are actually doing okay on broadcast + pack_on_lhs in the prototype. I wrote C++ code to profile memcpy as a baseline, and it takes 63 ms on my VM. There might be some overhead in my C++ code, because it is slower than broadcast + pack_on_lhs, but the numbers are very close and they look okay to me.

Here is the ASM dump for the broadcast + pack_on_lhs: https://gist.github.com/hanhanW/f53428a81521d681a6c1d5cc8f65a017

We don't fully utilize the 512-bit registers in the pack (they are all ymm registers), so it is still somewhat sub-optimal. We can try to improve that if needed, but it will require some amount of work.

@stellaraccident what is your expectation? How do we generate baseline numbers for the case?

@pashu123 please send PRs to fix (1) and (2). You already have something for (1); the prototype for (2) can be found at https://github.com/iree-org/iree/pull/17002. There might be other issues in the multi-threaded case. Let's fix these two issues first and revisit later.

(side note: these are the microbenchmarks we've been using for the issue.)

benvanik commented 4 months ago

(a dispatch that broadcasts is odd - are we sure we need that?)

hanhanW commented 4 months ago

> (a dispatch that broadcasts is odd - are we sure we need that?)

It could just be a snippet of IR from the model. It does not matter here because it is free: we already need to pay for packing, and broadcast + pack is as fast as a single pack.

benvanik commented 4 months ago

It's not free if we could have packed 16x less - less to write and less to read :)