Open stellaraccident opened 7 months ago
Tested on CPU. Performance is at least an order of magnitude below expectations. Needs to be fast on all supported backends.
What CPU are you measuring on? Here on AMD 7950X3D, setting 1 thread (to be able to make sense of single-thread performance on this CPU), I get items_per_second=2.66772/s, which amounts to 75 Gflop/s (counting each multiply-add as two ops, as usual).
https://www.google.com/search?q=4*128*3200*8640*2*2.6677*1e-9
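For reference, a minimal sketch of that arithmetic (the shape factors 4*128*3200*8640 are taken from the query above and are an assumption about the benchmark's per-item work; 2 flops per multiply-add):

// Hedged sketch of the Gflop/s arithmetic in the query above.
#include <cstdio>

int main() {
  const double flops_per_item = 4.0 * 128 * 3200 * 8640 * 2;  // 2 flops per multiply-add
  const double items_per_second = 2.66772;                    // benchmark counter, 1 thread
  std::printf("%.1f Gflop/s\n", flops_per_item * items_per_second * 1e-9);  // prints ~75.5
  return 0;
}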
On this CPU, an f32 x f32 matmul kernel microbenchmark does 175 Gflop/s:
./runtime/src/iree/builtins/ukernel/tools/mmt4d_benchmark --benchmark_filter=f32f32f32_tile_16x16x1_avx512
So I'd say I can reproduce up to ~2x slowness here, not exactly "an order of magnitude".
The codegen isn't bad at all: compiling with --iree-hal-dump-executable-intermediates-to=/tmp --x86-asm-syntax=intel, I get this inner loop:
.LBB2_11:
.loc 1 0 3
vcvtph2ps zmm16, ymmword ptr [r11 + rax]
vfmadd231ps zmm0, zmm16, dword ptr [r12 + 2*rax - 60]{1to16}
vfmadd231ps zmm2, zmm16, dword ptr [r12 + 2*rax - 56]{1to16}
vfmadd231ps zmm1, zmm16, dword ptr [r12 + 2*rax - 52]{1to16}
vfmadd231ps zmm4, zmm16, dword ptr [r12 + 2*rax - 48]{1to16}
vfmadd231ps zmm3, zmm16, dword ptr [r12 + 2*rax - 44]{1to16}
vfmadd231ps zmm6, zmm16, dword ptr [r12 + 2*rax - 40]{1to16}
vfmadd231ps zmm5, zmm16, dword ptr [r12 + 2*rax - 36]{1to16}
vfmadd231ps zmm8, zmm16, dword ptr [r12 + 2*rax - 32]{1to16}
vfmadd231ps zmm7, zmm16, dword ptr [r12 + 2*rax - 28]{1to16}
vfmadd231ps zmm10, zmm16, dword ptr [r12 + 2*rax - 24]{1to16}
vfmadd231ps zmm9, zmm16, dword ptr [r12 + 2*rax - 20]{1to16}
vfmadd231ps zmm12, zmm16, dword ptr [r12 + 2*rax - 16]{1to16}
vfmadd231ps zmm11, zmm16, dword ptr [r12 + 2*rax - 12]{1to16}
vfmadd231ps zmm14, zmm16, dword ptr [r12 + 2*rax - 8]{1to16}
vfmadd231ps zmm13, zmm16, dword ptr [r12 + 2*rax - 4]{1to16}
vfmadd231ps zmm15, zmm16, dword ptr [r12 + 2*rax]{1to16}
.loc 1 4 3
add rax, 32
cmp rax, 102400
jne .LBB2_11
This is good -- I wouldn't do better in a ukernel. The f16 side is converted to f32 by the vcvtph2ps instruction at the start of the loop body, and then the rest is a normal AVX-512 f32 kernel. I don't think there's anything better to do on this target.
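A rough, illustrative C++ intrinsics sketch of what that loop body amounts to (not the generated code; which operand is f16 vs f32, and the names, are assumptions read off the asm above):

#include <immintrin.h>
#include <cstdint>

// One K step of a 16-lane accumulator tile: widen 16 f16 values to f32 once
// (vcvtph2ps), then 16 FMAs, each broadcasting one f32 scalar to all lanes
// (vfmadd231ps with a {1to16} memory operand).
void inner_step(const uint16_t* f16_side,  // 16 f16 values (the 32-byte ymmword load)
                const float* f32_side,     // 16 f32 scalars (the broadcast operands)
                __m512 acc[16]) {
  __m512 widened = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i*)f16_side));
  for (int i = 0; i < 16; ++i) {
    acc[i] = _mm512_fmadd_ps(widened, _mm512_set1_ps(f32_side[i]), acc[i]);
  }
}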
So the fact that we don't have ukernels for f32f16 cases like this doesn't matter --- in this instance, pure codegen is doing great.
Still, the 2x gap between the observed end-to-end performance and the microbenchmark above (which benchmarks a kernel that looks similar to this one, except for the single vcvtph2ps instruction) means that something outside of mmt4d is slow.
Anyone picking up the investigation from here: the next step is to profile (e.g., with Tracy) to see where the time outside of mmt4d is going.
+1, I wonder about the target CPU as well.
Thanks @bjacob for the great analysis! A potential performance bug could be in the packing of f16 types. I have been working on pack codegen on and off, but the work scoped in https://github.com/openxla/iree/issues/16314 is not finished yet. So +1 on what Benoit suggested: we need to profile this with Tracy.
@MaheshRavishankar is this one of the tasks that you mentioned @pashu123 could pick up? If so, he can start with what Benoit suggested.
Will you accept "order of magnitude, where magnitude is a power of 2"? :)
Really, this was a test case of the methodology. This is a common kernel in a popular dataset, and I'd be interested to know why it isn't getting near where it should.
Here's a full example trace of a different case, run on a live model: https://drive.google.com/file/d/1lRXO0Eb9aIF3lZmG6-gN8zN0OGr6zZKm/view?usp=drive_link (this includes Q4_1 dequantization)
iree-cpuinfo
sse3 1
ssse3 1
sse4.1 1
sse4.2 1
sse4a 1
avx 1
fma 1
fma4 0
xop 0
f16c 1
avx2 1
avx512f 0
avx512cd 0
avx512vl 0
avx512dq 0
avx512bw 0
avx512ifma 0
avx512vbmi 0
avx512vpopcntdq 0
avx512vnni 0
avx512vbmi2 0
avx512bitalg 0
avx512bf16 0
avx512fp16 0
amx-tile 0
amx-int8 0
amx-bf16 0
And here is a full run with q8_0 kernels on a 3B openllama: https://drive.google.com/file/d/1S-Dm7nlt2jyNfIBJkFhEEAybdd-mEuiD/view?usp=drive_link
(when I say "trace" in this context, I mean, execution log dump of MLIR, VMFB, and a log file that tells you what shapes it was invoked with and observed execution time)
@MaheshRavishankar is this one of the tasks that you mentioned @pashu123 could pick up? If so, he can start with what Benoit suggested.
Yes. Already spoke to @pashu123 about this. He is going to start looking into it.
Here's the benchmark dump:
Running iree-benchmark-module
Run on (16 X 5015.12 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 1.13, 0.32, 0.10
***WARNING*** Library was built as DEBUG. Timings may be affected.
--------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
BM_turbine_llm_mmtfp_3d_8640_3200_f32f16/process_time/real_time 77.3 ms 484 ms 9 items_per_second=12.9406/s
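For context, applying the same per-item flop count that Benoit used earlier (an assumption that this is the same benchmark shape), the 12.94 items/s above corresponds to roughly:

// Back-of-the-envelope, reusing the per-item flop count from the earlier comment.
// Note: items_per_second here is wall-clock, and the run used multiple threads
// (CPU time 484 ms vs real time 77.3 ms).
#include <cstdio>

int main() {
  const double flops_per_item = 4.0 * 128 * 3200 * 8640 * 2;
  std::printf("%.0f Gflop/s\n", flops_per_item * 12.9406 * 1e-9);  // ~366 (wall-clock)
  return 0;
}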
Here's the Tracy profile
@hanhanW The potential perf bug remains in the pack f16 dispatch. Though it was vectorized, the vector type is vector<1xf16> in the LLVM IR dump.
Also, the unpack_f32 dispatch is not vectorized at all.
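To illustrate why vector<1xf16> is a red flag for a pack: a pack is essentially a strided copy, and moving one 2-byte element per iteration leaves almost all of the load/store width unused. A minimal sketch of the two granularities (the tile size 16 and the layouts are assumptions for illustration, not the actual dispatch):

#include <cstdint>
#include <cstring>

// Element-at-a-time pack into [cols/16][rows][16] tiles -- roughly what
// vector<1xf16> codegen amounts to.
void pack_scalar(const uint16_t* src, uint16_t* dst, int rows, int cols) {
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      dst[(c / 16) * rows * 16 + r * 16 + (c % 16)] = src[r * cols + c];
}

// Same layout, but moving 16 contiguous f16 (32 bytes) per store; assumes cols
// is a multiple of 16. This is the kind of width a proper vectorization gets.
void pack_wide(const uint16_t* src, uint16_t* dst, int rows, int cols) {
  for (int c0 = 0; c0 < cols; c0 += 16)
    for (int r = 0; r < rows; ++r)
      std::memcpy(&dst[(c0 / 16) * rows * 16 + r * 16], &src[r * cols + c0],
                  16 * sizeof(uint16_t));
}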
I will start working on enabling the right vectorization for pack f16. @hanhanW @MaheshRavishankar Sounds good?
The issue is not just vectorization. Can you attach the dispatches to the issue? I wonder if this is LHS packing or RHS packing; I need to see the actual pack op. It could be a real amount of work (e.g., implementing features, benchmarking, and analyzing IR/perf) or low-hanging fruit (e.g., setting proper tile sizes).
To dump the dispatches, you can add --iree-hal-dump-executable-sources-to=$HOME/dump to iree-compile.
Here are the dispatches https://gist.github.com/pashu123/a291c7cfc6d2d47930234bf3257d46f8
So this is a broadcast + rhs packing dispatch. It is fine that the batch dimension is dynamic, because we will always tile it with size=1 in this case. There are a few known issues:
(2) may be tricky. @dcaballe and I have been looking at it, but we have not landed it yet. There are issues in the draft PR, and we need someone to help triage: https://github.com/openxla/iree/pull/16456
With (1) and (2), we generate vector<2x16xf16> vectors during vectorization and flatten them to vector<1x32xf16> during vector lowering.
@pashu123 can you attach the IR dumps for the dispatch?
@pashu123 and I looked at the IR dump together today, and we found that the vector-level tile sizes are all set to 1s in the lowering_config. My intuition is that the logic in the elementwise-op strategy selection is outdated: we used to tile dims with size=1 when there are dynamic shapes, because we did not have a vectorization strategy for them; it only worked with static shapes. Today we have peeling, masking, etc., so we need to revisit it. Here are two action items after the discussion:
To quickly iterate on (1), we can preset translation_info and lowering_config on the op. E.g., see the example below and run iree-opt --pass-pipeline='builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target))' repro.mlir
Side note: please remember to update hal.executable.target in your experiments.
@pashu123 can you attach the IR dumps for the dispatch?
I have already shared https://gist.github.com/pashu123/bb3a999d6b3a60ab94f7d80c863b67fb, but I am attaching it here for completeness.
please remember to update hal.executable.target in your experiments.
Indeed, this could be important. The above example has cpu_features = "+fma,+avx512f", and that is missing a number of potentially relevant CPU features that could be used in the pack codegen.
The exact type f16 is not relevant in itself; all that should matter is that it's a 2-byte type. So f16-specific CPU features such as +f16c should not be relevant. But at least +avx512bw should be important here, and potentially any of the following as well: +avx,+avx2,+avx512vl,+avx512dq.
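For what it's worth, here is a minimal illustration of why +avx512bw (rather than any f16-specific feature) matters for 2-byte-element packing: 16-bit-lane shuffles at 512-bit width, such as vpermw, require AVX512BW, so with only +avx512f the pack of f16 data is constrained to narrower or coarser-grained moves. (Illustrative only; this is not taken from the IREE pack codegen.)

#include <immintrin.h>

// vpermw: permute the 32 x 16-bit lanes of a zmm register. This intrinsic
// requires AVX512BW; it is not available with AVX512F alone.
__m512i permute_u16(__m512i data, __m512i idx) {
  return _mm512_permutexvar_epi16(idx, data);
}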
Thanks @pashu123 for the deep investigation over the past few days. Prashant and I looked at the issue closely, and we are making some progress. The numbers below are generated from Prashant's artifacts. They were executed on my VM, but that should be fine, because the Zen 4 specific features are not used in the matmul. There are two major issues, as I see it.
With a proper lowering_config (which enables the 16x16 transpose optimization), the performance breakdown (1-threaded) is:
With (1), we can improve broadcast + pack_on_lhs from 136 ms to 55.55 ms. With (2), we can improve unpack from 5.82 ms to 2.29 ms.
We are actually doing okay on broadcast + pack_on_lhs in the prototype. I wrote C++ code to profile memcpy, which takes 63 ms on my VM. There might be some overhead in my C++ code, because it is slower than broadcast + pack_on_lhs. But the numbers are very close, and they look okay to me.
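The C++ memcpy baseline mentioned above is not attached; a minimal sketch of that kind of measurement might look like the following (the buffer size is a placeholder, not the actual packed LHS working-set size):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
  const size_t kNumBytes = size_t{1} << 28;  // placeholder working-set size (~256 MiB)
  std::vector<char> src(kNumBytes, 1), dst(kNumBytes);
  auto t0 = std::chrono::steady_clock::now();
  std::memcpy(dst.data(), src.data(), kNumBytes);
  auto t1 = std::chrono::steady_clock::now();
  std::printf("memcpy: %.2f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());
  return 0;
}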
Here is the ASM dump for the broadcast + pack_on_lhs: https://gist.github.com/hanhanW/f53428a81521d681a6c1d5cc8f65a017
We don't fully utilize 512 bits in the pack (the registers are all ymm), so it is somewhat sub-optimal. We can try to improve that if needed, but it will require some amount of work.
@stellaraccident what is your expectation? How do we generate baseline numbers for the case?
@pashu123 please send PRs to fix (1) and (2). You already have something for (1); the prototype for (2) can be found at https://github.com/iree-org/iree/pull/17002. There might be other issues in the multi-threaded case. Let's fix these two issues first and revisit later.
(Side note: these are the microbenchmarks we've been using for this issue.)
(a dispatch that broadcasts is odd - are we sure we need that?)
It could just be snippet IR from the model. It does not matter here, because it is effectively free: we already need to pay for packing, and broadcast + pack is as fast as a single pack.
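A minimal sketch of why the fused form costs about the same per batch slice as a plain pack (the layouts, the tile size 16, and the names are assumptions, not the actual dispatch): the broadcast only means that every batch index reads the same 2-D source, so each batch slice does exactly the work of one pack. The pack is still materialized once per batch slice, though, which is the point of the follow-up comment below.

#include <cstdint>
#include <cstring>

// Pack a shared [k, n] RHS into a batched [batch][n/16][k][16] layout; assumes
// n is a multiple of 16. Per batch index, this is the same copy as a plain pack.
void broadcast_pack_rhs(const uint16_t* rhs, uint16_t* packed,
                        int batch, int k, int n) {
  for (int b = 0; b < batch; ++b)
    for (int n0 = 0; n0 < n; n0 += 16)
      for (int kk = 0; kk < k; ++kk)
        std::memcpy(&packed[((b * (n / 16) + n0 / 16) * k + kk) * 16],
                    &rhs[kk * n + n0], 16 * sizeof(uint16_t));
}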
It's not free if we could have packed 16x less - less to write and less to read :)