iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Unpack ukernel with same input/output shape has different performance for different matmuls #12918

Closed · hanhanW closed this 1 year ago

hanhanW commented 1 year ago

I'm playing with e2e_matmul_benchmark and noticed that the unpack ukernel has different performance for different matmuls, even when the matmul shapes differ but the unpack shapes are identical. For example, matmul {M=384, N=128, K=128} and matmul {M=384, N=128, K=512} both unpack tensor<24x8x16x16xf32> to tensor<384x128xf32> (the MxN result shape is the same; only K differs).
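For context, the unpack here re-interleaves a grid of 16x16 tiles back into a row-major matrix. A minimal NumPy sketch of the same layout transform (hypothetical illustration, not the ukernel code; the tile/dim ordering is assumed and the actual IREE op may permute differently):

```python
import numpy as np

# Packed layout: a 24x8 grid of 16x16 tiles, i.e. tensor<24x8x16x16xf32>.
packed = np.arange(24 * 8 * 16 * 16, dtype=np.float32).reshape(24, 8, 16, 16)

# Unpack to row-major tensor<384x128xf32>: move each tile's inner rows next
# to its outer row index, then flatten (assumed inner-dim order).
unpacked = packed.transpose(0, 2, 1, 3).reshape(384, 128)

# Element (17, 33) lands in tile (1, 2) at inner offset (1, 1).
assert unpacked[17, 33] == packed[1, 2, 1, 1]
```

Note that the transform touches every element exactly once, so its cost should depend only on the 384x128 output size, which is why identical unpack shapes taking different times is surprising.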

Machine configuration:

To repro:

Run benchmark for {M=384, N=128, K=128}:

❯ perf record -g -o /tmp/perf.data build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark --benchmark_min_time=1 --M=384 --N=128 --K=128
2023-04-04T10:39:12-07:00
Running build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark
Run on (72 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x36)
  L1 Instruction 32 KiB (x36)
  L2 Unified 1024 KiB (x36)
  L3 Unified 25344 KiB (x2)
Load Average: 0.22, 0.40, 0.98
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_e2e_matmul_f32f32f32_384x128x128_host/real_time        133 us          133 us        16383 items_per_second=94.843G/s

The perf report shows that the unpack kernel accounts for 7.18% of the total, so the unpack kernel's cost is roughly 133 * 0.0718 = 9.5494 us.

                           --7.25%--iree_uk_unpack
                                     |
                                      --7.18%--iree_uk_unpack_tile_16x16_x32_x86_64_avx512_base_direct
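As a side check, the reported items_per_second counter appears consistent with counting matmul FLOPs (2*M*N*K); a quick sanity check, assuming that interpretation of the counter:

```python
# Assumption: items_per_second = 2*M*N*K FLOPs divided by wall time.
def gflops_per_s(M, N, K, time_us):
    return 2 * M * N * K / (time_us * 1e-6) / 1e9

# First run: {M=384, N=128, K=128} at 133 us.
print(gflops_per_s(384, 128, 128, 133))  # ~94.6, close to the reported 94.843G/s
```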

Run benchmark for {M=384, N=128, K=512}:

❯ perf record -g -o /tmp/perf.data build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark --benchmark_min_time=1 --M=384 --N=128 --K=512
2023-04-04T10:41:38-07:00
Running build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark
Run on (72 X 1200.2 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x36)
  L1 Instruction 32 KiB (x36)
  L2 Unified 1024 KiB (x36)
  L3 Unified 25344 KiB (x2)
Load Average: 0.26, 0.36, 0.88
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_e2e_matmul_f32f32f32_384x512x128_host/real_time        576 us          576 us         4095 items_per_second=87.3391G/s

The perf report shows that the unpack kernel accounts for 3.01% of the total, so the unpack kernel's cost is roughly 576 * 0.0301 = 17.3376 us.

                           --3.02%--iree_uk_unpack
                                     |
                                      --3.01%--iree_uk_unpack_tile_16x16_x32_x86_64_avx512_base_direct

One takes 9.5 us and the other takes 17.34 us, even though both unpack the same tensor<24x8x16x16xf32>. Did I do something wrong, or is this a bug?

hanhanW commented 1 year ago

I had an offline discussion with Benoit, and we found that e2e_matmul_benchmark has more sources of variance than I expected: the number of iterations varies between runs, and it benchmarks pack LHS, pack RHS, mmt4d, and unpack together as one suite. This is not the metric I'm looking for; I should use pack_benchmark, unpack_benchmark, and mmt4d_benchmark instead.

Closing the issue.