iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Unpack ukernel with same input/output shape has different performance for different matmuls #12918

Closed · hanhanW closed this 1 year ago

hanhanW commented 1 year ago

I'm playing with e2e_matmul_benchmark and noticed that the unpack ukernel has different performance for different matmuls, even when the matmul shapes differ but the unpack shapes are identical. For example, matmul {M=384, N=128, K=128} and matmul {M=384, N=128, K=512} both unpack tensor<24x8x16x16xf32> to tensor<384x128xf32> (the MxN result shape is the same; only K differs).
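For context, the unpack here re-interleaves a grid of 16x16 tiles back into a row-major matrix. A minimal NumPy sketch of the same layout transform (hypothetical illustration, not the ukernel code; the tile/dim ordering is assumed and the actual IREE op may permute differently):

```python
import numpy as np

# Packed layout: a 24x8 grid of 16x16 tiles, i.e. tensor<24x8x16x16xf32>.
packed = np.arange(24 * 8 * 16 * 16, dtype=np.float32).reshape(24, 8, 16, 16)

# Unpack to row-major tensor<384x128xf32>: move each tile's inner rows next
# to its outer row index, then flatten (assumed inner-dim order).
unpacked = packed.transpose(0, 2, 1, 3).reshape(384, 128)

# Element (17, 33) lands in tile (1, 2) at inner offset (1, 1).
assert unpacked[17, 33] == packed[1, 2, 1, 1]
```

Note that the transform touches every element exactly once, so its cost should depend only on the 384x128 output size, which is why identical unpack shapes taking different times is surprising.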

Machine configuration:

To repro:

Run benchmark for {M=384, N=128, K=128}:

❯ perf record -g -o /tmp/perf.data build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark --benchmark_min_time=1 --M=384 --N=128 --K=128
2023-04-04T10:39:12-07:00
Running build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark
Run on (72 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x36)
  L1 Instruction 32 KiB (x36)
  L2 Unified 1024 KiB (x36)
  L3 Unified 25344 KiB (x2)
Load Average: 0.22, 0.40, 0.98
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_e2e_matmul_f32f32f32_384x128x128_host/real_time        133 us          133 us        16383 items_per_second=94.843G/s

The perf report shows that the unpack kernel accounts for 7.18% of the total, so the unpack kernel's cost is roughly 133 * 0.0718 = 9.5494 us.

                           --7.25%--iree_uk_unpack
                                     |
                                      --7.18%--iree_uk_unpack_tile_16x16_x32_x86_64_avx512_base_direct
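As a side check, the reported items_per_second counter appears consistent with counting matmul FLOPs (2*M*N*K); a quick sanity check, assuming that interpretation of the counter:

```python
# Assumption: items_per_second = 2*M*N*K FLOPs divided by wall time.
def gflops_per_s(M, N, K, time_us):
    return 2 * M * N * K / (time_us * 1e-6) / 1e9

# First run: {M=384, N=128, K=128} at 133 us.
print(gflops_per_s(384, 128, 128, 133))  # ~94.6, close to the reported 94.843G/s
```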

Run benchmark for {M=384, N=128, K=512}:

❯ perf record -g -o /tmp/perf.data build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark --benchmark_min_time=1 --M=384 --N=128 --K=512
2023-04-04T10:41:38-07:00
Running build/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark
Run on (72 X 1200.2 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x36)
  L1 Instruction 32 KiB (x36)
  L2 Unified 1024 KiB (x36)
  L3 Unified 25344 KiB (x2)
Load Average: 0.26, 0.36, 0.88
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_e2e_matmul_f32f32f32_384x512x128_host/real_time        576 us          576 us         4095 items_per_second=87.3391G/s

The perf report shows that the unpack kernel accounts for 3.01% of the total, so the unpack kernel's cost is roughly 576 * 0.0301 = 17.3376 us.

                           --3.02%--iree_uk_unpack
                                     |
                                      --3.01%--iree_uk_unpack_tile_16x16_x32_x86_64_avx512_base_direct

One takes 9.5 us and the other takes 17.34 us, even though both unpack the same tensor<24x8x16x16xf32>. Did I do something wrong, or is this a bug?

hanhanW commented 1 year ago

I had an offline discussion with Benoit, and we found that e2e_matmul_benchmark has more sources of variance than I expected: the number of iterations varies between runs, and it benchmarks pack LHS, pack RHS, mmt4d, and unpack together as one suite. This is not the metric I'm looking for; I should use pack_benchmark, unpack_benchmark, and mmt4d_benchmark instead.

Closing the issue.