Closed hanhanW closed 1 year ago
I had an offline discussion with Benoit, and we found that there are more variables at play in e2e_matmul_benchmark: for example, the number of iterations varies between runs, and it runs pack LHS, pack RHS, mmt4d, and unpack together as one benchmark suite. This is not the metric that I'm looking for. I should go with pack_benchmark, unpack_benchmark, and mmt4d_benchmark instead.
Closing the issue.
I'm playing with e2e_matmul_benchmark, and noticed that the unpack kernel has different performance for different matmuls. The matmul shapes are different, but the unpack shapes are identical. Here is an example: matmul {M=384, N=128, K=128} and matmul {M=384, N=128, K=512}. Both of them unpack tensor<24x8x16x16xf32> to tensor<384x128xf32>.
Machine configuration:
sudo cpupower frequency-set --governor performance
-DIREE_ENABLE_RUNTIME_TRACING is off.
To repro:
Run benchmark for {M=384, N=128, K=128}:
The perf report shows that the unpack kernel takes 7.18% of the total time. Thus, the time spent in the unpack kernel is 133 * 0.0718 = 9.5494 us.
Run benchmark for {M=384, N=128, K=512}:
The perf report shows that the unpack kernel takes 3.01% of the total time. Thus, the time spent in the unpack kernel is 576 * 0.0301 = 17.3376 us.
One takes 9.55 us, and the other takes 17.34 us. Did I do something wrong, or is it a bug?
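
For reference, the estimate above can be reproduced with a small script. This is only a sketch of the arithmetic: it assumes that 133 and 576 are the total benchmark times in microseconds for the two runs, and that the packed tensor uses 24x8 outer dimensions with 16x16 inner tiles. The helper name `kernel_time_us` is made up for illustration and is not part of IREE.

```python
def kernel_time_us(total_us: float, share_percent: float) -> float:
    """Estimate the time spent in one kernel from its share in a perf report."""
    return total_us * share_percent / 100.0


# (total time in us, unpack share in %) as quoted in this issue.
# Assumption: 133 and 576 are total times per benchmark run, in microseconds.
runs = {
    "M=384, N=128, K=128": (133.0, 7.18),
    "M=384, N=128, K=512": (576.0, 3.01),
}

estimates = {
    name: kernel_time_us(total_us, share)
    for name, (total_us, share) in runs.items()
}
for name, time_us in estimates.items():
    print(f"{name}: unpack takes about {time_us:.4f} us")

# How much slower the unpack looks in the K=512 run than in the K=128 run.
ratio = estimates["M=384, N=128, K=512"] / estimates["M=384, N=128, K=128"]
print(f"ratio: {ratio:.2f}x")

# Sanity check that both runs unpack the same amount of data.
# Assumption: layout is (outer_m, outer_n, tile_m, tile_n).
packed_shape = (24, 8, 16, 16)
unpacked_shape = (
    packed_shape[0] * packed_shape[2],
    packed_shape[1] * packed_shape[3],
)
print(f"unpacked shape: {unpacked_shape}")
```

Since the unpacked shape is the same in both runs, the difference in the estimate comes from the totals and percentages, not from the amount of data the unpack kernel moves.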