Hello fellow gemm optimizer enthusiast,

It would be extremely useful to provide benchmark utilities, ideally in GFlop/s or TFlop/s, to compare with other frameworks, with the CPU's theoretical peak throughput, and also with LINPACK.

The formula for an MxK matrix multiplied by a KxN matrix is:

    M*K*N*2

(the 2 accounts for 1 mul and 1 add), divided by the time taken.

Additionally you might want to check the data required, to derive the arithmetic intensity for the roofline model:

    M*K + K*N

And finally you might also want to check your theoretical peak, like this: https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/gemm_bench_config.nim#L5-L18

```nim
const
  CpuGhz = 3.5       # i9-9980XE OC, all-core turbo 4.1GHz (AVX2 4.0GHz, AVX512 3.5GHz)
  NumCpuCores = 18
  VectorWidth = 16   # float32 lanes per vector: 8 for AVX2, 16 for AVX512
  InstrCycle = 2     # instructions issued per cycle (e.g. 2 FMAs, or 1 FMA)
  FlopInstr = 2      # FLOPs per instruction (FMA = 1 add + 1 mul)

  TheoSerialPeak* = CpuGhz * VectorWidth * InstrCycle * FlopInstr
  TheoThreadedPeak* = TheoSerialPeak * NumCpuCores
```
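A minimal sketch of such a benchmark utility, in Python/NumPy for brevity (the matrix sizes and `reps` below are arbitrary placeholders, not a recommendation):

```python
import time
import numpy as np

def gemm_gflops(M: int, K: int, N: int, reps: int = 10) -> float:
    """Measure achieved GFlop/s for an MxK @ KxN float32 matmul."""
    a = np.random.rand(M, K).astype(np.float32)
    b = np.random.rand(K, N).astype(np.float32)
    a @ b  # warm-up, so lazy initialization doesn't skew timing
    start = time.perf_counter()
    for _ in range(reps):
        a @ b
    elapsed = (time.perf_counter() - start) / reps
    flops = 2 * M * K * N  # 1 mul + 1 add per (m, k, n) triple
    return flops / elapsed / 1e9

def arithmetic_intensity(M: int, K: int, N: int) -> float:
    """FLOPs per input element for the roofline model.

    Multiply the element count by 4 (bytes per float32) to get flops/byte.
    """
    return (2 * M * K * N) / (M * K + K * N)

print(f"{gemm_gflops(512, 512, 512):.1f} GFlop/s")
print(f"AI: {arithmetic_intensity(512, 512, 512):.1f} flops/element")
```

Dividing the elapsed time by `reps` and warming up first keeps one-off costs (page faults, thread-pool spin-up) out of the measurement.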
FYI, you might be interested in my own research in cache-utilization tuning, though skimming a bit I see that you tuned at the cache-associativity level while I used some heuristics:
Benchmarks of my own implementation + OpenMP against OpenBLAS/MKL and MKL-DNN (the latest oneDNN was too entangled to extract the relevant GEMM primitives):
Benchmarks with my own multithreading runtime (instead of OpenMP)
If using Intel MKL, the library path can be customized here: https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/all_gemm.nim
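For completeness, the theoretical-peak constants above, and the roofline bound they feed into, can be mirrored like this (a Python sketch; the memory-bandwidth figure is a made-up placeholder, not a measured value):

```python
# Constants mirroring the Nim config above (i9-9980XE, AVX-512).
CPU_GHZ = 3.5        # AVX-512 all-core turbo
NUM_CORES = 18
VECTOR_WIDTH = 16    # float32 lanes per AVX-512 vector
INSTR_PER_CYCLE = 2  # two FMA ports per core
FLOP_PER_INSTR = 2   # FMA = 1 mul + 1 add

serial_peak = CPU_GHZ * VECTOR_WIDTH * INSTR_PER_CYCLE * FLOP_PER_INSTR  # GFlop/s
threaded_peak = serial_peak * NUM_CORES

MEM_BW_GBS = 80.0    # hypothetical sustained DRAM bandwidth in GB/s

def roofline(ai_flops_per_byte: float) -> float:
    """Attainable GFlop/s at a given arithmetic intensity (flops/byte):
    min(compute roof, bandwidth roof)."""
    return min(threaded_peak, MEM_BW_GBS * ai_flops_per_byte)

print(f"serial peak:   {serial_peak} GFlop/s")    # 224.0
print(f"threaded peak: {threaded_peak} GFlop/s")  # 4032.0
```

Comparing a measured GFlop/s number against `roofline(ai)` tells you whether the kernel is compute-bound or memory-bound at its arithmetic intensity.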