sarah-ek / gemm

MIT License

Provide benchmark with throughput units (GFlops/s TFlops/s) #26

Open mratsim opened 5 months ago

mratsim commented 5 months ago

Hello fellow gemm optimizer enthusiast,

It would be extremely useful to provide benchmark utilities that report throughput, ideally in GFlop/s or TFlop/s, to compare against other frameworks, against the CPU's theoretical peak throughput, and against LINPACK.

The formula for multiplying an MxK matrix by a KxN matrix is 2*M*N*K FLOPs: each of the M*N output elements requires K multiply-adds, i.e. K multiplications and K additions.
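As a minimal sketch of how a benchmark could turn that FLOP count into a throughput figure (in Python for illustration; the function name and timing source are my own, not from this repo):

```python
import time

def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an MxK by KxN GEMM that took `seconds` to run.

    GEMM performs 2*m*n*k floating-point operations: k multiply-adds
    per output element, and one FMA counts as 2 FLOPs.
    """
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9  # GFlop/s

# Example: a 1024x1024x1024 GEMM finishing in 10 ms
print(round(gemm_gflops(1024, 1024, 1024, 0.010), 1))  # 214.7 GFlop/s
```

In a real harness, `seconds` would come from a monotonic timer (e.g. `time.perf_counter()`) wrapped around the GEMM call, averaged over several warm runs.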

Additionally, you might want to collect the data needed to derive arithmetic intensity (FLOPs per byte of memory traffic) for the roofline model.
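A sketch of that derivation, under the simplifying assumption that A, B, and C each cross the memory interface exactly once (a best-case upper bound; real traffic depends on cache blocking):

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, dtype_bytes: int = 4) -> float:
    """Upper-bound arithmetic intensity (FLOPs/byte) for GEMM, assuming
    A (m*k), B (k*n) and C (m*n) are each moved to/from memory once."""
    flops = 2.0 * m * n * k
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)
    return flops / bytes_moved

# Large square float32 GEMM: intensity grows linearly with the dimension,
# which is why big GEMMs are compute-bound on the roofline.
print(round(gemm_arithmetic_intensity(1024, 1024, 1024), 1))  # 170.7
```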

And finally you might also want to check your theoretical peak like: https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/gemm_bench_config.nim#L5-L18

const
  CpuGhz = 3.5      # i9-9980XE OC All turbo 4.1GHz (AVX2 4.0GHz, AVX512 3.5GHz)
  NumCpuCores = 18
  VectorWidth = 16  # 8 float32 for AVX2, 16 for AVX512
  InstrCycle = 2    # How many instructions per cycle, (2xFMAs or 1xFMA for example)
  FlopInstr = 2     # How many FLOP per instr (FMAs = 1 add + 1 mul)

  TheoSerialPeak* = CpuGhz * VectorWidth * InstrCycle * FlopInstr
  TheoThreadedPeak* = TheoSerialPeak * NumCpuCores
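Evaluating the constants from the Nim snippet above (i9-9980XE, AVX-512 at 3.5 GHz), here as a direct Python translation:

```python
# Same constants as the Nim snippet above (i9-9980XE, AVX-512 turbo 3.5 GHz)
CPU_GHZ = 3.5
NUM_CPU_CORES = 18
VECTOR_WIDTH = 16   # float32 lanes per AVX-512 register
INSTR_CYCLE = 2     # two FMA issues per cycle per core
FLOP_INSTR = 2      # one FMA = 1 mul + 1 add

theo_serial_peak = CPU_GHZ * VECTOR_WIDTH * INSTR_CYCLE * FLOP_INSTR
theo_threaded_peak = theo_serial_peak * NUM_CPU_CORES

print(theo_serial_peak)    # 224.0 GFlop/s per core
print(theo_threaded_peak)  # 4032.0 GFlop/s across 18 cores
```

Measured GFlop/s divided by these peaks gives the fraction of theoretical throughput the kernel achieves, which is the most portable way to compare GEMM implementations across machines.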

FYI, you might be interested in my own research on cache utilization tuning, though from a quick skim I see that you tuned at the cache-associativity level while I used heuristics:

Benchmarks of my own implementation (with OpenMP) against OpenBLAS/MKL and MKL-DNN (the latest oneDNN was too entangled to extract the relevant GEMM primitives):

Benchmarks with my own multithreading runtime (instead of OpenMP):

sarah-ek commented 5 months ago

thanks for the suggestion. I'll set up something for that soon