Hello fellow gemm optimizer enthusiast,

It would be extremely useful to provide benchmark utilities, ideally in GFlop/s or TFlop/s, to compare with other frameworks, with the CPU's theoretical peak throughput, and also with LINPACK.

The formula for an MxK matrix multiplied by a KxN matrix is:

    M*K*N*2

(the 2 accounts for 1 mul and 1 add), divided by the time taken.

Additionally you might want to check the data required, to derive the arithmetic intensity for the roofline model:

    M*K + K*N

And finally you might also want to check your theoretical peak, like this: https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/gemm_bench_config.nim#L5-L18

```nim
const
  CpuGhz = 3.5       # i9-9980XE OC, all-core turbo 4.1GHz (AVX2 4.0GHz, AVX512 3.5GHz)
  NumCpuCores = 18
  VectorWidth = 16   # float32 lanes per vector: 8 for AVX2, 16 for AVX512
  InstrCycle = 2     # instructions issued per cycle (e.g. 2 FMAs, or 1 FMA)
  FlopInstr = 2      # FLOPs per instruction (FMA = 1 add + 1 mul)

  TheoSerialPeak* = CpuGhz * VectorWidth * InstrCycle * FlopInstr
  TheoThreadedPeak* = TheoSerialPeak * NumCpuCores
```
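A minimal sketch of such a benchmark utility, in Python/NumPy for brevity (the matrix sizes and `reps` below are arbitrary placeholders, not a recommendation):

```python
import time
import numpy as np

def gemm_gflops(M: int, K: int, N: int, reps: int = 10) -> float:
    """Measure achieved GFlop/s for an MxK @ KxN float32 matmul."""
    a = np.random.rand(M, K).astype(np.float32)
    b = np.random.rand(K, N).astype(np.float32)
    a @ b  # warm-up, so lazy initialization doesn't skew timing
    start = time.perf_counter()
    for _ in range(reps):
        a @ b
    elapsed = (time.perf_counter() - start) / reps
    flops = 2 * M * K * N  # 1 mul + 1 add per (m, k, n) triple
    return flops / elapsed / 1e9

def arithmetic_intensity(M: int, K: int, N: int) -> float:
    """FLOPs per input element for the roofline model.

    Multiply the element count by 4 (bytes per float32) to get flops/byte.
    """
    return (2 * M * K * N) / (M * K + K * N)

print(f"{gemm_gflops(512, 512, 512):.1f} GFlop/s")
print(f"AI: {arithmetic_intensity(512, 512, 512):.1f} flops/element")
```

Dividing the elapsed time by `reps` and warming up first keeps one-off costs (page faults, thread-pool spin-up) out of the measurement.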
FYI, you might be interested in my own research in cache-utilization tuning, though skimming a bit I see that you tuned at the cache-associativity level while I used some heuristics:
Benchmarks of my own implementation + OpenMP against OpenBLAS/MKL and MKL-DNN (the latest oneDNN was too entangled to extract the relevant GEMM primitives):
Benchmarks with my own multithreading runtime (instead of OpenMP)
If using Intel MKL, the library path can be customized here: https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/all_gemm.nim
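For completeness, the theoretical-peak constants above, and the roofline bound they feed into, can be mirrored like this (a Python sketch; the memory-bandwidth figure is a made-up placeholder, not a measured value):

```python
# Constants mirroring the Nim config above (i9-9980XE, AVX-512).
CPU_GHZ = 3.5        # AVX-512 all-core turbo
NUM_CORES = 18
VECTOR_WIDTH = 16    # float32 lanes per AVX-512 vector
INSTR_PER_CYCLE = 2  # two FMA ports per core
FLOP_PER_INSTR = 2   # FMA = 1 mul + 1 add

serial_peak = CPU_GHZ * VECTOR_WIDTH * INSTR_PER_CYCLE * FLOP_PER_INSTR  # GFlop/s
threaded_peak = serial_peak * NUM_CORES

MEM_BW_GBS = 80.0    # hypothetical sustained DRAM bandwidth in GB/s

def roofline(ai_flops_per_byte: float) -> float:
    """Attainable GFlop/s at a given arithmetic intensity (flops/byte):
    min(compute roof, bandwidth roof)."""
    return min(threaded_peak, MEM_BW_GBS * ai_flops_per_byte)

print(f"serial peak:   {serial_peak} GFlop/s")    # 224.0
print(f"threaded peak: {threaded_peak} GFlop/s")  # 4032.0
```

Comparing a measured GFlop/s number against `roofline(ai)` tells you whether the kernel is compute-bound or memory-bound at its arithmetic intensity.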