mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Apache License 2.0

Mysterious 2x perf regression on GEMM #40

Open mratsim opened 4 years ago

mratsim commented 4 years ago

With no code or hardware change at all, after a month there is a 2x perf regression. OpenBLAS is also a bit slower (with no package update):

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.101 seconds
Average time: 9.440 ms
Stddev  time: 0.141 ms
Min     time: 9.315 ms
Max     time: 9.733 ms
Perf:         1499.508 GFLOP/s

Laser production implementation
Collected 10 samples in 0.146 seconds
Average time: 14.000 ms
Stddev  time: 25.706 ms
Min     time: 5.839 ms
Max     time: 87.161 ms
Perf:         1011.102 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 2.041 seconds
Average time: 204.123 ms
Stddev  time: 0.763 ms
Min     time: 203.362 ms
Max     time: 205.862 ms
Perf:         69.349 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.351 seconds
Average time: 34.305 ms
Stddev  time: 5.588 ms
Min     time: 30.013 ms
Max     time: 49.684 ms
Perf:         412.645 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.130 seconds
Average time: 11.230 ms
Stddev  time: 8.353 ms
Min     time: 7.725 ms
Max     time: 34.426 ms
Perf:         1260.573 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.083 seconds
Average time: 7.716 ms
Stddev  time: 7.932 ms
Min     time: 4.601 ms
Max     time: 30.078 ms
Perf:         1834.643 GFLOP/s
Mean Relative Error compared to vendor BLAS: 3.045843413929106e-06

I suspect an issue with GCC's OpenMP runtime (libgomp). MKL-DNN is linked against Intel OpenMP.
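For readers checking the numbers, the header quantities and the `Perf` lines can be reproduced from the matrix shape and the reported average times. The sketch below is in Python rather than Nim and is not the benchmark's own code. It assumes the usual 2*M*N*K operation count for GEMM and that "Required bytes" counts only the A and B inputs as float32, which is what matches the 29.491 MB printed above. All names are illustrative.

```python
# Reproduce the benchmark header for a 1920 x 1920 x 1920 float32 GEMM.
M = N = K = 1920
BYTES_PER_FLOAT32 = 4

# One multiply and one add per term of each inner product.
flops = 2 * M * N * K

# Assumption: only the two input matrices are counted, not the output.
bytes_read = (M * K + K * N) * BYTES_PER_FLOAT32

# FLOP per byte of input traffic.
intensity = flops / bytes_read


def gflops(time_ms: float) -> float:
    """Throughput in GFLOP/s for one GEMM taking time_ms milliseconds."""
    return flops / (time_ms * 1e-3) / 1e9


print(f"Required number of operations: {flops / 1e6:.3f} millions")
print(f"Required bytes: {bytes_read / 1e6:.3f} MB")
print(f"Arithmetic intensity: {intensity:.3f} FLOP/byte")

# Average times copied from the first run in this thread.
for name, avg_ms in [("OpenBLAS", 9.440), ("Laser", 14.000)]:
    print(f"{name}: {gflops(avg_ms):.1f} GFLOP/s from the average time")
```

The throughput computed this way differs from the printed `Perf` values only in the last digits, presumably because the printed average times are rounded to three decimals.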

mratsim commented 4 years ago

But running the Laser benchmark alone roughly doubles its throughput (about 2043 GFLOP/s, versus 1011 GFLOP/s in the combined run above):

$  nim cpp -r -d:release -d:openmp -d:danger --outdir:build benchmarks/gemm/gemm_bench_float32.nim
Hint: used config file '/home/beta/.choosenim/toolchains/nim-1.0.2/config/nim.cfg' [Conf]
Hint: used config file '/home/beta/Programming/Nim/laser/nim.cfg' [Conf]
Hint: operation successful (340 lines compiled; 0.025 sec total; 5.754MiB peakmem; Dangerous Release Build) [SuccessX]
Hint: /home/beta/Programming/Nim/laser/build/gemm_bench_float32  [Exec]

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Laser production implementation
Collected 10 samples in 0.076 seconds
Average time: 6.928 ms
Stddev  time: 3.038 ms
Min     time: 5.896 ms
Max     time: 15.573 ms
Perf:         2043.146 GFLOP/s

mratsim commented 4 years ago

And changing the order so that Laser runs first slows down OpenBLAS as well:

$  nim cpp -r -d:release -d:openmp -d:danger --outdir:build benchmarks/gemm/gemm_bench_float32.nim
Hint: used config file '/home/beta/.choosenim/toolchains/nim-1.0.2/config/nim.cfg' [Conf]
Hint: used config file '/home/beta/Programming/Nim/laser/nim.cfg' [Conf]
Hint: operation successful (340 lines compiled; 0.025 sec total; 5.754MiB peakmem; Dangerous Release Build) [SuccessX]
Hint: /home/beta/Programming/Nim/laser/build/gemm_bench_float32  [Exec]

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Laser production implementation
Collected 10 samples in 0.071 seconds
Average time: 6.416 ms
Stddev  time: 1.526 ms
Min     time: 5.861 ms
Max     time: 10.753 ms
Perf:         2206.263 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.151 seconds
Average time: 14.448 ms
Stddev  time: 10.255 ms
Min     time: 9.415 ms
Max     time: 37.410 ms
Perf:         979.779 GFLOP/s
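
One way to read the three runs together is to compare each benchmark's best sample with its average. The Python sketch below is not part of the benchmark suite. It only reuses the average and minimum times printed in this thread, and the labels and helper names are illustrative.

```python
# Compare average and best-case throughput across the runs quoted above.
FLOPS = 2 * 1920 ** 3  # operation count for the 1920^3 GEMM

# (label, average time in ms, minimum time in ms), copied from the outputs.
runs = [
    ("OpenBLAS, run first", 9.440, 9.315),
    ("Laser, run after OpenBLAS", 14.000, 5.839),
    ("Laser, run alone", 6.928, 5.896),
    ("Laser, run first", 6.416, 5.861),
    ("OpenBLAS, run after Laser", 14.448, 9.415),
]


def gflops(time_ms: float) -> float:
    """Throughput in GFLOP/s for one GEMM taking time_ms milliseconds."""
    return FLOPS / (time_ms * 1e-3) / 1e9


def summarize(rows):
    """Return (label, avg GFLOP/s, best GFLOP/s, avg/min time ratio) per row."""
    return [
        (label, gflops(avg_ms), gflops(min_ms), avg_ms / min_ms)
        for label, avg_ms, min_ms in rows
    ]


for label, avg_perf, best_perf, ratio in summarize(runs):
    print(f"{label:28s} avg {avg_perf:7.1f}  best {best_perf:7.1f}  avg/min {ratio:.2f}")
```

In these numbers the minimum times barely move (about 5.8 to 5.9 ms for Laser, 9.3 to 9.4 ms for OpenBLAS), while the average degrades for whichever library runs second. That pattern is consistent with a few slow outlier samples rather than a slower kernel, and it fits the suspicion of interference between threading runtimes, although ten samples per run are not enough to establish the cause.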