mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Apache License 2.0

[GEMM] Significant performance regression (divided by 5) #32

Closed: mratsim closed this issue 5 years ago

mratsim commented 5 years ago

Since #28, which fixed #27, another strange regression has appeared, dividing performance by 5:

From a March 23 build

$  ./build/gemm_f32_omp

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Reference loop
Collected 10 samples in 10.421 seconds
Average time: 1041.539 ms
Stddev  time: 3.983 ms
Min     time: 1035.329 ms
Max     time: 1047.674 ms
Perf:         13.591 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.091 seconds
Average time: 8.438 ms
Stddev  time: 6.319 ms
Min     time: 6.240 ms
Max     time: 26.393 ms
Perf:         1677.596 GFLOP/s

Laser production implementation
Collected 10 samples in 0.087 seconds
Average time: 8.035 ms
Stddev  time: 4.186 ms
Min     time: 6.517 ms
Max     time: 19.913 ms
Perf:         1761.855 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 1.900 seconds
Average time: 189.987 ms
Stddev  time: 2.893 ms
Min     time: 188.794 ms
Max     time: 198.044 ms
Perf:         74.509 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.368 seconds
Average time: 36.043 ms
Stddev  time: 5.048 ms
Min     time: 34.275 ms
Max     time: 50.364 ms
Perf:         392.748 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.105 seconds
Average time: 9.758 ms
Stddev  time: 5.933 ms
Min     time: 7.715 ms
Max     time: 26.624 ms
Perf:         1450.731 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.088 seconds
Average time: 8.154 ms
Stddev  time: 10.128 ms
Min     time: 4.733 ms
Max     time: 36.938 ms
Perf:         1736.020 GFLOP/s
Mean Relative Error compared to vendor BLAS: 3.045843413929106e-06
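The derived figures in the output above follow directly from the problem size; a quick sanity check, assuming the usual 2·M·N·K FLOP convention for GEMM and counting only the A and B inputs (float32) toward "Required bytes":

```python
# Sanity-check the benchmark's derived figures for a 1920x1920x1920 SGEMM.
M = N = K = 1920

flop = 2 * M * N * K                  # each multiply-accumulate counted as 2 FLOP
print(flop / 1e6)                     # 14155.776 "millions" of operations

bytes_in = 2 * M * K * 4              # A and B matrices, 4 bytes per float32
print(bytes_in / 1e6)                 # 29.4912 MB
print(flop / bytes_in)                # 480.0 FLOP/byte arithmetic intensity

avg_s = 1.041539                      # reference-loop average time in seconds
print(flop / avg_s / 1e9)             # ~13.591 GFLOP/s, matching the report
```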

From a recent rebuild

$  ./build/gemm_omp_f32

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Laser production implementation
Collected 10 samples in 0.555 seconds
Average time: 54.917 ms
Stddev  time: 5.027 ms
Min     time: 53.250 ms
Max     time: 69.218 ms
Perf:         257.765 GFLOP/s
mratsim commented 5 years ago

The issue is in Nim upstream: rebuilding Laser's current master with the Nim OpenMP commit (https://github.com/nim-lang/Nim/commit/25649616ea5b6aba575149df3f9943f48a5ece31) brings back full performance.
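A sketch of that verification step (the commit hash is from the comment above; the checkout location and benchmark path are illustrative and may differ in this repo):

```shell
# Bootstrap the Nim compiler at the known-good OpenMP commit.
git clone https://github.com/nim-lang/Nim.git
cd Nim
git checkout 25649616ea5b6aba575149df3f9943f48a5ece31
sh build_all.sh

# Rebuild and rerun the Laser benchmark with that compiler
# (benchmark path is a guess for illustration).
cd ../laser
../Nim/bin/nim c -r -d:release --out:build/gemm_f32_omp \
    benchmarks/gemm/gemm_bench_float32.nim
```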

mratsim commented 5 years ago

After bisecting, the cause is the split of -d:release into -d:release and -d:danger (https://github.com/nim-lang/Nim/pull/11385).
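For context: after that split, -d:release alone keeps runtime checks (bounds, overflow, etc.) enabled, and only -d:danger removes them, which plausibly explains the slowdown in a tight GEMM kernel. A sketch of the resulting build-flag change (the benchmark path is illustrative):

```shell
# Before Nim PR #11385: -d:release disabled all runtime checks.
nim c -d:release benchmarks/gemm/gemm_bench_float32.nim

# After the split: hot numeric code must be built with -d:danger
# to get the old check-free codegen back.
nim c -d:danger benchmarks/gemm/gemm_bench_float32.nim
```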