Open mratsim opened 4 years ago
But running laser alone actually brings great improvements:
$ nim cpp -r -d:release -d:openmp -d:danger --outdir:build benchmarks/gemm/gemm_bench_float32.nim
Hint: used config file '/home/beta/.choosenim/toolchains/nim-1.0.2/config/nim.cfg' [Conf]
Hint: used config file '/home/beta/Programming/Nim/laser/nim.cfg' [Conf]
Hint: operation successful (340 lines compiled; 0.025 sec total; 5.754MiB peakmem; Dangerous Release Build) [SuccessX]
Hint: /home/beta/Programming/Nim/laser/build/gemm_bench_float32 [Exec]
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.
Laser production implementation
Collected 10 samples in 0.076 seconds
Average time: 6.928 ms
Stddev time: 3.038 ms
Min time: 5.896 ms
Max time: 15.573 ms
Perf: 2043.146 GFLOP/s
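For readers checking the figures, the derived numbers in this output can be reproduced from the matrix shape alone. The sketch below is an assumption inferred from the printed values, not taken from the benchmark source: it counts 2*M*N*K floating-point operations and counts only the two float32 input matrices as required bytes. The function names are hypothetical.

```python
def gemm_metrics(m, n, k, bytes_per_elem=4):
    """Return (flops, bytes, arithmetic intensity) for C[m,n] = A[m,k] * B[k,n].

    Assumption: one multiply and one add per inner-product term, and only
    the A and B inputs counted as required bytes (float32 = 4 bytes each).
    """
    flops = 2 * m * n * k
    nbytes = (m * k + k * n) * bytes_per_elem
    return flops, nbytes, flops / nbytes


def gflops(flops, seconds):
    """Throughput in GFLOP/s for a run that took `seconds`."""
    return flops / seconds / 1e9


flops, nbytes, intensity = gemm_metrics(1920, 1920, 1920)
print(f"Required number of operations: {flops / 1e6:.3f} millions")
print(f"Required bytes: {nbytes / 1e6:.3f} MB")
print(f"Arithmetic intensity: {intensity:.3f} FLOP/byte")
# Uses the rounded average time from the log, so the result differs
# slightly from the reported "Perf" line.
print(f"Perf from average time: {gflops(flops, 6.928e-3):.1f} GFLOP/s")
```

Under these assumptions the operation count, byte count, and arithmetic intensity match the log exactly, and the throughput matches to within the rounding of the printed average time.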
And changing the order can slow down OpenBLAS as well:
$ nim cpp -r -d:release -d:openmp -d:danger --outdir:build benchmarks/gemm/gemm_bench_float32.nim
Hint: used config file '/home/beta/.choosenim/toolchains/nim-1.0.2/config/nim.cfg' [Conf]
Hint: used config file '/home/beta/Programming/Nim/laser/nim.cfg' [Conf]
Hint: operation successful (340 lines compiled; 0.025 sec total; 5.754MiB peakmem; Dangerous Release Build) [SuccessX]
Hint: /home/beta/Programming/Nim/laser/build/gemm_bench_float32 [Exec]
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.
Laser production implementation
Collected 10 samples in 0.071 seconds
Average time: 6.416 ms
Stddev time: 1.526 ms
Min time: 5.861 ms
Max time: 10.753 ms
Perf: 2206.263 GFLOP/s
OpenBLAS benchmark
Collected 10 samples in 0.151 seconds
Average time: 14.448 ms
Stddev time: 10.255 ms
Min time: 9.415 ms
Max time: 37.410 ms
Perf: 979.779 GFLOP/s
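Because the OpenBLAS run has a standard deviation close to its average, it may help to compare throughput from both the average and the minimum sample time. This is only a rough sketch using the timings printed above; the dictionary layout and names are hypothetical, and the minimum-time figure is an alternative reading of the data, not something the benchmark itself reports.

```python
OPS = 14155.776e6      # "Required number of operations" from the log
PEAK_MULTI = 4032.0    # "Theoretical peak multi" from the log, in GFLOP/s

# Average and minimum sample times (ms) copied from the run above.
runs = {
    "Laser": {"avg_ms": 6.416, "min_ms": 5.861},
    "OpenBLAS": {"avg_ms": 14.448, "min_ms": 9.415},
}


def gflops(ops, ms):
    """Throughput in GFLOP/s for `ops` operations completed in `ms` milliseconds."""
    return ops / (ms * 1e-3) / 1e9


summary = {
    name: {
        "avg_gflops": gflops(OPS, t["avg_ms"]),
        "best_gflops": gflops(OPS, t["min_ms"]),
    }
    for name, t in runs.items()
}

for name, s in summary.items():
    share = 100 * s["avg_gflops"] / PEAK_MULTI
    print(f"{name}: avg {s['avg_gflops']:.1f} GFLOP/s ({share:.1f}% of peak), "
          f"best sample {s['best_gflops']:.1f} GFLOP/s")

# Slowdown of OpenBLAS relative to Laser, by average and by best sample.
avg_ratio = runs["OpenBLAS"]["avg_ms"] / runs["Laser"]["avg_ms"]
min_ratio = runs["OpenBLAS"]["min_ms"] / runs["Laser"]["min_ms"]
print(f"OpenBLAS/Laser time ratio: {avg_ratio:.2f}x (avg), {min_ratio:.2f}x (min)")
```

On these numbers the gap is smaller when measured on the best sample than on the average, which would be consistent with a few slow outlier iterations rather than a uniformly slower kernel.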
With no code or hardware change at all, after a month there is a 2x perf regression. OpenBLAS is also a bit slower (with no package update).
I suspect an issue with the GNU OpenMP runtime (libgomp). (MKL-DNN is linked to Intel OpenMP.)