@FuncJ,
The 92% SGEMM efficiency number you quote is a peak efficiency, measured with an unspecified implementation. The efficiency you measure depends on the following factors:
Looking at the discussion of this issue on the BLIS GitHub (https://github.com/flame/blis/issues/631), I tend to agree that the most likely root cause is a memory bandwidth limitation. You may be able to get more insight from the oneMKL developers on the oneMKL community forum.
thanks.
Hi, I have some questions about DGEMM performance on an Intel(R) Xeon(R) Gold 6230R CPU. I have read your article "Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures", published at SC18, and I am a little confused by the experimental section. In your experiments, the computational efficiency of the SGEMM subroutine is about 92%. On my machine, however, DGEMM performance looks odd: when the number of threads is large, the performance curve rises and then falls, which I find very difficult to explain. Some details are below. I would really appreciate your help, thank you.
My Machine

CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103
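When pinning threads for runs like these, it helps to know which logical CPU ids belong to which NUMA node. The sketch below parses the two NUMA CPU lists quoted above. It assumes that the first range on each node (0-25 and 26-51) lists one logical CPU per physical core and that the second range holds the SMT siblings. That is a common Linux enumeration but not guaranteed, so check it against /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The helper names are illustrative only.

```python
def parse_cpu_list(spec):
    """Expand an lscpu-style list such as '0-25,52-77' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def first_threads(cpus, cores_per_socket):
    """Assumption: the first `cores_per_socket` ids are distinct physical cores."""
    return cpus[:cores_per_socket]


def to_cpu_string(cpus):
    """Comma-separated id list, the form accepted by `taskset -c`."""
    return ",".join(str(c) for c in cpus)


node0 = parse_cpu_list("0-25,52-77")
node1 = parse_cpu_list("26-51,78-103")

node0_physical = first_threads(node0, 26)
node1_physical = first_threads(node1, 26)

# A 13-thread run confined to socket 0, one thread per physical core.
print(to_cpu_string(node0_physical[:13]))
```

The resulting string can be passed to `taskset -c` or `numactl --physcpubind` so that an 8-, 13-, or 26-thread run stays on one socket and does not share physical cores.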
Core topology: two sockets, 26 cores per socket, 52 cores total
SMT status: enabled, but not utilized
Max clock rate: 2.0GHz (single-core and multicore)
Peak performance:
--single-core: 64 GFLOPS (double-precision)
--multicore: 64 GFLOPS/core (double-precision)

I have fixed the CPU frequency at 2.0GHz with the following commands:
sudo cpupower -c all frequency-set -u 2.0GHz
sudo cpupower -c all frequency-set -d 2.0GHz
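The 64 GFLOPS per-core figure can be reproduced with a short calculation. This is a sketch that assumes two AVX-512 FMA units per core, 8 doubles per 512-bit vector, and 2 flops per fused multiply-add. These assumptions match the quoted peak but are not stated in the post. The function names are illustrative.

```python
def peak_gflops(freq_ghz, cores=1, simd_doubles=8, fma_units=2, flops_per_fma=2):
    """Theoretical double-precision peak in GFLOPS.

    Assumed defaults: AVX-512 (8 doubles per vector), 2 FMA units per core,
    and 2 flops (multiply + add) per FMA lane per cycle.
    """
    return freq_ghz * simd_doubles * fma_units * flops_per_fma * cores


def efficiency(measured_gflops, freq_ghz, cores):
    """Fraction of theoretical peak reached by a measured DGEMM rate."""
    return measured_gflops / peak_gflops(freq_ghz, cores)


single_core_peak = peak_gflops(2.0)            # one core at the fixed 2.0 GHz
one_socket_peak = peak_gflops(2.0, cores=26)   # one socket
full_machine_peak = peak_gflops(2.0, cores=52) # both sockets

print(single_core_peak, one_socket_peak, full_machine_peak)
```

Dividing a measured DGEMM rate by the peak for the same thread count gives the efficiency to compare against the 92% figure. Note that heavy AVX-512 use may run below the configured 2.0GHz, so the true achievable peak may be lower than this estimate.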
The dgemm performance on my machine
Single-threaded (1 core) execution
Multithreaded (8 core) execution
Multithreaded (13 core) execution
Multithreaded (26 core) execution
Multithreaded (52 core) execution