Performance issue of dgemm on Gold 6230R

FuncJ commented 2 years ago

Hi, I have some questions about the performance of DGEMM on an Intel(R) Xeon(R) Gold 6230R CPU. I have read your article "Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures", published at SC18, and I am a little confused about the experimental part. In your experiments, the computational efficiency of the SGEMM subroutine is about 92%. On my machine, however, the DGEMM performance looks odd: when the number of threads is large, the performance curve rises and then falls, which I find very difficult to explain. Below are some details. I really hope to get your help, thank you.

My Machine

CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103

Core topology: two sockets, 26 cores per socket, 52 cores total
SMT status: enabled, but not utilized
Max clock rate: 2.0GHz (single-core and multicore)
Peak performance:
--single-core: 64 GFLOPS (double-precision)
--multicore: 64 GFLOPS/core (double-precision)

I have fixed the frequency of the CPU at 2.0GHz with the commands:
sudo cpupower -c all frequency-set -u 2.0GHz
sudo cpupower -c all frequency-set -d 2.0GHz
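For readers checking the peak numbers, the 64 GFLOPS per-core figure can be reproduced with a short calculation. This is a minimal sketch that assumes each core has two AVX-512 FMA units working on 8 doubles per vector, which the post does not state explicitly; the function name and its parameters are illustrative only.

```python
def peak_gflops(freq_ghz, cores=1, fma_units=2, vector_doubles=8):
    """Theoretical double-precision peak in GFLOPS.

    Assumption: each FMA unit retires one vector FMA per cycle,
    and one FMA counts as 2 floating-point operations.
    """
    flops_per_cycle = fma_units * vector_doubles * 2
    return freq_ghz * flops_per_cycle * cores


print(peak_gflops(2.0))            # single core at the fixed 2.0 GHz clock
print(peak_gflops(2.0, cores=52))  # both sockets, one thread per core
```

Dividing a measured GFLOPS value by the matching peak gives the efficiency figure that is compared against the 92% quoted from the paper.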

The dgemm performance on my machine

Single-threaded (1 core) execution
Multithreaded (8 core) execution
Multithreaded (13 core) execution
Multithreaded (26 core) execution
Multithreaded (52 core) execution

vpirogov commented 2 years ago

@FuncJ,

The 92% SGEMM efficiency number you quote is a peak efficiency measured with an unspecified implementation. The efficiency you measure depends on the following factors:

DGEMM implementation
Specific sizes and shapes of matrices
Benchmark implementation
Execution conditions (like thread pinning)
Hardware configuration
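As an illustration of the thread pinning factor, the sketch below derives one CPU id per physical core from the NUMA CPU lists reported in the issue. It assumes that CPU i and CPU i + 52 are SMT siblings of the same core, which is a common Linux enumeration but is not confirmed in this thread; the helper names are hypothetical.

```python
def parse_cpu_list(spec):
    """Expand an lscpu-style list such as "0-25,52-77" into CPU ids."""
    cpus = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def one_thread_per_core(node_cpus, total_cores=52):
    """Keep one hardware thread per physical core.

    Assumption: CPU i and CPU i + total_cores are SMT siblings.
    """
    return [c for c in parse_cpu_list(node_cpus) if c < total_cores]


node0 = one_thread_per_core("0-25,52-77")
node1 = one_thread_per_core("26-51,78-103")

# Explicit place list in the syntax accepted by the OpenMP OMP_PLACES variable.
omp_places = ",".join("{%d}" % c for c in node0 + node1)
print(len(node0), len(node1))
print(omp_places)
```

If the DGEMM library under test is threaded with OpenMP (an assumption), the printed string could be exported as OMP_PLACES together with OMP_PROC_BIND=close, so that each thread stays on its own physical core.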

Looking at the discussion of this issue on the BLIS GitHub (https://github.com/flame/blis/issues/631), I tend to agree that the most likely root cause is a memory bandwidth issue. You may try to get more insights from the oneMKL developers on the oneMKL community forum.
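To make the memory bandwidth argument concrete, here is a rough roofline-style estimate. It is a sketch under idealized assumptions: each matrix crosses the memory bus only once, and the bandwidth value is a hypothetical placeholder rather than a measurement of this machine. Real implementations move more data than this lower bound, so the estimate is optimistic.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=8):
    """FLOPs per byte for C = A*B + C, assuming A and B are read once
    and C is read once and written once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + 2 * m * n)
    return flops / traffic


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Simple roofline bound: min(compute peak, bandwidth * intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


PEAK = 52 * 64.0    # GFLOPS, from the peak numbers given in the issue
BANDWIDTH = 200.0   # GB/s, hypothetical placeholder, not measured

for size in (256, 1024, 4096):
    ai = gemm_arithmetic_intensity(size, size, size)
    print(size, round(ai, 1), attainable_gflops(PEAK, BANDWIDTH, ai))
```

When the bound for a given size falls below the compute peak, adding threads cannot raise performance further, which is one way a curve can flatten or drop at high thread counts.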

FuncJ commented 2 years ago

Thanks.

oneapi-src / oneDNN

Performance issue of dgemm on Gold 6230R #1394