Closed danieldk closed 3 years ago
Can you please run the speed comparison between MKL 2020.0 (the last version that still supports MKL_DEBUG_CPU_TYPE) and 2020.3, which should have some perf regressions fixed? (see https://software.intel.com/content/www/us/en/develop/articles/intel-math-kernel-library-release-notes-and-new-features.html )
The latest version I see there is still 2020 update 2?
(I tested 2020.2.254)
It's available for me now as well. First of all, MKL_DEBUG_CPU_TYPE still does not work. [sd]gemm performance on 2020.3 (same programs as above), best of three runs:
$ ./mt-dgemm 4000 | grep GF
GFLOP/s rate: 364.460444 GF/s
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 250.027763 GF/s
Now while overriding mkl_serv_intel_cpu_true to always detect an Intel CPU, which results in using AVX2 code paths on the Ryzen 3700X:
$ LD_PRELOAD=libfakeintel.so ./mt-dgemm 4000 | grep GF
GFLOP/s rate: 431.414968 GF/s
$ LD_PRELOAD=~/Desktop/models/german/foo/libintel.so ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 858.677202 GF/s
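The override library used in the runs above can be very small. The following is a minimal sketch, not necessarily the exact library used here. It assumes that MKL consults a parameterless function named mkl_serv_intel_cpu_true and treats a nonzero result as "this is an Intel CPU"; the signature and the file name fakeintel.c are assumptions for illustration.

```c
/* fakeintel.c (hypothetical file name)
 *
 * Assumption: MKL calls mkl_serv_intel_cpu_true() without arguments and
 * interprets a nonzero return value as "running on an Intel CPU".
 *
 * Build as a shared library, for example:
 *   gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 */
int mkl_serv_intel_cpu_true(void) {
    /* Always claim to be an Intel CPU so the AVX2 code paths are selected. */
    return 1;
}
```

Preloading the resulting library with LD_PRELOAD, as in the commands above, lets the dynamic linker resolve the symbol to this definition before MKL's own, provided MKL is linked dynamically.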
Summary:
- MKL_DEBUG_CPU_TYPE seems to be permanently gone.
- There is a Zen kernel for dgemm.
- There is no Zen kernel for sgemm.
- dgemm/sgemm can be fast when forcing detection of an Intel CPU.
- The AVX2 dgemm kernel is faster on this Ryzen than the Zen dgemm kernel.
I guess the best MKL options are: revert to MKL 2020.0 or override the Intel CPU check.
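To put the 2020.3 numbers in perspective, here is a small sketch that turns the reported rates into speedup factors and approximate times per multiplication. It assumes the conventional 2*N^3 flop count for an N x N GEMM; the benchmark itself may count slightly differently. The function names are made up for this example.

```c
#include <stdio.h>

/* Approximate seconds for one N x N x N GEMM at a given GFLOP/s rate.
 * Assumption: 2*N^3 floating-point operations per multiplication. */
double gemm_seconds(double n, double gflops) {
    return 2.0 * n * n * n / (gflops * 1e9);
}

/* Speedup of the forced-Intel code path over the default code path. */
double speedup(double forced_intel_gflops, double default_gflops) {
    return forced_intel_gflops / default_gflops;
}

/* Print the comparison for the MKL 2020.3 rates reported above (N = 4000). */
void print_summary(void) {
    printf("dgemm: %.2fx faster, %.3f s -> %.3f s\n",
           speedup(431.414968, 364.460444),
           gemm_seconds(4000, 364.460444), gemm_seconds(4000, 431.414968));
    printf("sgemm: %.2fx faster, %.3f s -> %.3f s\n",
           speedup(858.677202, 250.027763),
           gemm_seconds(4000, 250.027763), gemm_seconds(4000, 858.677202));
}
```

With these rates, the forced-Intel path is roughly 1.18x faster for dgemm and roughly 3.4x faster for sgemm, which is why the missing Zen sgemm kernel matters much more in practice.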
I also tested OpenBLAS performance with PyTorch for a transformer network that I am using (though it is a bit hard to use in isolation). For BLAS with threads, OpenBLAS was ~50% slower than MKL. Of course, that may not be true across all networks.
Update: oneMKL 2021.1 does have a Zen SGEMM kernel now.
Unfortunately, on a 3700X, the kernel is still slower than the AVX2 kernel (best of three runs):
AVX2 kernel:
$ LD_PRELOAD=./libfakeintel.so ./mt-sgemm-icc 4000 | grep GF
GFLOP/s rate: 838.183370 GF/s
Zen kernel:
$ ./mt-sgemm-icc 4000 | grep GF
GFLOP/s rate: 566.263011 GF/s
I'll try to rebuild libtorch against oneMKL and see how much difference there is in practice for a transformer network.
Is the oneMKL 2021.1 Zen kernel faster than OpenBLAS on your 3700X?
This is a follow-up to my comments in #460, but I thought it would be better to make an issue out of this rather than adding more comments to a merged pull request ;).
The question in short: MKL_DEBUG_CPU_TYPE cannot be used anymore in recent MKL versions. Is it still possible to use AVX2-optimized kernels on AMD Zen CPUs?
I did some more investigation. The good news is that Intel is apparently integrating Zen support into MKL. The bad news is that Zen kernels haven't been implemented for every BLAS function, and if no Zen kernel is available, MKL falls back to the slow SSE kernel.
The following uses MKL 2020.2.254 and a Ryzen 3700X.
First, I use the standard ACES DGEMM benchmark:
Hotspot in perf:
Clearly a Zen-optimized kernel. Then I made an SGEMM version of the same benchmark:
Not so stellar. perf reveals a code path using SSE (I checked the instructions used):
Next, we use LD_PRELOAD to override the function that detects Intel CPUs to always return true:
Much better! Top function in perf:
tl;dr: MKL seems to be moving towards supporting AMD Zen. However, it seems that Zen kernels haven't been implemented for every BLAS function yet. Possible (hopefully temporary) workaround: put a DT_NEEDED item for a detection-overriding library in a library's or program's ELF dynamic section.
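Whether the override comes from LD_PRELOAD or from a DT_NEEDED item (added, for example, with a tool such as patchelf), it only helps if the overriding library actually ends up in the process. The sketch below checks this at runtime on Linux with glibc's dl_iterate_phdr. The helper name library_is_loaded and the substring matching are choices made for this example, not something taken from MKL or from the benchmark programs.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <string.h>

/* Query passed to the dl_iterate_phdr callback. */
struct lib_query {
    const char *needle; /* substring to look for in loaded object paths */
    int found;          /* set to 1 once a matching object is seen */
};

/* Called once per loaded object (main program and shared libraries). */
static int match_object(struct dl_phdr_info *info, size_t size, void *data) {
    struct lib_query *query = data;
    (void)size; /* unused */
    if (info->dlpi_name != NULL &&
        strstr(info->dlpi_name, query->needle) != NULL) {
        query->found = 1;
        return 1; /* a nonzero return value stops the iteration */
    }
    return 0;
}

/* Returns 1 if any loaded object's path contains needle, otherwise 0. */
int library_is_loaded(const char *needle) {
    struct lib_query query = { needle, 0 };
    dl_iterate_phdr(match_object, &query);
    return query.found;
}
```

A program could call library_is_loaded("libfakeintel") at startup and print a warning when it returns 0, since the slower code path would then be used silently. This only confirms that the library was loaded. It does not prove that MKL resolves its detection function to the override, which also depends on MKL being linked dynamically.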