pytorch / builder

Continuous builder and binary build scripts for pytorch
BSD 2-Clause "Simplified" License

Update to recent MKL version and Zen performance #504

Closed danieldk closed 3 years ago

danieldk commented 4 years ago

This is a follow-up to my comments in #460, but I thought it would be better to make an issue out of this rather than adding more comments to a merged pull request ;).

The question in short: MKL_DEBUG_CPU_TYPE cannot be used anymore in recent MKL versions. Is it still possible to use AVX2-optimized kernels on AMD Zen CPUs?

I did some more investigation. The good news is that Intel is apparently integrating Zen support into MKL. The bad news is that Zen kernels haven't been implemented for every BLAS function yet, and when no Zen kernel is available, MKL falls back to the slow SSE kernel.

The following uses MKL 2020.2.254 on a Ryzen 3700X.

First, I use the standard ACES DGEMM benchmark:

$ ./mt-dgemm 4000 | grep GF
GFLOP/s rate:         227.809737 GF/s

Hotspot in perf:

65.34%  mt-dgemm  libmkl_def.so       [.] mkl_blas_def_dgemm_kernel_zen

Clearly a Zen-optimized kernel. Then I made an SGEMM version of the same benchmark:

$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate:         151.946679 GF/s

Not so stellar. perf reveals a code path using SSE (I checked the instructions used):

74.26%  mt-sgemm  libmkl_def.so     [.] LM_LOOPgas_1

Next, we use LD_PRELOAD to override the function that detects Intel CPUs so that it always returns true:

$ LD_PRELOAD=libfakeintel.so ./mt-sgemm 2000 | grep GF
GFLOP/s rate:         382.358381 GF/s

Much better! Top function in perf:

59.73%  mt-sgemm  libmkl_avx2.so          [.] mkl_blas_avx2_sgemm_kernel_0

tl;dr: MKL seems to be moving towards supporting AMD Zen. However, Zen kernels haven't been implemented for every BLAS function yet. A possible (hopefully temporary) workaround: put a DT_NEEDED entry in a library's or program's ELF dynamic section to override the detection.
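The override used in the runs above can be reproduced with a one-function shared library. A sketch (assuming `mkl_serv_intel_cpu_true` is the internal MKL symbol being overridden, as in the commands above; `fakeintel.c` / `libfakeintel.so` are just the names used here):

```c
/* fakeintel.c - override MKL's internal Intel-CPU check so the AVX2
 * dispatch path is taken on AMD Zen. When this library is loaded
 * before libmkl, this definition shadows MKL's own; returning 1
 * ("true") makes MKL believe it is running on an Intel CPU. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
```

Build it with `gcc -shared -fPIC -o libfakeintel.so fakeintel.c` and run with `LD_PRELOAD=./libfakeintel.so`. For the DT_NEEDED route, `patchelf --add-needed libfakeintel.so <binary>` bakes the dependency into the ELF dynamic section so no environment variable is needed.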

malfet commented 4 years ago

Can you please run the speed comparison between MKL 2020.0 (the last version that still supports MKL_DEBUG_CPU_TYPE) and 2020.3, which should have some perf regressions fixed? (see https://software.intel.com/content/www/us/en/develop/articles/intel-math-kernel-library-release-notes-and-new-features.html )

danieldk commented 4 years ago

> Can you please run the speed comparison between MKL 2020.0 (the last version that still supports MKL_DEBUG_CPU_TYPE) and 2020.3, which should have some perf regressions fixed? (see https://software.intel.com/content/www/us/en/develop/articles/intel-math-kernel-library-release-notes-and-new-features.html )

The latest version I see there is still 2020 update 2?

(I tested 2020.2.254)

danieldk commented 4 years ago

> Can you please run the speed comparison between MKL 2020.0 (the last version that still supports MKL_DEBUG_CPU_TYPE) and 2020.3, which should have some perf regressions fixed? (see https://software.intel.com/content/www/us/en/develop/articles/intel-math-kernel-library-release-notes-and-new-features.html )

It's available for me now as well. First of all, MKL_DEBUG_CPU_TYPE still does not work.

[sd]gemm performance on 2020.3 (same programs as above), best of three runs:

$ ./mt-dgemm 4000 | grep GF
GFLOP/s rate:         364.460444 GF/s
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate:         250.027763 GF/s

Now, while overriding mkl_serv_intel_cpu_true to always detect an Intel CPU, which results in using the AVX2 code paths on the Ryzen 3700X:

$ LD_PRELOAD=libfakeintel.so ./mt-dgemm 4000 | grep GF
GFLOP/s rate:         431.414968 GF/s
$ LD_PRELOAD=~/Desktop/models/german/foo/libintel.so ./mt-sgemm 4000 | grep GF
GFLOP/s rate:         858.677202 GF/s

Summary:

I guess the best options are either to revert to MKL 2020.0 or to override the Intel CPU check.

I also tested OpenBLAS performance with PyTorch for a transformer network that I am using (though it is a bit hard to benchmark BLAS in isolation there). With threaded BLAS, OpenBLAS was ~50% slower than MKL. Of course, that may not hold across all networks.

danieldk commented 3 years ago

Update: oneMKL 2021.1 does have a Zen SGEMM kernel now.

[Screenshot from 2021-03-14 11-34-51]

Unfortunately, on a 3700X, the kernel is still slower than the AVX2 kernel (best of three runs):

AVX2 kernel

$ LD_PRELOAD=./libfakeintel.so ./mt-sgemm-icc 4000 | grep GF
GFLOP/s rate:         838.183370 GF/s

Zen kernel

$ ./mt-sgemm-icc 4000 | grep GF
GFLOP/s rate:         566.263011 GF/s

I'll try to rebuild libtorch against oneMKL and see how much difference there is in practice for a transformer network.

ekerazha commented 3 years ago

> Update: oneMKL 2021.1 does have a Zen SGEMM kernel now.
>
> [Screenshot from 2021-03-14 11-34-51]
>
> Unfortunately, on a 3700X, the kernel is still slower than the AVX2 kernel (best of three runs):
>
> AVX2 kernel
>
> $ LD_PRELOAD=./libfakeintel.so ./mt-sgemm-icc 4000 | grep GF
> GFLOP/s rate:         838.183370 GF/s
>
> Zen kernel
>
> $ ./mt-sgemm-icc 4000 | grep GF
> GFLOP/s rate:         566.263011 GF/s
>
> I'll try to rebuild libtorch against oneMKL and see how much difference there is in practice for a transformer network.

Is oneMKL 2021.1 Zen kernel faster than OpenBLAS on your 3700X?