Reproduced for 274be82 but not for tip of master. Is there a particular reason you cannot use the latest revision?
I was able to reproduce it with d279b39. We already have a fix for it; it should be out soon. As a workaround you can add -DCMAKE_BUILD_TYPE=Release to the cmake line. If that works for you I will close this issue.
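For example, applied to the DNNL configure line from the issue description below, the workaround would look like this (all other options unchanged):
cmake -DCMAKE_INSTALL_PREFIX=$PWD/../install/d279b39d -DDNNL_ARCH_OPT_FLAGS="" -DDNNL_BUILD_TESTS=OFF -DDNNL_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release ..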
mkl-dnn was compiled with the default build type, which is Release:
https://github.com/intel/mkl-dnn/blob/master/CMakeLists.txt#L50-L53
You are correct, that is the default build mode. This is related to a build issue. Even though that's the default build mode, it seems the s8s8 gemm file (mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp) was not being optimized.
Relying on the default mode:
g++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/build_bisect/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/common -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/xbyak -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
Using -DCMAKE_BUILD_TYPE=Release:
g++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -Ilrepos/mkl-dnn/build_bisect/include -Ilrepos/mkl-dnn/src -Ilrepos/mkl-dnn/src/common -Ilrepos/mkl-dnn/src/cpu -Ilrepos/mkl-dnn/src/cpu/xbyak -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
In other words, the default build line doesn't contain -O3 -DNDEBUG -D_FORTIFY_SOURCE=2, which makes the function slow. This is fixed in the internal repository and will be pushed out as soon as possible. Please use the workaround (-DCMAKE_BUILD_TYPE=Release) if possible.
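As a side note, one way to confirm which flags a given file is actually compiled with is a verbose build, e.g.:
make VERBOSE=1
which prints the full compile command for each translation unit.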
Thanks for the details! I verified that the file mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp is compiled with optimization flags:
cd /home/klein/dev/mkl-dnn/build/src/cpu && /usr/bin/c++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I/home/klein/dev/mkl-dnn/include -I/home/klein/dev/mkl-dnn/build/include -I/home/klein/dev/mkl-dnn/src -I/home/klein/dev/mkl-dnn/src/common -I/home/klein/dev/mkl-dnn/src/cpu -I/home/klein/dev/mkl-dnn/src/cpu/xbyak -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /home/klein/dev/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
However, that does not seem to change the benchmark numbers reported above.
> I was able to reproduce it with d279b39
@aaraujom What was the speed difference that you measured before and after the build fix?
It was approximately 3x slower when not using optimizations. If both builds have optimizations enabled, I don't see a significant difference on my system. In other words, I can't reproduce it.
Using v0.21.2:
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking FUN(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.59345 ms
Using d279b39:
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking FUN(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.610478 ms
You might want to try the following to root cause the issue:
OMP_PLACES=cores OMP_PROC_BIND=close for GNU OpenMP, or KMP_AFFINITY=compact,granularity=fine for Intel OpenMP (LD_PRELOAD=/path/to/libiomp5.so ./your_executable). This will help to get more stable results and eliminate the possibility of performance differences due to the threading library used.
I compiled MKL-DNN v0.21.2 with -DMKLDNN_THREADING=OMP:COMP to use the same OpenMP runtime, and set OMP_PLACES=cores OMP_PROC_BIND=close during execution. The benchmark numbers are still in the same order of magnitude:
$ ldd benchmark_gemm_s8s8_old
linux-vdso.so.1 => (0x00007fff79192000)
libmkldnn.so.0 => /home/klein/dev/mkl-dnn/install/0.21.2/lib/libmkldnn.so.0 (0x00007f4bf8ae7000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4bf8704000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4bf833a000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4bf8136000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4bf7e2d000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f4bf7bf6000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4bf79de000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4bf77c1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4bfb75d000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.208248 ms
$ ldd benchmark_gemm_s8s8_new
linux-vdso.so.1 => (0x00007ffd41940000)
libdnnl.so.1 => /home/klein/dev/mkl-dnn/install/d279b39d/lib/libdnnl.so.1 (0x00007fb0660e6000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb065d03000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb065939000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb06571c000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb065518000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb06520f000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fb064fd8000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb064dc0000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb067634000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.299706 ms
I guess this issue can be closed if it is not reproducible.
@guillaumekln I was able to reproduce it on an i7-6700K. There seem to be two issues: one with the build system and one with the s8s8 gemm. Thank you for your persistence, I will investigate.
$ make run
echo "Running benchmark for v0.21.2"
Running benchmark for v0.21.2
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.249923 ms
echo "Running benchmark for d279b39"
Running benchmark for d279b39
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.359445 ms
I did a quick investigation. It seems this is a real issue, but it is not so simple to fix since it depends on how memory is managed in DNNL.
As a workaround, instead of limiting the run to only 4 threads, use all 8 available threads.
While MKL performs best when using the exact number of physical cores on your machine (4) instead of all the hyperthreads (8), for d279b39 using all hyperthreads gives the best performance, comparable to v0.21.2:
4 threads v0.21.2
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.248977 ms
8 threads v0.21.2
$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.448652 ms
4 threads d279b39
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.359389 ms
8 threads d279b39
$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.243672 ms
Thanks for the details. Do you expect this to change in the future, or should we simply assume this is how DNNL operates?
We certainly want to resolve this. We are just still bikeshedding the solution :) In DNNL we have scratchpads as well, so we are not sure whether we need a memory manager or need to change the GEMM API.
Upon further investigation, the root cause is in the block size. Reassigning to @aaraujom for analysis and a fix.
I'm curious, just reading this, if OMP_NUM_THREADS=4 for this test is thought to imply that thread assignment is one per core ... Does this problem disappear if hyper-threading is disabled?
Closing as stale.
Summary
https://github.com/intel/mkl-dnn/commit/274be8228a0dba6391c2769c37cd68a3bb730fbf added AVX2 optimizations for igemm kernels (as discussed in https://github.com/intel/mkl-dnn/issues/532). However, the execution appears to be 1.4x slower than using version v0.21 compiled with Intel MKL.
In our specific case, this is blocking an upgrade to a newer version of DNNL as Intel MKL offers better performance and a wider range of optimized instruction sets.
Version
The benchmark was run with the latest commit on master: d279b39d978b7d3d3f4f69a427f4cb91d754b9fe.
Environment
Linux minigpu 4.4.0-166-generic #195-Ubuntu SMP Tue Oct 1 09:35:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Steps to reproduce
Here is the code that I used for benchmarking:
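(The original benchmark_gemm_s8s8.cc is not reproduced here; below is a minimal sketch of such a benchmark against the d279b39 API, i.e. the value-based mkldnn_gemm_s8s8s32 call shown in the outputs above. The matrix sizes, the number of timed iterations, and the fixed C offset mode are placeholder assumptions, not the values from the original report.)

// Minimal sketch of the s8s8 gemm benchmark; placeholder sizes and options.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

#include <mkldnn.h>

int main() {
    // Placeholder problem size (assumption); the original report does not state m, n, k.
    const int64_t m = 512, n = 512, k = 512;
    const int samples = 100;  // number of timed calls (assumption)

    std::vector<int8_t> a(m * k, 1), b(k * n, 1);
    std::vector<int32_t> c(m * n, 0);
    const float alpha = 1.0f, beta = 0.0f;
    const int8_t ao = 0, bo = 0;
    const int32_t co = 0;  // single fixed offset for C ('F' mode, assumption)

    // Warm-up call so that one-time initialization is not measured.
    mkldnn_gemm_s8s8s32('N', 'N', 'F', m, n, k, alpha, a.data(), k, ao,
                        b.data(), n, bo, beta, c.data(), n, &co);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < samples; ++i)
        mkldnn_gemm_s8s8s32('N', 'N', 'F', m, n, k, alpha, a.data(), k, ao,
                            b.data(), n, bo, beta, c.data(), n, &co);
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> elapsed = end - start;
    std::cout << "avg " << elapsed.count() / samples << " ms" << std::endl;
    return 0;
}

With v0.21.2 the equivalent call is the pointer-based mkldnn_gemm_s8s8s32 variant shown in the benchmark output above (arguments passed by address, with A/B and m/n swapped for the column-major convention).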
Compilation with v0.21.2
cmake -DCMAKE_INSTALL_PREFIX=$PWD/../install/0.21.2 -DARCH_OPT_FLAGS="" -DMKLROOT=/opt/intel/mkl -DMKLDNN_USE_MKL=FULL:STATIC -DMKLDNN_THREADING=OMP:INTEL -DWITH_TEST=OFF -DWITH_EXAMPLE=OFF ..
g++ -std=c++17 -O3 -I$PWD/install/0.21.2/include -L$PWD/install/0.21.2/lib -Wl,-rpath,$PWD/install/0.21.2/lib -L/opt/intel/lib/intel64/ -Wl,-rpath,/opt/intel/lib/intel64/ -o benchmark_gemm_s8s8 benchmark_gemm_s8s8.cc -lmkldnn -liomp5
Compilation with d279b39d978b7d3d3f4f69a427f4cb91d754b9fe
cmake -DCMAKE_INSTALL_PREFIX=$PWD/../install/d279b39d -DDNNL_ARCH_OPT_FLAGS="" -DDNNL_BUILD_TESTS=OFF -DDNNL_BUILD_EXAMPLES=OFF ..
g++ -std=c++17 -O3 -I$PWD/install/d279b39d/include -L$PWD/install/d279b39d/lib -Wl,-rpath,$PWD/install/d279b39d/lib -o benchmark_gemm_s8s8 benchmark_gemm_s8s8.cc -lmkldnn
Execution
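The benchmark binary is run with 4 OpenMP threads, as in the runs quoted above:
OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8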
Observed behavior
Here are the observed results on my system:
v0.21.2: avg 0.217827 ms
d279b39: avg 0.302498 ms
Expected behavior
The execution should ideally be as fast as or faster than the older version.
Do you think the performance of gemm_s8s8s32 on AVX2 could be improved in the future? If not, what are your recommendations to reach the performance of v0.21 without Intel MKL? Thanks for your time.