oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

gemm_s8s8s32: latest version is 1.4x slower than v0.21 with Intel MKL on AVX2 #619

Closed guillaumekln closed 4 months ago

guillaumekln commented 4 years ago

Summary

https://github.com/intel/mkl-dnn/commit/274be8228a0dba6391c2769c37cd68a3bb730fbf added AVX2 optimizations for igemm kernels (as discussed in https://github.com/intel/mkl-dnn/issues/532). However, the execution appears to be 1.4x slower than using version v0.21 compiled with Intel MKL.

In our specific case, this is blocking an upgrade to a newer version of DNNL as Intel MKL offers better performance and a wider range of optimized instruction sets.

Version

The benchmark was run with the latest commit on master: d279b39d978b7d3d3f4f69a427f4cb91d754b9fe.

Environment

Steps to reproduce

Here is the code that I used for benchmarking:

#include <algorithm>
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <mkldnn.hpp>

#define BENCHMARK(fun_call, samples)                                    \
  do {                                                                  \
    std::cerr << "benchmarking "#fun_call << std::endl;                 \
    for (size_t i = 0; i < 10; ++i)                                     \
      fun_call;                                                         \
    std::chrono::high_resolution_clock::time_point t1 =                 \
      std::chrono::high_resolution_clock::now();                        \
    for (size_t i = 0; i < static_cast<size_t>(samples); ++i)           \
      fun_call;                                                         \
    std::chrono::high_resolution_clock::time_point t2 =                 \
      std::chrono::high_resolution_clock::now();                        \
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(); \
    std::cerr << "avg   "                                               \
      << static_cast<double>(duration) / (samples * 1000)               \
              << " ms" << std::endl;                                    \
  } while (false)

template <typename T>
static T* create_buffer(int size, T value = T()) {
  T* buffer = static_cast<T*>(aligned_alloc(64, size * sizeof (T)));
  std::fill(buffer, buffer + size, value);
  return buffer;
}

int main() {
  const int m = 64;
  const int n = 2048;
  const int k = 512;

  const char* transa = "N";
  const char* transb = "T";
  const char* offsetc = "F";

  const int lda = *transa == 'T' ? m : k;
  const int ldb = *transb == 'T' ? k : n;
  const int ldc = n;

  const float alpha = 1;
  const float beta = 0;

  int8_t ao = 0;
  int8_t bo = 0;
  int32_t co = 0;

  auto* a = create_buffer<int8_t>(m * k, 1);
  auto* b = create_buffer<int8_t>(n * k, 2);
  auto* c = create_buffer<int32_t>(m * n);

#if MKLDNN_VERSION_MAJOR < 1
  // mkl-dnn 0.x assumes column major order so swap a and b accordingly.
  BENCHMARK(mkldnn_gemm_s8s8s32(transb, transa, offsetc,
                                &n, &m, &k,
                                &alpha,
                                b, &ldb, &bo,
                                a, &lda, &ao,
                                &beta,
                                c, &ldc, &co),
            50000);
#else
  BENCHMARK(mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc,
                                m, n, k,
                                alpha,
                                a, lda, ao,
                                b, ldb, bo,
                                beta,
                                c, ldc, &co),
            50000);
#endif

  free(a);
  free(b);
  free(c);
}

Compilation with v0.21.2
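(The exact build commands are omitted here; the following is only an illustrative sketch of a default v0.21.2 build and of compiling the benchmark against it. The source file name and exact flags are placeholders, and Intel MKL is assumed to be discoverable, e.g. via MKLROOT.)

# illustrative sketch only, not the exact commands used
git checkout v0.21.2
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/home/klein/dev/mkl-dnn/install/0.21.2 ..
make -j && make install
# link the benchmark against the v0.x library (libmkldnn.so.0)
g++ -std=c++11 -O2 benchmark_gemm_s8s8.cpp -I/home/klein/dev/mkl-dnn/install/0.21.2/include -L/home/klein/dev/mkl-dnn/install/0.21.2/lib -lmkldnn -o benchmark_gemm_s8s8_old
# (make sure the shared library is found at run time, e.g. via LD_LIBRARY_PATH or an rpath)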

Compilation with d279b39d978b7d3d3f4f69a427f4cb91d754b9fe
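(Again an illustrative sketch only, with placeholder names; at this revision the mkldnn compatibility header and mkldnn_* C API used by the benchmark appear to still be provided, and the library is libdnnl.so.1.)

# illustrative sketch only, not the exact commands used
git checkout d279b39d978b7d3d3f4f69a427f4cb91d754b9fe
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/home/klein/dev/mkl-dnn/install/d279b39d ..
make -j && make install
# link the benchmark against the v1.x library (libdnnl.so.1)
g++ -std=c++11 -O2 benchmark_gemm_s8s8.cpp -I/home/klein/dev/mkl-dnn/install/d279b39d/include -L/home/klein/dev/mkl-dnn/install/d279b39d/lib -ldnnl -o benchmark_gemm_s8s8_new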

Execution

OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8

Observed behavior

Here are the observed results on my system:

Expected behavior

The execution should ideally be faster or as fast as an older version.

Do you think the performance of gemm_s8s8s32 on AVX2 could be improved in the future? If not, what are your recommendations to reach the performance of v0.21 without Intel MKL?

Thanks for your time.

rsdubtso commented 4 years ago

Reproduced for 274be82 but not for tip of master. Is there a particular reason you cannot use the latest revision?

aaraujom commented 4 years ago

I was able to reproduce it with d279b39. We already have a fix for it, and it should be out soon. As a workaround, you can add -DCMAKE_BUILD_TYPE=Release to the cmake line. If that works for you, I will close this issue.

guillaumekln commented 4 years ago

mkl-dnn was compiled with the default build type, which is Release:

https://github.com/intel/mkl-dnn/blob/master/CMakeLists.txt#L50-L53

aaraujom commented 4 years ago

You are correct, that is the default build mode. This is related to a build issue.

Even though that's the default build mode, it seems the s8s8 gemm file (mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp) was not being optimized.

Relying on the default build mode:

g++   -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden  -Wall -Wno-unknown-pragmas -fvisibility=internal  -fPIC -Wformat -Wformat-security -fstack-protector-strong  -fopenmp -Wmissing-field-initializers  -Wno-strict-overflow                                   -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/build_bisect/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/common -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/xbyak    -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp

Using -DCMAKE_BUILD_TYPE=Release:

g++   -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden  -Wall -Wno-unknown-pragmas -fvisibility=internal  -fPIC -Wformat -Wformat-security -fstack-protector-strong  -fopenmp -Wmissing-field-initializers  -Wno-strict-overflow  -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -Ilrepos/mkl-dnn/build_bisect/include -Ilrepos/mkl-dnn/src -Ilrepos/mkl-dnn/src/common -Ilrepos/mkl-dnn/src/cpu -Ilrepos/mkl-dnn/src/cpu/xbyak    -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp

In other words, the build line doesn't contain -O3 -DNDEBUG -D_FORTIFY_SOURCE=2, which makes the function slow. This is fixed in our internal repository and will be pushed out as soon as possible.

Please use the workaround (-DCMAKE_BUILD_TYPE=Release) if possible.
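For completeness, a minimal sketch of applying the workaround from a clean build tree; the sanity check at the end is a generic CMake/Make trick (forcing a recompile with verbose output), not anything specific to this project:

# rebuild with the workaround in a fresh build directory
rm -rf build && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
# optional check: force a recompile of the file in question and confirm -O3 is on its compile line
touch ../src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
make VERBOSE=1 | grep simple_gemm_s8s8s32.cpp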

guillaumekln commented 4 years ago

Thanks for the details! I verified that the file mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp is compiled with optimization flags:

cd /home/klein/dev/mkl-dnn/build/src/cpu && /usr/bin/c++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I/home/klein/dev/mkl-dnn/include -I/home/klein/dev/mkl-dnn/build/include -I/home/klein/dev/mkl-dnn/src -I/home/klein/dev/mkl-dnn/src/common -I/home/klein/dev/mkl-dnn/src/cpu -I/home/klein/dev/mkl-dnn/src/cpu/xbyak -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /home/klein/dev/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp

However, that does not seem to change the benchmark numbers reported above.

I was able to reproduce it with d279b39

@aaraujom What was the speed difference that you measured before and after the build fix?

aaraujom commented 4 years ago

The build without optimizations was approximately 3x slower. If both builds have optimizations, I don't see a significant difference on my system; in other words, I can't reproduce it.

Using v0.21.2:

$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking FUN(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg   0.59345 ms

Using d279b39:

$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking FUN(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg   0.610478 ms

You might want to try the following to root cause the issue:

This will help get more stable results and rule out performance differences caused by the threading library used.

guillaumekln commented 4 years ago

I compiled MKL-DNN v0.21.2 with -DMKLDNN_THREADING=OMP:COMP to use the same OpenMP runtime, and set OMP_PLACES=cores OMP_PROC_BIND=close during execution. The benchmark numbers still show roughly the same gap:

$ ldd benchmark_gemm_s8s8_old 
        linux-vdso.so.1 =>  (0x00007fff79192000)
        libmkldnn.so.0 => /home/klein/dev/mkl-dnn/install/0.21.2/lib/libmkldnn.so.0 (0x00007f4bf8ae7000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4bf8704000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4bf833a000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4bf8136000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4bf7e2d000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f4bf7bf6000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4bf79de000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4bf77c1000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f4bfb75d000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg   0.208248 ms

$ ldd benchmark_gemm_s8s8_new
        linux-vdso.so.1 =>  (0x00007ffd41940000)
        libdnnl.so.1 => /home/klein/dev/mkl-dnn/install/d279b39d/lib/libdnnl.so.1 (0x00007fb0660e6000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb065d03000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb065939000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb06571c000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb065518000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb06520f000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fb064fd8000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb064dc0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb067634000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg   0.299706 ms

I guess this issue can be closed if it is not reproducible.

aaraujom commented 4 years ago

@guillaumekln I was able to reproduce it on an i7-6700K. There seem to be two issues, one with the build system and one with the s8s8 gemm. Thank you for your persistence; I will investigate.

$ make run
echo "Running benchmark for v0.21.2"
Running benchmark for v0.21.2
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
                        ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg   0.249923 ms
echo "Running benchmark for d279b39"
Running benchmark for d279b39
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
                        ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg   0.359445 ms

aaraujom commented 4 years ago

I did a quick investigation. It seems this is a real issue, but it is not simple to fix since it depends on how memory is managed in DNNL.

As a workaround, instead of limiting the run to only 4 threads, use all 8 available threads.

While MKL performs best when using the exact number of physical cores of your machine (4) rather than all hyperthreads (8), for d279b39 using all hyperthreads gives the best performance, comparable to v0.21.2:

4 threads v0.21.2

$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg   0.248977 ms

8 threads v0.21.2

$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg   0.448652 ms

4 threads d279b39

$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg   0.359389 ms

8 threads d279b39

$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg   0.243672 ms

guillaumekln commented 4 years ago

Thanks for the details. Do you expect this to change in the future or should we simply assume it is how DNNL operates?

rsdubtso commented 4 years ago

We certainly want to resolve this; we are just still bikeshedding the solution :) In DNNL we have scratchpads as well, so we are not sure whether we need a memory manager or a change to the GEMM API.

vpirogov commented 4 years ago

Upon further investigation, the root cause is the block size. Reassigning to @aaraujom for analysis and a fix.

jnorwood commented 3 years ago

Just reading this, I'm curious whether OMP_NUM_THREADS=4 for this test is assumed to imply one thread per physical core. Does this problem disappear if hyper-threading is disabled?
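A generic way to check the core topology and where the OpenMP runtime places its threads, independent of this particular issue (the binary name matches the one used above):

# logical CPUs sharing a CORE id are hyper-threaded siblings
lscpu --extended=CPU,CORE,SOCKET,ONLINE
# OMP_DISPLAY_ENV (OpenMP 4.0+) makes the runtime print its settings at startup
OMP_DISPLAY_ENV=true OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close ./benchmark_gemm_s8s8_new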

vpirogov commented 4 months ago

Closing as stale.