oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

Can we force MKLDNN to use cblas always instead of JIT kernels generated at runtime #415

Closed avinashcpandey closed 5 years ago

avinashcpandey commented 5 years ago

Can we force MKL-DNN to always use CBLAS functions instead of JIT kernels generated at runtime? I am using the external library OpenBLAS and want to use it for all GEMM-related work.

emfomenk commented 5 years ago

Intel MKL-DNN doesn't give any guarantees about which GEMM is used when both CBLAS and the JIT GEMM are available. In the current implementation, mkldnn_sgemm() will use regular CBLAS if it is available. However, the RNN primitive might explicitly use the jitted GEMM even if CBLAS is available.

Is there any particular reason why you want to avoid using jitted gemm?

avinashcpandey commented 5 years ago

Hi, I want to compare MKL-DNN with OpenBLAS against MKL-DNN with MKL BLAS, and for that I want all convolution operations to go through the regular GEMM path. In the end I will also evaluate MKL-DNN with the JIT-based convolutions. I am doing this with the TensorFlow framework on the ResNet-50 model. With profiling I see most of the time going into JIT-based kernels, which VTune shows as "outside any module" in the profiled data.

I am looking for some way to force MKL-DNN not to use the JIT-based convolution kernels.

emfomenk commented 5 years ago

To disable JIT-based convolutions, use the trick described here.

For convolutions it is safe to assume that they will use the CBLAS sgemm for now. You can verify this by looking at the MKLDNN_VERBOSE output: if the jitted-GEMM-based convolution is used you will see gemm:jit; if CBLAS is used, gemm:blas.

To enable the VTune report, see this page.
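
For reference, a minimal way to turn the verbose output on from a C++ test harness (assuming a POSIX environment and that the library reads the MKLDNN_VERBOSE environment variable at run time; exporting the variable in the shell before launching the workload works just as well):

#include <cstdlib>

int main() {
    // Level 1 prints one line per primitive execution, e.g.
    // "mkldnn_verbose,exec,convolution,gemm:blas,..." (samples appear later in this thread).
    setenv("MKLDNN_VERBOSE", "1", /*overwrite=*/1);
    // ... create and execute the MKL-DNN primitives under test here ...
    return 0;
}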

avinashcpandey commented 5 years ago

Thanks @emfomenk. I tried this with simplenet and it is working fine. Now I want to try the same with TensorFlow. I know TensorFlow downloads the MKL-DNN sources, builds them, and then links the library statically. Is there a way to link it as a shared library in TensorFlow, so that I can modify the MKL-DNN code and relink it easily?

emfomenk commented 5 years ago

Hi @avinashcpandey,

I don't think TF has a standard way of doing so. Some time back I worked around it by removing the Intel MKL-DNN sources from TF (hence leaving the symbols undefined) and then running the workloads with LD_PRELOAD=libmkldnn.so. That allowed me to make quick library replacements, but I am not sure that still works.

There is also a way to tell bazel where to download external dependencies, and then (once they are downloaded for the first time) you can patch them and rebuild TF. That should be relatively fast. I think the option is --output_base=/path/to/dir.

mgouicem commented 5 years ago

Hi @avinashcpandey, you can use your own copy of the MKL-DNN sources with TensorFlow by modifying the tensorflow/workspace.bzl file. Modify the mkl_dnn rule as follows:

native.new_local_repository(
    name = "mkl_dnn",
    build_file = clean_dep("//third_party/mkl_dnn:mkldnn.BUILD"),
    path = "/path/to/your/local/copy",
)

Hope this helps.

avinashcpandey commented 5 years ago

Thanks @emfomenk and @mgouicem! It's working for me.

avinashcpandey commented 5 years ago

I did some benchmarking between the JIT-based and the non-JIT-based (GEMM-based) convolution. For the GEMM-based path I used MKL BLAS calls. I see a significant performance difference between the two.

I ran the ResNet-50 model in TensorFlow (python tf_cnn_benchmarks.py --device=cpu --model=resnet50 --data_format=NHWC --batch_size=256 --num_batches=1 --num_inter_threads=1 --num_intra_threads=16 --mkl=True --nodistortions --forward_only=True). Observations: 583 convolution calls in total; convolution time with JIT: 31 sec; convolution time without JIT (MKL BLAS): 59 sec, with 149248 BLAS sgemm calls in total. I see 256 sgemm calls for every convolution (same matrix size), i.e. 256*583 = 149248.

I know the JIT kernel will be faster, but I did not expect such a difference even with MKL BLAS. Why do I see such a huge number of sgemm calls (256*583) in the non-jitted version?

vpirogov commented 5 years ago

I would guess that TensorFlow uses its own implementation when you force convolutions to BLAS. It might just be parallelism over the batch size.

Did you look at what MKLDNN_VERBOSE prints out during these runs?

avinashcpandey commented 5 years ago

I built TF with MKL BLAS, so the BLAS calls go into MKL.

With jitted kernels: mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:x fdst:nChw8c,alg:convolution_direct

Without jitted kernels: mkldnn_verbose,exec,convolution,gemm:blas,forward_training,fsrc:nchw fwei:oihw fbia:x fdst:nchw,alg:convolution_direct

Convolution time with JIT’ed kernels: 5059.991 ms; with non-JIT’ed kernels (1.7x slower): 8994.4565 ms.

ResNet-50: convolution time with JIT’ed kernels: 30910.3539 ms; with non-JIT’ed kernels (1.9x slower): 58822.9284 ms.

The question is: why is the performance with MKL BLAS that much slower?

kwiersch commented 5 years ago

@avinashcpandey GEMM-based convolution requires a so-called im2col transformation of the src data so that the convolution reduces to a GEMM operation. That adds some overhead, so the JIT kernel is indeed expected to be faster. In general, the GEMM-based implementation is intended for convolution shapes not yet supported by JIT (such as grouped convolutions with a number of channels per group that is not a multiple of the SIMD width).
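
For illustration, here is a minimal plain-C++ sketch of the im2col-plus-GEMM idea (stride 1, no padding, a single image; the function names and the naive GEMM are stand-ins, not the MKL-DNN implementation):

#include <vector>

// Unpack src (C x H x W) into columns so that every output pixel becomes one
// column of length C*KH*KW. This extra buffer and memory traffic is the
// overhead that the direct JIT kernels avoid.
void im2col(const float *src, int C, int H, int W, int KH, int KW, float *col) {
    const int OH = H - KH + 1, OW = W - KW + 1;
    for (int c = 0; c < C; ++c)
        for (int kh = 0; kh < KH; ++kh)
            for (int kw = 0; kw < KW; ++kw)
                for (int oh = 0; oh < OH; ++oh)
                    for (int ow = 0; ow < OW; ++ow)
                        col[(((c * KH + kh) * KW + kw) * OH + oh) * OW + ow]
                                = src[(c * H + oh + kh) * W + ow + kw];
}

// Naive stand-in for the cblas_sgemm call: C(M x N) = A(M x K) * B(K x N).
void naive_gemm(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

// Convolution as weights(OC x C*KH*KW) times col(C*KH*KW x OH*OW) -> dst(OC x OH*OW).
void conv_via_gemm(const float *src, const float *wei, float *dst,
        int C, int H, int W, int OC, int KH, int KW) {
    const int OH = H - KH + 1, OW = W - KW + 1;
    std::vector<float> col((size_t)C * KH * KW * OH * OW);
    im2col(src, C, H, W, KH, KW, col.data());
    naive_gemm(wei, col.data(), dst, OC, OH * OW, C * KH * KW);
}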

avinashcpandey commented 5 years ago

Thanks @kwiersch, I get it now. I observed one more thing: for AlexNet with batch size 1 the number of JIT-based convolution calls is 55, and the number of GEMM calls is also 55. However, with batch size 256 the number of convolution calls stays at 55, but the number of GEMM calls increases to 55*246.

The GEMM calls correspond to the convolution sizes, which remain the same (33) in both cases.

I am trying to understand why, as the batch size increases, the number of JIT-based convolution calls remains the same while the number of GEMM-based calls increases.

Can you point me to some paper that explains this?

kwiersch commented 5 years ago

@avinashcpandey can you explain how you are counting the calls to JIT-based and GEMM-based convolution? Does it appear in the MKLDNN_VERBOSE output (if so, please share it!), or are you adding some hook to count the calls?

One possible explanation for the "extra" GEMM calls is the parallelization strategy for large minibatches: as you can see at https://github.com/intel/mkl-dnn/blob/master/src/cpu/gemm_convolution.cpp#L63, the work is split by minibatch, so there will be jcp.mb calls to SGEMM.

avinashcpandey commented 5 years ago

With MKL_VERBOSE and MKL_DUMP I get the number of JIT kernels for a particular run; at the same time I have a print statement (printing the matrix shape) in the CBLAS interface for GEMM, which is linked to MKL, and at the end I grep for it.

This way I see the difference in the number of GEMM calls with and without the JIT interface.

Sample output:

With jitted kernels: mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:x fdst:nChw8c,alg:convolution_direct

Without jitted kernels: mkldnn_verbose,exec,convolution,gemm:blas,forward_training,fsrc:nchw fwei:oihw fbia:x fdst:nchw,alg:convolution_direct

kwiersch commented 5 years ago

Okay. In that case, I don't think it is correct to say that the number of JIT calls stays the same while the number of GEMM calls increases. Instead, think of it as the number of JIT-based or GEMM-based convolution primitives remaining constant (55, as you said above). Within each GEMM-based convolution primitive, the number of calls to SGEMM increases from 1 to 246 as the batch size is changed from 1 to 246. As I suspected, this is due to the way the GEMM-based implementation chooses its threading strategy: for small batch sizes, threading is done within the SGEMM routine (i.e. one call to a parallel SGEMM), while for large batch sizes, threading is done outside of the SGEMM routine (i.e. many calls to a sequential SGEMM). Hope that makes sense.
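
A rough sketch of those two strategies in plain C++ with OpenMP (the naive sgemm_seq stand-in and the thread-count heuristic are illustrative assumptions, not the MKL-DNN code; each image is assumed to reduce to one M x N x K GEMM after im2col):

#include <cstddef>
#include <omp.h>

// Naive stand-in for a single-threaded SGEMM: C(M x N) = A(M x K) * B(K x N).
static void sgemm_seq(int M, int N, int K, const float *A, const float *B, float *C) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

void conv_gemm_batched(int mb, int M, int N, int K,
        const float *wei, const float *col, float *dst) {
    if (mb < omp_get_max_threads()) {
        // Small batch: few GEMM calls; a real implementation would let a
        // multi-threaded BLAS parallelize inside each call.
        for (int b = 0; b < mb; ++b)
            sgemm_seq(M, N, K, wei, col + (size_t)b * K * N, dst + (size_t)b * M * N);
    } else {
        // Large batch: thread over the images and issue one sequential GEMM per
        // image, which is why the number of observed SGEMM calls grows with mb.
        #pragma omp parallel for
        for (int b = 0; b < mb; ++b)
            sgemm_seq(M, N, K, wei, col + (size_t)b * K * N, dst + (size_t)b * M * N);
    }
}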

avinashcpandey commented 5 years ago

OK, I got it. One last thing: algo:convolution_direct is a matrix-multiplication-based algorithm, right? And is the same algorithm implemented in both the JIT-based and the GEMM-based versions? I am curious about how the JIT-based convolution_direct kernels work.

kwiersch commented 5 years ago

So, the convolution algorithm has three possible values: convolution_auto, convolution_direct, and convolution_wino. The first chooses for you, the second is "direct" as in "do the calculations directly without any complex changes to the order of operations", and the third is Winograd, which you can read about here.

JIT-based convolution uses just-in-time compilation to generate a computational kernel during initialization. For more JIT implementation details, see the source code: the harness/driver and the kernel generator.
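
Those kernels are emitted at run time with the Xbyak assembler library. As a toy illustration of the mechanism only (nothing like a real convolution kernel; assumes x86-64 Linux and the System V calling convention):

#include <xbyak/xbyak.h>

// Generates, at run time, a function equivalent to: int add_one(int x) { return x + 1; }
struct AddOneGenerator : Xbyak::CodeGenerator {
    AddOneGenerator() {
        mov(eax, edi);  // first integer argument arrives in edi (System V ABI)
        add(eax, 1);
        ret();
    }
};

int main() {
    AddOneGenerator gen;
    auto add_one = gen.getCode<int (*)(int)>();  // pointer into the generated code
    return add_one(41) == 42 ? 0 : 1;
}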

Not exactly sure what you mean by mat mul...

avinashcpandey commented 5 years ago

Thanks @kwiersch. By mat mul I meant matrix multiplication: the convolution operation gets transformed into a matrix multiplication, which is then computed using a BLAS library or a JIT-based kernel.

kwiersch commented 5 years ago

Okay, I think I understand. From that perspective, the "JIT" implementations are not really mat mul (at least not in the way the "GEMM" implementations are). Instead of transforming the input data for a call to SGEMM, the JIT kernels contained in src/cpu/jit_{isa}_conv* basically apply the convolution filter to the input image one row of pixels at a time (with some cache blocking and other techniques to boost performance).
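
As a plain-C++ approximation of that "direct" computation (stride 1, no padding, no blocking or vectorization; the real kernels are generated per shape and work on blocked layouts such as nChw8c):

// Direct convolution: accumulate straight from src and wei into dst,
// with no intermediate im2col buffer.
void conv_direct(const float *src, const float *wei, float *dst,
        int C, int H, int W, int OC, int KH, int KW) {
    const int OH = H - KH + 1, OW = W - KW + 1;
    for (int oc = 0; oc < OC; ++oc)
        for (int oh = 0; oh < OH; ++oh)        // one output row at a time
            for (int ow = 0; ow < OW; ++ow) {
                float acc = 0.f;
                for (int c = 0; c < C; ++c)
                    for (int kh = 0; kh < KH; ++kh)
                        for (int kw = 0; kw < KW; ++kw)
                            acc += src[(c * H + oh + kh) * W + ow + kw]
                                    * wei[((oc * C + c) * KH + kh) * KW + kw];
                dst[(oc * OH + oh) * OW + ow] = acc;
            }
}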

avinashcpandey commented 5 years ago

OK, this is what I wanted to know, thanks. I will look into the code you pointed to in src/cpu/jit_{isa}_conv*. Apart from that, if this is based on some research paper, can you point me to it?

vpirogov commented 5 years ago

The paper linked below explains the algorithms and optimizations in detail.

Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures

avinashcpandey commented 5 years ago

Thanks @vpirogov

kruus commented 5 years ago

USE_CBLAS (or ref_gemm) + v1.0 --> wrong results

  1. USE_CBLAS should not invoke cblas_sgemm for packed formats. Ex. TestGEMM_packed_fp32
    • I guess extended_sgemm is being used for integer matrix formats as in the Intel docs?
    • Possibly:
      1. USE_CBLAS should also check for a packed transa/transb value,
      2. and punt to some other integer reference impl (might ref_gemm<T> be compatible?)
    • Surprisingly, some packed tests pass with extended_sgemm invoking ref_gemm<float>, even though ref_gemm does not seem to check for the packed format
  2. The RNN code frequently creates gemm calls with *lda==0. Intel MKL cblas prints a loud error message every time this happens. Ex: cpu-rnn-inference-f32-cpp. The test passes.
    • This might be intentional: check_gemm_input explicitly allows the M, N, K dims to be zero.

The USE_CBLAS call to cblas_sgemm is here.

Here are examples of calls generating an ERROR message about lda

10: Test command: /local/kruus/mkl-dnn/build-jitd/examples/cpu-rnn-inference-f32-cpp
10: Test timeout computed to be: 9.99988e+06
10: Parameters:
10:  batch = 128
10:  feature size = 1024
10:  maximum source sequence length = 28
10:  maximum target sequence length = 28
10:  number of layers of the bidirectional encoder = 1
10:  number of layers of the unidirectional encoder = 7
10:  number of layers of the decoder = 8
10: cblas_sgemm(102,N,N;MNK=4096,128,1024;alpha=1.000000,A@ld=4112,B@ld=1040,beta=0.000000,C@ld=4112)
10: cblas_sgemm(102,P,N;MNK=4096,128,1024;alpha=1.000000,A@ld=0,B@ld=1040,beta=1.000000,C@ld=4112)
10:                                                                                                  
10: Intel MKL ERROR: Parameter 9 was incorrect on entry to cblas_sgemm.

(The cblas_sgemm lines are from a printf inserted before calling cblas_sgemm.) I also see that the cblas_sgemm calls in tests/gtests/test_rnn_forward can cause the same "Intel MKL ERROR: Parameter 9 was incorrect on entry to cblas_sgemm." spam.

Without USE_CBLAS, and temporarily disabling gemm_driver with if (0 && mayiuse(sse41)), one can test how the reference impl behaves. It too sees calls with *lda==0 (quietly). The reference impl invokes ref_gemm<float>.

10: ref_gemm<float>(N,N;MNK=4096,128,1024;alpha=1.000000,A@ld=4112,B@ld=1040,beta=0.000000,C@ld=4112,bias)
10: ref_gemm<float>(P,N;MNK=4096,128,1024;alpha=1.000000,A@ld=0,B@ld=1040,beta=1.000000,C@ld=4112,bias)

I also see a segfault, after many subtests with wrong result messages:

45: [ RUN      ] TestGEMM_fp32/gemm_test.TestGEMM/21
45: cblas_sgemm(102,n,n;MNK=2000,2000,2000;alpha=1.000000,A@ld=2000,B@ld=2000,beta=0.000000,C@ld=2000)
45: /local/kruus/mkl-dnn/tests/gtests/test_gemm_common.hpp:445: Failure
45: The difference between e and 0.0 is 1.4646060466766357, which exceeds 1e-4, where
45: e evaluates to 1.4646060466766357,
45: 0.0 evaluates to 0, and
45: 1e-4 evaluates to 0.0001.
45: Row: 0 Col: 0
etc.
45: [  FAILED  ] TestGEMM_packed_fp32/gemm_test.TestGEMM/26, where GetParam() = 104-byte object <74-74 00-00 00-00 00-00 D0-07 00-00 00-00 00-00 88-13 00-00 00-00 00-00 D0-07 00-00 00-00 00-00 00-00 80-3F 00-00 00-40 D0-07 00-00 00-00 00-00 D0-07 00-00 00-00 00-00 88-13 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00> (498 ms)
45: [ RUN      ] TestGEMM_packed_fp32/gemm_test.TestGEMM/27
32/37 Test #45: test_gemm_f32 .................................***Exception: SegFault 23.80 sec

Wrong results are often off by 0.8 to 3, way above the 1e-4 threshold.


Oh, using cblas with a v1.0 build (for the examples and tests) is a tiny bit difficult, mostly because register_exe was not pushing LDFLAGS into CMAKE_EXE_LINKER_FLAGS for the executable link step (now an EXTRA_SHARED_LIBS variable is used). I eventually just added a new MKLDNN_USE_CBLAS option to options.cmake, added a FindCBLAS script from INRIA, and added a tiny cmake/cblas.cmake.