MKL-DNN supports int8 quantization. This example shows MKL-DNN int8 usage: https://github.com/intel/mkl-dnn/blob/master/examples/simple_net_int8.cpp
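For reference, here is a minimal sketch of int8 quantization using the DNNL 1.x C++ API (the name MKL-DNN was later changed to DNNL/oneDNN): an f32 tensor is reordered to s8 with an output scale. The shapes and the scale value are made up for illustration, and later oneDNN releases changed the scaling API.

```cpp
#include <vector>
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // f32 source and s8 destination, a small 2x3 "nc" tensor (illustrative shapes).
    memory::dims dims = {2, 3};
    auto f32_md = memory::desc(dims, memory::data_type::f32, memory::format_tag::nc);
    auto s8_md  = memory::desc(dims, memory::data_type::s8,  memory::format_tag::nc);

    std::vector<float> src = {0.1f, -1.5f, 3.2f, 0.f, 2.7f, -0.4f};
    memory f32_mem(f32_md, eng, src.data());
    memory s8_mem(s8_md, eng);

    // Quantization scale computed off-line, e.g. 127 / max(|src|).
    primitive_attr attr;
    attr.set_output_scales(0, {127.f / 3.2f});

    // The reorder primitive applies the scale and converts f32 -> s8.
    auto r_pd = reorder::primitive_desc(eng, f32_md, eng, s8_md, attr);
    reorder(r_pd).execute(s, f32_mem, s8_mem);
    s.wait();
    return 0;
}
```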
MKL-DNN does not provide functionality for sparse computations. For that you can use the sparse BLAS routines in the MKL library: https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines
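For illustration, a minimal sketch of MKL's inspector-executor sparse BLAS API: a small matrix is created in CSR format and multiplied with itself via mkl_sparse_s_spmmd (sparse x sparse into a dense result). The values are arbitrary and error checking is omitted.

```cpp
#include <cstdio>
#include <vector>
#include "mkl_spblas.h"

int main() {
    // 3x3 sparse matrix in 0-based CSR:
    // [ 1 0 2 ]
    // [ 0 3 0 ]
    // [ 4 0 5 ]
    MKL_INT rows_start[] = {0, 2, 3};
    MKL_INT rows_end[]   = {2, 3, 5};
    MKL_INT col_indx[]   = {0, 2, 1, 0, 2};
    float   values[]     = {1.f, 2.f, 3.f, 4.f, 5.f};

    sparse_matrix_t A;
    mkl_sparse_s_create_csr(&A, SPARSE_INDEX_BASE_ZERO, 3, 3,
                            rows_start, rows_end, col_indx, values);

    // C = A * A, written as a dense row-major 3x3 matrix.
    std::vector<float> C(3 * 3, 0.f);
    mkl_sparse_s_spmmd(SPARSE_OPERATION_NON_TRANSPOSE, A, A,
                       SPARSE_LAYOUT_ROW_MAJOR, C.data(), 3);

    for (int i = 0; i < 3; ++i)
        printf("%6.1f %6.1f %6.1f\n", C[i * 3], C[i * 3 + 1], C[i * 3 + 2]);

    mkl_sparse_destroy(A);
    return 0;
}
```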
One more thing: this link has more information on Int8 in MKLDNN https://intel.github.io/mkl-dnn/ex_int8_simplenet.html
Thanks, Nathan. Is sparse acceleration something on MKL-DNN's roadmap?
@qigangwang, sparsity in deep neural networks is an area of active research mostly focusing on decreasing the number of model parameters.
With existing Intel MKL-DNN functionality you can take advantage of pruning methods that result in structured sparsity (i.e. that reduce problem dimensions like the number of channels or the filter size).
When it comes to unstructured sparsity (i.e. zeroes in the activations or weights that do not reduce the problem size), there are no industry-standard methods to exploit it for performance gain. Current thinking is that applying sparse data storage formats (like CSR or CSC) to filters or activations is unlikely to result in performance gains over dense formats on Intel processors. We are continuing to monitor research on that topic.
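For context, this is roughly what CSR storage looks like for a small, mostly-zero matrix (values chosen arbitrarily for illustration):

```cpp
// 4x4 matrix with 75% zeroes:
//   [ 5 0 0 0 ]
//   [ 0 0 8 0 ]
//   [ 0 3 0 0 ]
//   [ 0 0 0 6 ]
float values[]      = {5.f, 8.f, 3.f, 6.f}; // non-zero values, row by row
int   col_indices[] = {0, 2, 1, 3};         // column index of each non-zero
int   row_ptr[]     = {0, 1, 2, 3, 4};      // offset of each row's first non-zero, plus total count
```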
Thank you very much, vpirogov.
Does MKL-DNN provide pruning methods?
No, we consider pruning to be users' responsibility.
In a transformer model, the weights have many very small values. I set these values to zero, so the dense weights become sparse weights. Then I used Intel's mkl_sparse_s_spmmd to perform the sparse-by-sparse matrix multiplication, exported this function to a .so, and called the .so from a Python function, but the performance is worse than calling tf.matmul() directly. This confuses me a lot.
As for quantization, I quantized the BERT model, mainly the matmul and biasadd ops, but after quantization the performance also did not improve on Intel(R) Xeon(R) Silver.
@zhoujianqian,
modern CPUs are very good at dense matrix-matrix multiplication and apply a number of techniques like SIMD vectorization that allow multiple floating-point operations to be performed at the same time (16 fp32 multiplications and 16 additions per instruction with Intel AVX-512).
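To illustrate the vectorization point, here is a minimal sketch of a 16-wide fp32 fused multiply-add using AVX-512 intrinsics (assumes an AVX-512-capable CPU and compiler; the dot16 helper name is made up):

```cpp
#include <immintrin.h>

// Dot product of two 16-element f32 vectors: a single _mm512_fmadd_ps
// performs 16 multiplications and 16 additions in one instruction.
float dot16(const float *x, const float *y) {
    __m512 a   = _mm512_loadu_ps(x);
    __m512 b   = _mm512_loadu_ps(y);
    __m512 acc = _mm512_fmadd_ps(a, b, _mm512_setzero_ps());
    return _mm512_reduce_add_ps(acc); // horizontal sum of the 16 lanes
}
```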
Having zeroes in a matrix reduces the number of operations required to perform the computation. However, the computation will benefit less from vectorization, or will not benefit from it at all. To compensate for this, your matrix has to be sparse enough. For matrices that are close to dense (say 50% non-zeroes), the performance of sparse matrix-matrix multiplication may be worse than for dense matrices.
I have two sparse matrices, 13000 x 256 and 256 x 31140, and both of them have 80% sparsity. I want to improve the performance of the multiplication; what methods can I use?
Currently neither MKL nor DNNL/MKL-DNN provides functionality to speed up computations on matrices with such a large fraction of non-zeroes (20%). MKL's sparse methods start to work well when the percentage of non-zeroes is below 10% (and they only support f32, unfortunately).
We are looking into how to accelerate the inference of a sparsified or quantized DNN. For sparsification, we are able to prune 95% of the parameters, and we hope that inference speed can be accelerated accordingly. For quantization, we use the INQ method to quantize the DNN to 5 bits, and we hope that inference speed can also be accelerated accordingly.
Question: Are there APIs in MKL-DNN that we can leverage to achieve these goals?