oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

spgemm and INQ quantization acceleration #427

Closed qigangwang closed 5 years ago

qigangwang commented 5 years ago

We are looking into how to accelerate the inference speed of a sparsified or quantized DNN. For sparsification, we are able to prune 95% of the parameters, and we hope that inference speed improves accordingly. For quantization, we use the INQ method to quantize the DNN to 5 bits, and we likewise hope for a corresponding speedup.

Question: Are there any APIs in MKL-DNN that we can leverage to achieve our goals?

nathan-greeneltch-intel commented 5 years ago

MKL-DNN supports int8 quantization. This example shows int8 usage in MKL-DNN: https://github.com/intel/mkl-dnn/blob/master/examples/simple_net_int8.cpp
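
For orientation, here is a minimal sketch of the symmetric int8 scale computation that such an int8 path relies on. This is plain C++ for illustration only, not the MKL-DNN API; the linked example shows how the resulting scales are then attached to the library's primitives.

```cpp
// Hedged sketch: symmetric int8 quantization of fp32 weights.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Compute a single scale so that the largest |w| maps to 127.
float compute_scale(const std::vector<float> &w) {
    float max_abs = 0.f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs > 0.f ? 127.f / max_abs : 1.f;
}

// Quantize: q = clamp(round(w * scale), -128, 127).
std::vector<int8_t> quantize(const std::vector<float> &w, float scale) {
    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float v = std::round(w[i] * scale);
        q[i] = static_cast<int8_t>(std::max(-128.f, std::min(127.f, v)));
    }
    return q;
}
```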

nathan-greeneltch-intel commented 5 years ago

For sparse work, MKL-DNN does not provide this functionality. You can use the methods in the MKL library: https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines
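
As a rough sketch of what feeding pruned weights to those MKL Sparse BLAS routines involves (assuming the inspector-executor mkl_sparse_s_create_csr entry point; the CSR arrays must outlive the handle, and error handling is omitted):

```cpp
// Hedged sketch: convert a pruned (zero-filled) dense matrix to CSR and
// wrap it in an MKL Sparse BLAS handle.
#include <mkl_spblas.h>
#include <vector>

sparse_matrix_t dense_to_mkl_csr(const std::vector<float> &dense,
                                 MKL_INT rows, MKL_INT cols,
                                 std::vector<MKL_INT> &row_ptr,  // size rows + 1
                                 std::vector<MKL_INT> &col_idx,
                                 std::vector<float> &values) {
    row_ptr.assign(rows + 1, 0);
    col_idx.clear();
    values.clear();
    for (MKL_INT i = 0; i < rows; ++i) {
        for (MKL_INT j = 0; j < cols; ++j) {
            float v = dense[i * cols + j];
            if (v != 0.f) {  // keep only the weights that survived pruning
                col_idx.push_back(j);
                values.push_back(v);
            }
        }
        row_ptr[i + 1] = static_cast<MKL_INT>(values.size());
    }
    sparse_matrix_t handle = nullptr;
    // 4-array CSR form: rows_start = row_ptr, rows_end = row_ptr + 1.
    mkl_sparse_s_create_csr(&handle, SPARSE_INDEX_BASE_ZERO, rows, cols,
                            row_ptr.data(), row_ptr.data() + 1,
                            col_idx.data(), values.data());
    return handle;
}
```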

nathan-greeneltch-intel commented 5 years ago

One more thing: this link has more information on Int8 in MKLDNN https://intel.github.io/mkl-dnn/ex_int8_simplenet.html

qigangwang commented 5 years ago

Thanks, Nathan. Is sparse acceleration on MKL-DNN's roadmap?

vpirogov commented 5 years ago

@qigangwang, sparsity in deep neural networks is an area of active research mostly focusing on decreasing the number of model parameters.

With the existing Intel MKL-DNN functionality you can take advantage of pruning methods that result in structured sparsity (i.e. methods that reduce problem dimensions, such as the number of channels or the filter size).

When it comes to unstructured sparsity (i.e. zeroes in the activations or weights that do not reduce the problem size), there are no industry-standard methods to exploit it for performance gains. Current thinking is that applying sparse data storage formats (like CSR or CSC) to filters or activations is unlikely to result in performance gains over dense formats on Intel processors. We continue to monitor research on that topic.
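
A shapes-only illustration of that distinction (hypothetical struct, not library code): structured pruning shrinks the tensor a dense kernel actually sees, while unstructured pruning leaves the shape, and hence the work, unchanged.

```cpp
#include <cstddef>

struct ConvWeights {
    std::size_t oc, ic, kh, kw;  // output channels, input channels, kernel H/W
};

// Structured pruning: whole output channels are removed, so the tensor
// shrinks and every dense kernel sees a smaller problem.
ConvWeights prune_structured(ConvWeights w, std::size_t channels_removed) {
    w.oc -= channels_removed;    // e.g. 256 -> 64 output channels
    return w;
}

// Unstructured pruning: individual weights are zeroed, but the tensor keeps
// its original shape, so a dense kernel does the same amount of work.
ConvWeights prune_unstructured(ConvWeights w) {
    return w;                    // shape unchanged; only the values change
}
```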

qigangwang commented 5 years ago

Thank you very much, vpirogov.

zhoujianqian commented 4 years ago

Does MKL-DNN provide pruning methods?

rsdubtso commented 4 years ago

No, we consider pruning to be users' responsibility.

zhoujianqian commented 4 years ago

In a transformer model the weights contain many very small values. I set these values to zero, converting the dense weights into sparse weights. I then used Intel's mkl_sparse_s_spmmd to implement sparse-by-sparse matrix multiplication, exported this function to a .so, and called the .so from a Python function, but the performance is worse than calling tf.matmul() directly. This confuses me.

zhoujianqian commented 4 years ago

And for quantization: I quantized the BERT model, mainly the matmul and bias-add operations, but after quantization the performance also does not improve on Intel(R) Xeon(R) Silver.

vpirogov commented 4 years ago

@zhoujianqian,

modern CPUs are very good at dense matrix-matrix multiplication and rely on techniques like SIMD vectorization that allow multiple floating-point operations to be performed at the same time (16 fp32 multiplications and 16 additions per instruction with Intel AVX-512).

Having zeroes in a matrix reduces the number of operations required to perform the computation. However, the computation benefits less from vectorization, or does not benefit from it at all. To compensate for this, your matrix has to be sparse enough. For matrices that are close to dense (say 50% non-zeroes), the performance of sparse matrix-matrix multiplication may be worse than for dense matrices.
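
A back-of-the-envelope sketch of that argument, with illustrative numbers only (a hypothetical 50%-dense square GEMM): halving the operation count cannot make up for losing 16-wide fp32 vectorization.

```cpp
#include <cstdio>

int main() {
    const double m = 1024, k = 1024, n = 1024;     // illustrative GEMM sizes
    const double density = 0.5;                    // 50% non-zeroes, as above
    const double dense_ops  = 2.0 * m * k * n;     // multiply-adds for dense GEMM
    const double sparse_ops = dense_ops * density; // nominal work left after skipping zeroes
    // A dense AVX-512 kernel retires up to 16 fp32 multiply-adds per
    // instruction, while sparse kernels are largely scalar/gather-bound,
    // so a 2x cut in raw operations does not yield a 2x (or any) speedup.
    std::printf("dense ops:  %.3g (vectorized, ~16 fp32 lanes per instruction)\n", dense_ops);
    std::printf("sparse ops: %.3g (%.0f%% of dense, mostly unvectorized)\n",
                sparse_ops, 100.0 * sparse_ops / dense_ops);
    return 0;
}
```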

zhoujianqian commented 4 years ago

I have two sparse matrices, 13000 x 256 and 256 x 31140, and both of them are 80% sparse. I want to improve the performance of the multiplication. What methods can I use?

rsdubtso commented 4 years ago

Currently neither MKL nor DNNL/MKL-DNN provides functionality to speed up computations on matrices with such a large fraction of non-zeroes (20%). MKL's sparse methods start to work well when the percentage of non-zeroes is below 10% (and they only support f32, unfortunately).