avinashcpandey closed this issue 1 year ago
@avinashcpandey, thanks for the question. The best way to implement matrix-matrix multiplication in the context of deep learning models is the matmul primitive. It supports batching, fused bias, fused activation (and more with post-ops), low precision, and weights pre-packing. Check out these examples:
Matrices A and B in matmul are runtime parameters. You can reuse the memory object representing the weights between matmul calls.
Here are details on the other options you listed:
dnnl_sgemm is a BLAS-like API that exists for compatibility purposes. It will go away eventually.

Thanks @vpirogov for the prompt response.
I am trying to simulate a few NLP workloads and evaluate performance on Ice Lake. Since NLP models are dominated by matmuls, I am exploring the different matmul APIs in oneDNN for various sequence lengths and batch sizes. I see there are several offerings in oneDNN. My questions are along these lines:
1. DNNL direct API calls for matrix multiplication (inner product and matmul). Question: to reuse weight tensors during inference, how do I call these two? Any example will help.
2. BLAS-like call dnnl::sgemm (this is the one I am most interested in, if it can extract the best performance from the hardware). Question: is this still the best option for FP32? What about BF16 and INT8 kernels? To reuse weight tensors during inference, how do I call dnnl::sgemm? Do I need to call some reorder API first, followed by dnnl::sgemm with tweaked arguments indicating that it should work with a blocked B matrix?
3. BRGEMM for matrix multiplication. How do I call this API directly from user code? To reuse weight tensors during inference, which APIs do I need to use? brgemm_inner_product_fwd_t(), brgemm_matmul_t
4. BRGCONV for convolution (FP32, BF16, INT8). Is BRGCONV only applicable to 1x1 convolutions, or does it also play a role in the non-1x1 case? I see two instances: CPU_INSTANCE_AVX512(brgemm_1x1_convolution_fwd_t, CPU_INSTANCE_AVX512(brgemm_convolution_fwd_t
Thanks in advance!