avinashcpandey closed this issue 1 year ago
@avinashcpandey, thanks for the question. The best way to implement matrix-matrix multiplication in the context of deep learning models is the matmul primitive. It supports batching, fused bias, fused activation (and more with post-ops), low precision, and weights pre-packing. Check out these examples:
Matrices A and B in matmul are runtime parameters. You can reuse the memory object representing the weights between matmul calls.
Here are details on the other options you listed:
dnnl_sgemm is a BLAS-like API that exists for compatibility purposes. It will go away eventually.

Thanks @vpirogov for the prompt response.
I am trying to simulate a few NLP workloads and evaluate performance on Ice Lake. Since NLP models are dominated by matmuls, I am exploring the different matmul APIs in oneDNN for various sequence lengths and batch sizes. I see there are several offerings in oneDNN. My questions are along these lines:
1. DNNL direct API calls for matrix multiplication (inner product and matmul). Question: to reuse weight tensors during inference, how do I call these two? Any example will help.
2. BLAS-like call dnnl::sgemm (this is the one I am most interested in, if it can extract the best performance from the hardware). Question: is this still the best option for FP32? What about BF16 and INT8 kernels? To reuse weight tensors during inference, how do I call dnnl::sgemm? Do I need to call some reorder API first, followed by dnnl::sgemm with tweaked arguments indicating that it should work with a blocked B matrix?
3. BRGEMM for matrix multiplication. How do I call this API directly from user code? To reuse weight tensors during inference, which APIs do I need to use? brgemm_inner_product_fwd_t(), brgemm_matmul_t
4. BRGCONV for convolution (FP32, BF16, INT8). Is BRGCONV only applicable to 1x1 convolutions, or does it also play a role in the non-1x1 case? I see two instances: CPU_INSTANCE_AVX512(brgemm_1x1_convolution_fwd_t, CPU_INSTANCE_AVX512(brgemm_convolution_fwd_t
Thanks in advance!