vineel96 opened 5 months ago
Hi @vineel96
@mgouicem thanks for the information.
When it comes to lossless compression, oneDNN supports sparse tensors with two packing schemes: CSR and an opaque packed format.
Another compression approach involves converting one of the matrices to low precision (int8 or int4). This is a lossy compression method, but it's widely used to accelerate large language models. See the GPT-Q RFC and the matmul weights decompression example for more details.
If you have a proof of concept for other techniques please share.
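To make the decompression idea above concrete, here is a minimal plain-C++ sketch (not the oneDNN API; the function name, layouts, and per-output-channel scaling are illustrative assumptions) of a matmul whose int8 weights are dequantized on the fly:

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch (not the oneDNN API): fp32 activations multiplied by int8
// weights that are dequantized on the fly with per-output-channel scales,
// which is the basic idea behind matmul weight decompression.
void matmul_s8_decompress(const std::vector<float> &src,      // M x K activations
                          const std::vector<int8_t> &wei_s8,  // K x N compressed weights
                          const std::vector<float> &scales,   // N per-channel scales
                          std::vector<float> &dst,            // M x N output
                          int M, int K, int N) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) {
                // Decompress one weight: int8 -> fp32 via its channel scale.
                float w = static_cast<float>(wei_s8[k * N + n]) * scales[n];
                acc += src[m * K + k] * w;
            }
            dst[m * N + n] = acc;
        }
}
```

The point of the technique is that the weights stay compressed in memory and are expanded only inside the compute loop, cutting memory footprint and bandwidth roughly in proportion to the precision reduction.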
Hi @vpirogov, Thanks for the reply.
Does COO give a performance boost only if the DL model uses sparse tensors? Most of the DL models from Hugging Face do not use sparse tensors, right? Also, is anyone already implementing this work, or can anyone take it up to complete?
Sparse storage formats like COO and CSR require relatively high levels of zeroes in the weights tensor to bring performance value. Think ~80% or more. In the context of LLMs, hardware-specific weight compression on SPR has been demonstrated by OpenVINO to bring performance benefits on some models.
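To see why the zero fraction matters, here is a minimal sketch of CSR packing (plain C++, not the oneDNN sparse API; COO is similar but stores an explicit row index per non-zero instead of row offsets):

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of CSR packing: only non-zero values are stored, so the
// benefit scales with the fraction of zeroes in the matrix.
struct CsrMatrix {
    std::vector<float> values;    // nnz non-zero values
    std::vector<int32_t> columns; // column index of each stored value
    std::vector<int32_t> row_ptr; // rows + 1 offsets into values/columns
};

CsrMatrix dense_to_csr(const std::vector<float> &dense, int rows, int cols) {
    CsrMatrix csr;
    csr.row_ptr.push_back(0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float v = dense[r * cols + c];
            if (v != 0.f) {
                csr.values.push_back(v);
                csr.columns.push_back(c);
            }
        }
        csr.row_ptr.push_back(static_cast<int32_t>(csr.values.size()));
    }
    return csr;
}
```

With ~80% zeroes, `values` and `columns` together hold roughly 20% of the original entries, which is where the bandwidth and compute savings come from; at lower sparsity the index overhead and irregular access patterns can erase the gain.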
- Also, is anyone working on GPT-Quantization (GPT-Q) enablement? If not, can anyone take it up to complete?
GPT-Q is supported in oneDNN v3.5, which has int8 weights support on CPUs with Intel AMX and int8/int4 support on Intel GPUs. See release notes.
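As background on what GPT-Q-style int4 weights look like in memory, here is a small sketch of unpacking one 4-bit weight with a per-group scale and zero point (plain C++; the nibble order, group layout, and function name are illustrative assumptions, not the oneDNN API):

```cpp
#include <cstdint>
#include <vector>

// Sketch of GPT-Q-style int4 weight storage: two 4-bit codes are packed per
// byte and dequantized with a per-group scale and zero point.
float dequant_u4(const std::vector<uint8_t> &packed, // ceil(count / 2) bytes
                 const std::vector<float> &scales,   // one scale per group
                 const std::vector<uint8_t> &zeros,  // one zero point per group
                 int idx, int group_size) {
    uint8_t byte = packed[idx / 2];
    // Assumed layout: even indices in the low nibble, odd in the high nibble.
    uint8_t code = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    int g = idx / group_size;
    return (static_cast<int>(code) - static_cast<int>(zeros[g])) * scales[g];
}
```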
- Currently I am looking into lossless compression methods for matrix multiplication, where we either reduce the total FLOPs of the matmul or increase FLOPs per second as it scales across the number of threads. I am looking for papers in this regard and found one (https://arxiv.org/pdf/2203.14540), but I don't know how it helps. Any links/papers/suggestions in this regard are appreciated.
The paper seems to offer a generalization of the CSR format that exploits data duplication in addition to dropping zeroes. I would expect it to be a marginal improvement over classical CSR storage schemes.
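One possible reading of that idea, sketched as a data structure (this is an interpretation for illustration, not the paper's exact scheme): store each distinct non-zero value once and have the CSR payload reference it by index.

```cpp
#include <cstdint>
#include <vector>

// Sketch of a CSR variant with a deduplicated value table: repeated non-zero
// values are stored once and referenced by index, trading an extra
// indirection for smaller storage when values repeat often.
struct DedupCsr {
    std::vector<float> unique_values; // distinct non-zero values
    std::vector<int32_t> value_ids;   // per non-zero: index into unique_values
    std::vector<int32_t> columns;     // per non-zero: column index
    std::vector<int32_t> row_ptr;     // rows + 1 offsets into value_ids/columns
};
```

Compared to classical CSR this saves space only when many non-zero values repeat, and every access pays one more indirection, which is consistent with expecting marginal gains.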
Thanks @vpirogov
@vineel96, we are working on initial COO support and a GPU-optimized version for the next release, oneDNN v3.6. Feel free to contribute an optimized implementation for your favorite platform.
Choosing an algorithm based on the input matrix structure is possible with oneDNN, though this check would have to be done during primitive execution and may negatively affect performance.
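A hypothetical illustration of such a runtime check (plain C++; the dispatch functions in the comment are placeholders, not oneDNN calls):

```cpp
#include <vector>

// Hypothetical dispatch sketch: inspect the matrix at execution time and pick
// a sparse or dense kernel. The scan itself is O(rows * cols), which is the
// overhead mentioned above.
bool is_sparse_enough(const std::vector<float> &mat, double threshold = 0.8) {
    std::size_t zeros = 0;
    for (float v : mat)
        if (v == 0.f) ++zeros;
    return mat.empty() ? false
                       : static_cast<double>(zeros) / mat.size() >= threshold;
}

// void run_matmul(const std::vector<float> &wei /*, ... */) {
//     if (is_sparse_enough(wei))
//         run_sparse_kernel(/* ... */);  // hypothetical sparse path
//     else
//         run_dense_kernel(/* ... */);   // hypothetical dense path
// }
```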
@vpirogov Are COO support and GPT-Q support available for Arm CPUs? Or should someone take up this task?
No, neither of these are supported on Arm CPUs.
@jondea, @milpuz01, do you have anything in the works for GPT-Q or sparse formats?
@vpirogov, has anyone worked on and completed this RFC: https://github.com/oneapi-src/oneDNN/tree/rfcs/rfcs/20200812-multidim-matmul, or does it still need to be implemented? Does option 1 give any performance boost compared to BRGEMM? Any comments on this RFC?
@vineel96, the RFC you are referring to is related to the API. We extended the matmul primitive on all platforms to support the batched case. Performance is implementation specific, and a comparison to BRGEMM makes no sense here.
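For reference, the batched semantics that the extended matmul primitive exposes can be written as a straightforward sketch (plain C++, illustrative only): for every batch index b, dst[b] = src[b] × wei[b].

```cpp
#include <vector>

// Sketch of batched matmul semantics: each batch slice is an independent
// M x K by K x N multiplication.
void batched_matmul(const std::vector<float> &src, // B x M x K
                    const std::vector<float> &wei, // B x K x N
                    std::vector<float> &dst,       // B x M x N
                    int B, int M, int K, int N) {
    for (int b = 0; b < B; ++b)
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n) {
                float acc = 0.f;
                for (int k = 0; k < K; ++k)
                    acc += src[(b * M + m) * K + k] * wei[(b * K + k) * N + n];
                dst[(b * M + m) * N + n] = acc;
            }
}
```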
Hello,