vineel96 opened 5 months ago
Hi @vineel96
@mgouicem thanks for the information.
When it comes to lossless compression, oneDNN supports sparse tensors with two packing schemes: CSR and an opaque packed format.
Another compression approach involves converting one of the matrices to low precision (int8 or int4). This is a lossy compression method, but it's widely used to accelerate large language models. See the GPT-Q RFC and the matmul weights decompression example for more details.
If you have a proof of concept for other techniques please share.
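To make the decompression idea above concrete, here is a minimal plain-C++ sketch (not the oneDNN API; the function name, layouts, and per-output-channel scaling are illustrative assumptions) of a matmul whose int8 weights are dequantized on the fly:

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch (not the oneDNN API): fp32 activations multiplied by int8
// weights that are dequantized on the fly with per-output-channel scales,
// which is the basic idea behind matmul weight decompression.
void matmul_s8_decompress(const std::vector<float> &src,      // M x K activations
                          const std::vector<int8_t> &wei_s8,  // K x N compressed weights
                          const std::vector<float> &scales,   // N per-channel scales
                          std::vector<float> &dst,            // M x N output
                          int M, int K, int N) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) {
                // Decompress one weight: int8 -> fp32 via its channel scale.
                float w = static_cast<float>(wei_s8[k * N + n]) * scales[n];
                acc += src[m * K + k] * w;
            }
            dst[m * N + n] = acc;
        }
}
```

The point of the technique is that the weights stay compressed in memory and are expanded only inside the compute loop, cutting memory footprint and bandwidth roughly in proportion to the precision reduction.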
Hi @vpirogov, Thanks for the reply.
Does COO give a performance boost only if the DL model uses sparse tensors? Most of the DL models from Hugging Face do not use sparse tensors, right? Also, is anyone already implementing this work, or can anyone take it up to complete?
Sparse storage formats like COO and CSR require relatively high levels of zeroes in the weights tensor to bring performance value. Think ~80% or more. In the context of LLMs, hardware-specific weight compression on SPR has been demonstrated by OpenVINO to bring performance benefits on some models.
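To see why the zero fraction matters, here is a minimal sketch of CSR packing (plain C++, not the oneDNN sparse API; COO is similar but stores an explicit row index per non-zero instead of row offsets):

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of CSR packing: only non-zero values are stored, so the
// benefit scales with the fraction of zeroes in the matrix.
struct CsrMatrix {
    std::vector<float> values;    // nnz non-zero values
    std::vector<int32_t> columns; // column index of each stored value
    std::vector<int32_t> row_ptr; // rows + 1 offsets into values/columns
};

CsrMatrix dense_to_csr(const std::vector<float> &dense, int rows, int cols) {
    CsrMatrix csr;
    csr.row_ptr.push_back(0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float v = dense[r * cols + c];
            if (v != 0.f) {
                csr.values.push_back(v);
                csr.columns.push_back(c);
            }
        }
        csr.row_ptr.push_back(static_cast<int32_t>(csr.values.size()));
    }
    return csr;
}
```

With ~80% zeroes, `values` and `columns` together hold roughly 20% of the original entries, which is where the bandwidth and compute savings come from; at lower sparsity the index overhead and irregular access patterns can erase the gain.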
- Also, is anyone working on GPT-Quantization (GPT-Q) enablement? If not, can anyone take it up to complete?
GPT-Q is supported in oneDNN v3.5, which has int8 weights support on CPUs with Intel AMX and int8/int4 support on Intel GPUs. See release notes.
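As background on what GPT-Q-style int4 weights look like in memory, here is a small sketch of unpacking one 4-bit weight with a per-group scale and zero point (plain C++; the nibble order, group layout, and function name are illustrative assumptions, not the oneDNN API):

```cpp
#include <cstdint>
#include <vector>

// Sketch of GPT-Q-style int4 weight storage: two 4-bit codes are packed per
// byte and dequantized with a per-group scale and zero point.
float dequant_u4(const std::vector<uint8_t> &packed, // ceil(count / 2) bytes
                 const std::vector<float> &scales,   // one scale per group
                 const std::vector<uint8_t> &zeros,  // one zero point per group
                 int idx, int group_size) {
    uint8_t byte = packed[idx / 2];
    // Assumed layout: even indices in the low nibble, odd in the high nibble.
    uint8_t code = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    int g = idx / group_size;
    return (static_cast<int>(code) - static_cast<int>(zeros[g])) * scales[g];
}
```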
- Currently I am looking into lossless compression methods for matrix multiplication, where we either reduce the total FLOPs of the matmul or increase FLOPs per second as it scales across the number of threads. I am looking for papers in this regard and found one (https://arxiv.org/pdf/2203.14540), but I don't know how it helps. Any links/papers/suggestions in this regard are appreciated.
The paper seems to offer a generalization of the CSR format that exploits data duplication in addition to dropping zeroes. I would expect it to be a marginal improvement over classical CSR storage schemes.
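One possible reading of that idea, sketched as a data structure (this is an interpretation for illustration, not the paper's exact scheme): store each distinct non-zero value once and have the CSR payload reference it by index.

```cpp
#include <cstdint>
#include <vector>

// Sketch of a CSR variant with a deduplicated value table: repeated non-zero
// values are stored once and referenced by index, trading an extra
// indirection for smaller storage when values repeat often.
struct DedupCsr {
    std::vector<float> unique_values; // distinct non-zero values
    std::vector<int32_t> value_ids;   // per non-zero: index into unique_values
    std::vector<int32_t> columns;     // per non-zero: column index
    std::vector<int32_t> row_ptr;     // rows + 1 offsets into value_ids/columns
};
```

Compared to classical CSR this saves space only when many non-zero values repeat, and every access pays one more indirection, which is consistent with expecting marginal gains.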
Thanks @vpirogov
@vineel96, we are working on initial COO support and a GPU-optimized version for the next release, oneDNN v3.6. Feel free to contribute an optimized implementation for your favorite platform.
Choosing an algorithm based on the input matrix structure is possible with oneDNN, though this check would have to be done during primitive execution and may negatively affect performance.
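A hypothetical illustration of such a runtime check (plain C++; the dispatch functions in the comment are placeholders, not oneDNN calls):

```cpp
#include <vector>

// Hypothetical dispatch sketch: inspect the matrix at execution time and pick
// a sparse or dense kernel. The scan itself is O(rows * cols), which is the
// overhead mentioned above.
bool is_sparse_enough(const std::vector<float> &mat, double threshold = 0.8) {
    std::size_t zeros = 0;
    for (float v : mat)
        if (v == 0.f) ++zeros;
    return mat.empty() ? false
                       : static_cast<double>(zeros) / mat.size() >= threshold;
}

// void run_matmul(const std::vector<float> &wei /*, ... */) {
//     if (is_sparse_enough(wei))
//         run_sparse_kernel(/* ... */);  // hypothetical sparse path
//     else
//         run_dense_kernel(/* ... */);   // hypothetical dense path
// }
```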
@vpirogov Are COO support and GPT-Q support available for Arm CPUs? Or should someone take up this task?
No, neither of these are supported on Arm CPUs.
@jondea, @milpuz01, do you have anything in the works for GPT-Q or sparse formats?
@vpirogov, has anyone worked on and completed this RFC: https://github.com/oneapi-src/oneDNN/tree/rfcs/rfcs/20200812-multidim-matmul, or does it still need to be implemented? Does option 1 give any performance boost compared to BRGEMM? Any comments on this RFC?
@vineel96, the RFC you are referring to is related to the API. We extended the matmul primitive on all platforms to support the batched case. Performance is implementation specific, and a comparison to BRGEMM makes no sense here.
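For reference, the batched semantics that the extended matmul primitive exposes can be written as a straightforward sketch (plain C++, illustrative only): for every batch index b, dst[b] = src[b] × wei[b].

```cpp
#include <vector>

// Sketch of batched matmul semantics: each batch slice is an independent
// M x K by K x N multiplication.
void batched_matmul(const std::vector<float> &src, // B x M x K
                    const std::vector<float> &wei, // B x K x N
                    std::vector<float> &dst,       // B x M x N
                    int B, int M, int K, int N) {
    for (int b = 0; b < B; ++b)
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n) {
                float acc = 0.f;
                for (int k = 0; k < K; ++k)
                    acc += src[(b * M + m) * K + k] * wei[(b * K + k) * N + n];
                dst[(b * M + m) * N + n] = acc;
            }
}
```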
Hello,