WilliamTambellini closed this issue 5 months ago.
This is on our radar, particularly in the context of large language models. As the references indicate, this area is an active field of research, with much of the work focused on techniques that minimize accuracy loss. Are there any specific quantization approaches and usage models you have in mind? Anything validated in a production setting?
> Are there any specific quantization approaches and usage models you have in mind?
Cannot legally say much, but there are already some open-source LLM quantizers, e.g. https://github.com/PanQiWei/AutoGPTQ, though they may require some calibration samples to quantize on. Be aware of static vs. dynamic quantization. Most Transformer decoders would do the job for you to test, e.g.: https://huggingface.co/Qwen/Qwen-7B-Chat-Int4#%E9%87%8F%E5%8C%96-quantization
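For context, a minimal sketch of such a run, assuming AutoGPTQ's documented quickstart API; the model name and calibration text are placeholders, and real runs use many more calibration samples:

```python
# A minimal sketch, assuming AutoGPTQ's quickstart API;
# model name and calibration text are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ is a *static* method: it needs calibration samples
# (real runs use dozens to hundreds, not one).
examples = [tokenizer(
    "oneDNN is a performance library for deep learning applications.",
    return_tensors="pt",
)]

quantize_config = BaseQuantizeConfig(
    bits=4,          # int4 weights
    group_size=128,  # one scale per 128 weights along the K dimension
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)               # consumes the calibration samples
model.save_quantized("opt-125m-4bit")  # placeholder output path
```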
> Anything validated in a production setting?
Cannot reply publicly, but as long as the perplexity after quantization is "close" to the one measured with bf16, you should be good.
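A minimal sketch of such a perplexity check, assuming HuggingFace transformers plus torch; the held-out text file is a placeholder, and the Qwen checkpoints are used only as an example:

```python
# A minimal sketch, assuming HuggingFace transformers + torch;
# the evaluation text is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # labels=input_ids makes the model return the mean causal-LM loss
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

text = open("heldout.txt").read()  # placeholder held-out text
tok = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat",
                                    trust_remote_code=True)

bf16 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True)
int4 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True)

print("bf16 ppl:", perplexity(bf16, tok, text))
print("int4 ppl:", perplexity(int4, tok, text))  # "close" to bf16 => good
```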
Could you confirm that Sapphire Rapids CPUs (4th gen Xeon) do not have any hardware support (AMX, etc.) for s4 or u4 math?
@WilliamTambellini, you don't necessarily need s4/u4 math to take advantage of low precision. I believe most viable use cases focus on using s4/u4 as a storage format for weights, with the math being done in int8 or fp16. So for oneDNN the question effectively boils down to which quantization schemes for these data types would be viable.
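To illustrate the idea, a minimal numpy sketch of that scheme: s4 used purely as a storage format (two nibbles per byte), unpacked to int8 for the actual math, with a floating-point scale applied at the end. All names and the per-tensor scale are illustrative, not oneDNN API:

```python
# A minimal numpy sketch, not oneDNN API: s4 as a storage format only.
import numpy as np

def pack_s4(w4):
    """Pack int8 values in [-8, 7] as two 4-bit nibbles per byte."""
    lo = (w4[0::2] & 0x0F).astype(np.uint8)
    hi = (w4[1::2] & 0x0F).astype(np.uint8)
    return (hi << 4) | lo                # half the footprint of int8

def unpack_s4(packed):
    """Unpack to int8 (the 'math' type), sign-extending each nibble."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo[lo > 7] -= 16
    hi[hi > 7] -= 16
    return np.stack([lo, hi], axis=-1).reshape(-1)

K, N = 64, 8
w4 = np.random.randint(-8, 8, size=K * N, dtype=np.int8)
scale = 0.05                             # per-tensor scale, for brevity
packed = pack_s4(w4)                     # what actually sits in memory

x = np.random.randint(-128, 128, size=(1, K), dtype=np.int8)
w = unpack_s4(packed).reshape(K, N)      # widen s4 -> int8
acc = x.astype(np.int32) @ w.astype(np.int32)  # int8 math, int32 accum
y = acc.astype(np.float32) * scale       # apply the scale at the end
```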
Hi @WilliamTambellini, yes, we are aware of int4 support in OpenVINO. The following RFCs target GPT-Q support in oneDNN:
API, validation, and GPU optimizations for int4 landed in the main branch, targeting oneDNN v3.5.
Thanks @vpirogov
Summary
Add support for INT4 and/or UINT4.
Refs:
- https://intellabs.github.io/distiller/quantization.html
- https://developer.nvidia.com/blog/int4-for-ai-inference/
- https://arxiv.org/abs/2301.12017
- https://arxiv.org/pdf/2306.11987.pdf
- https://www.xilinx.com/support/documents/white_papers/wp521-4bit-optimization.pdf
Problem statement
Fast low-precision (4-bit) quantized matmul.
Preferred solution
A new oneDNN data type and at least a quantized matmul primitive (no need for full arithmetic/math support); a sketch of the expected semantics follows below.
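As a reference for the requested semantics, a minimal numpy sketch of a GPT-Q-style group-wise u4 quantized matmul; the group size, the asymmetric scheme, and all function names are assumptions for illustration, not a proposed oneDNN API:

```python
# A minimal numpy sketch of group-wise u4 "quantmatmul" semantics;
# group size and names are assumptions, not oneDNN API.
import numpy as np

def quantize_u4_groupwise(w, group=32):
    """Per-group asymmetric quantization of fp32 weights to u4 along K."""
    K, N = w.shape
    wg = w.reshape(K // group, group, N)
    lo = wg.min(axis=1, keepdims=True)
    hi = wg.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)   # u4 range is [0, 15]
    zp = np.round(-lo / scale)                   # per-group zero point
    q = np.clip(np.round(wg / scale + zp), 0, 15)
    return q.astype(np.uint8), scale, zp

def quant_matmul(x, q, scale, zp, group=32):
    """y = x @ dequant(q): the operation the issue asks oneDNN to provide."""
    K, N = q.shape[0] * group, q.shape[2]
    w = ((q.astype(np.float32) - zp) * scale).reshape(K, N)
    return x @ w

x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 16).astype(np.float32)
q, scale, zp = quantize_u4_groupwise(w)
# quantization error should stay small relative to the fp32 result
print(np.abs(quant_matmul(x, q, scale, zp) - x @ w).max())
```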