WilliamTambellini closed this issue 5 months ago.
This is on our radar, particularly in the context of large language models. As the references indicate, this area is an active field of research, with much of the work focused on techniques that minimize accuracy loss. Are there any specific quantization approaches and usage models you have in mind? Anything validated in a production setting?
> Are there any specific quantization approaches and usage models you have in mind?
Cannot legally say much, but there are already some open-source LLM quantizers, e.g. https://github.com/PanQiWei/AutoGPTQ, though they may require some calibration samples to quantize on. Be aware of static vs. dynamic quantization. Most Transformer decoders would do the job for you to test, e.g.: https://huggingface.co/Qwen/Qwen-7B-Chat-Int4#%E9%87%8F%E5%8C%96-quantization
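For context, a minimal sketch of such a run, assuming AutoGPTQ's documented quickstart API; the model name and calibration text are placeholders, and real runs use many more calibration samples:

```python
# A minimal sketch, assuming AutoGPTQ's quickstart API;
# model name and calibration text are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ is a *static* method: it needs calibration samples
# (real runs use dozens to hundreds, not one).
examples = [tokenizer(
    "oneDNN is a performance library for deep learning applications.",
    return_tensors="pt",
)]

quantize_config = BaseQuantizeConfig(
    bits=4,          # int4 weights
    group_size=128,  # one scale per 128 weights along the K dimension
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)               # consumes the calibration samples
model.save_quantized("opt-125m-4bit")  # placeholder output path
```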
> Anything validated in a production setting?
Cannot reply publicly, but as long as the perplexity after quantization is "close" to the one measured with bf16, you should be good.
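A minimal sketch of such a perplexity check, assuming HuggingFace transformers plus torch; the held-out text file is a placeholder, and the Qwen checkpoints are used only as an example:

```python
# A minimal sketch, assuming HuggingFace transformers + torch;
# the evaluation text is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # labels=input_ids makes the model return the mean causal-LM loss
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

text = open("heldout.txt").read()  # placeholder held-out text
tok = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat",
                                    trust_remote_code=True)

bf16 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True)
int4 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True)

print("bf16 ppl:", perplexity(bf16, tok, text))
print("int4 ppl:", perplexity(int4, tok, text))  # "close" to bf16 => good
```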
Could you confirm that Sapphire Rapids CPUs (4th gen Xeon) do not have any hardware support (AMX, etc.) for s4 or u4 math?
@WilliamTambellini, you don't necessarily need s4/u4 math to take advantage of low precision. I believe most viable use cases focus on using s4/u4 as a storage format for weights, with the math being done in int8 or fp16. So for oneDNN the question effectively boils down to which quantization schemes for these data types would be viable.
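To illustrate the idea, a minimal numpy sketch of that scheme: s4 used purely as a storage format (two nibbles per byte), unpacked to int8 for the actual math, with a floating-point scale applied at the end. All names and the per-tensor scale are illustrative, not oneDNN API:

```python
# A minimal numpy sketch, not oneDNN API: s4 as a storage format only.
import numpy as np

def pack_s4(w4):
    """Pack int8 values in [-8, 7] as two 4-bit nibbles per byte."""
    lo = (w4[0::2] & 0x0F).astype(np.uint8)
    hi = (w4[1::2] & 0x0F).astype(np.uint8)
    return (hi << 4) | lo                # half the footprint of int8

def unpack_s4(packed):
    """Unpack to int8 (the 'math' type), sign-extending each nibble."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo[lo > 7] -= 16
    hi[hi > 7] -= 16
    return np.stack([lo, hi], axis=-1).reshape(-1)

K, N = 64, 8
w4 = np.random.randint(-8, 8, size=K * N, dtype=np.int8)
scale = 0.05                             # per-tensor scale, for brevity
packed = pack_s4(w4)                     # what actually sits in memory

x = np.random.randint(-128, 128, size=(1, K), dtype=np.int8)
w = unpack_s4(packed).reshape(K, N)      # widen s4 -> int8
acc = x.astype(np.int32) @ w.astype(np.int32)  # int8 math, int32 accum
y = acc.astype(np.float32) * scale       # apply the scale at the end
```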
Hi @WilliamTambellini, yes, we are aware of int4 support in OpenVINO. The following RFCs target GPT-Q support in oneDNN:
API, validation, and GPU optimizations for int4 landed in the main branch, targeting oneDNN v3.5.
Thanks @vpirogov
Summary
Add support for INT4 and/or UINT4.
Refs:
- https://intellabs.github.io/distiller/quantization.html
- https://developer.nvidia.com/blog/int4-for-ai-inference/
- https://arxiv.org/abs/2301.12017
- https://arxiv.org/pdf/2306.11987.pdf
- https://www.xilinx.com/support/documents/white_papers/wp521-4bit-optimization.pdf
Problem statement
Fast low-precision (4-bit) quantized matmul.
Preferred solution
A new oneDNN data type and at least a quantized matmul primitive (no need for full arithmetic/math support); a sketch of the expected semantics follows below.
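As a reference for the requested semantics, a minimal numpy sketch of a GPT-Q-style group-wise u4 quantized matmul; the group size, the asymmetric scheme, and all function names are assumptions for illustration, not a proposed oneDNN API:

```python
# A minimal numpy sketch of group-wise u4 "quantmatmul" semantics;
# group size and names are assumptions, not oneDNN API.
import numpy as np

def quantize_u4_groupwise(w, group=32):
    """Per-group asymmetric quantization of fp32 weights to u4 along K."""
    K, N = w.shape
    wg = w.reshape(K // group, group, N)
    lo = wg.min(axis=1, keepdims=True)
    hi = wg.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)   # u4 range is [0, 15]
    zp = np.round(-lo / scale)                   # per-group zero point
    q = np.clip(np.round(wg / scale + zp), 0, 15)
    return q.astype(np.uint8), scale, zp

def quant_matmul(x, q, scale, zp, group=32):
    """y = x @ dequant(q): the operation the issue asks oneDNN to provide."""
    K, N = q.shape[0] * group, q.shape[2]
    w = ((q.astype(np.float32) - zp) * scale).reshape(K, N)
    return x @ w

x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 16).astype(np.float32)
q, scale, zp = quantize_u4_groupwise(w)
# quantization error should stay small relative to the fp32 result
print(np.abs(quant_matmul(x, q, scale, zp) - x @ w).max())
```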