microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Feature Request] 4bit and 2bit and 1bit quantization support #14997

Open elephantpanda opened 1 year ago

elephantpanda commented 1 year ago

Describe the feature request

Support for quantizing and running quantized models in 4-bit, 2-bit, and 1-bit. Also, saving and loading these models in ONNX format for lower file sizes.

The GPU doesn't necessarily have to support 4-bit operations, since it can just use GPU cores to convert them to float or int8 operations when needed.

Describe scenario use case

Some models, such as large language models, are very big but run fairly well when quantized down to 8-bit, 4-bit, 2-bit, or even 1-bit.

jchen351 commented 1 year ago

Hi Pauldog, thanks for reaching out. We have received your message and have put these requests under consideration!

Thank you for your time,

Jian Chen (not an A.I.)

jchen351 commented 1 year ago

Also, could you please provide more information about your scenarios, such as the hardware you want to run on and the models you are interested in? Our current priority is fp16 support, and we don't have any hardware that supports 4-bit or lower.

elephantpanda commented 1 year ago

Sure, here is a very recent example of a practical use case:

Llama 4bit

As far as I'm aware it doesn't require 4-bit hardware; it simply stores the weights on the GPU in 4-bit, then uses GPU cores at runtime to convert them to int8 or float16 for the calculations.

The main benefit is the ability to run larger models on the same hardware.

Use cases would be running large language models (like the Llama example above) locally on consumer hardware.

Here are some papers

https://arxiv.org/abs/1810.05723 https://arxiv.org/abs/2202.05292

and articles https://karanbirchahal.medium.com/aggressive-quantization-how-to-run-mnist-on-a-4-bit-neural-net-using-pytorch-5703f3faa599

Now, I don't know whether onnxruntime can already support this or not, since technically a 4-bit quantized model would presumably look like an 8-bit quantized model, as two 4-bit values are packed into one 8-bit value.
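To make the packing idea concrete, here is a rough numpy sketch of storing two 4-bit weights per uint8 and dequantizing them back to float at runtime (illustrative only; the function names and the single per-tensor scale are assumptions, not an onnxruntime API):

```python
import numpy as np

def pack_int4(weights_int4: np.ndarray) -> np.ndarray:
    """Pack pairs of signed 4-bit values (range [-8, 7]) into one uint8 each.
    Assumes an even number of weights."""
    w = (weights_int4.astype(np.int8) & 0x0F).astype(np.uint8)
    return (w[0::2] | (w[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray, scale: float) -> np.ndarray:
    """Unpack back to signed 4-bit integers and dequantize to float32."""
    lo = (packed & 0x0F).astype(np.int16)
    hi = ((packed >> 4) & 0x0F).astype(np.int16)
    # Sign-extend the 4-bit two's complement values.
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int16)
    out[0::2], out[1::2] = lo, hi
    return out.astype(np.float32) * scale
```

For a 4096x4096 fp32 weight matrix (64 MB), the packed form is about 8 MB plus the quantization scales.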

josephrocca commented 1 year ago

Hey @jchen351, I'm wondering why this is closed? Shouldn't it stay open if this is being considered?

The WebML ecosystem in particular could really do with a 4-bit quantization solution, since model size is such an important factor on the web.

xenova commented 1 year ago

100% agree with @josephrocca. 4-bit quantization would be massive for my Transformers.js library (and other WebML libraries)!

jchen351 commented 1 year ago

@xenova @josephrocca The only hardware we know of that supports 4-bit quantization with a performance gain is the Nvidia A100, but we cannot get our hands on enough A100s, and the newer H100 has dropped that support. We don't foresee any performance gain from 4-bit quantization on any other popular hardware. So, until then, I will keep this closed :)

xenova commented 1 year ago

This repo supports 4-bit quantization: https://github.com/ggerganov/llama.cpp (And, as stated in the README, it runs on the CPU)

Also, considering that WASM uses a 32-bit address space (i.e., max 4GB), the only real way to get large models running on consumer hardware is quantization.
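To put rough numbers on that (a back-of-envelope sketch counting weights only, ignoring activations, KV cache, and runtime overhead):

```python
# Approximate weight-only memory footprint of a 7B-parameter model.
params = 7_000_000_000
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: {gib:.1f} GiB")
# fp32 ~26.1, fp16 ~13.0, int8 ~6.5, int4 ~3.3 GiB -- only the 4-bit
# variant fits under the ~4 GiB address-space limit of 32-bit Wasm.
```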

josephrocca commented 1 year ago

@jchen351, yes, as xenova pointed out, this is more about running large models on hardware that has a small amount of memory, rather than performance improvements.

For example, please see this demo of Llama 7B running on a Pixel 5 at 1 token/sec using 4-bit quantization: https://twitter.com/ggerganov/status/1635605532726681600

So this issue can probably be re-opened considering it is viable to gain this benefit without hardware support? llama.cpp has grown faster than the original stable diffusion repo (which was one of the fastest growing of all time) because it allows people to run big models on small hardware -- there's definitely demand for this! :)

skyne98 commented 1 year ago

@jchen351, can we have a second look at this? It's not really about performance, but rather allowing running models in places they couldn't before. I insist!

It just seems like the points these guys made, which are really valid, got plainly ignored.

tikikun commented 1 year ago

Please re-open; everyone is using 4-bit and 5-bit quantization now.

jywu-msft commented 1 year ago

Re-opening this. This should not be closed.

jywu-msft commented 1 year ago

+@yufenglee FYI

ThisisBillhe commented 1 year ago

Hi everyone! I have successfully quantized a diffusion model to 2-bit and manually packed the weights into uint8 format (storing four 2-bit weights in one uint8 variable) in PyTorch. During inference, they are unpacked to float format for calculation. In this way, the model size has been reduced from 1545 MB to 150 MB, and the VRAM needed to load the model is also greatly reduced (from 2500 MB to 1000 MB) in PyTorch. However, when I export the model to ONNX, only the model size is reduced (to around 190 MB); the VRAM needed to load the model can still reach 3000 MB. I guess the uint8 parameters are cast to int32 or float32 while loading the ONNX model.

Any ideas on how to lower the VRAM needed to load this ONNX model? I have uploaded the model to Google Drive.
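For readers following along, the kind of 2-bit pack/unpack described above can be sketched in PyTorch roughly like this (illustrative only; not the actual code behind the model linked above, and the per-tensor scale/zero-point are assumptions):

```python
import torch

def pack_2bit(codes: torch.Tensor) -> torch.Tensor:
    """Pack four 2-bit codes (values 0..3) into each uint8.
    Assumes the number of codes is a multiple of 4."""
    q = codes.to(torch.uint8).reshape(-1, 4)
    return q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)

def unpack_2bit(packed: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    """Unpack on the fly and dequantize: (code - zero_point) * scale."""
    codes = torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=-1).reshape(-1)
    return (codes.float() - zero_point) * scale
```

Whether the exported ONNX graph keeps the packed uint8 initializer plus the unpack subgraph, or constant-folds the dequantized float weights back in, could explain the VRAM difference observed above.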

elephantpanda commented 1 year ago

> Hi everyone! I have successfully quantized a diffusion model to 2-bit and manually packed the weights into uint8 format [...] Any ideas on how to lower the VRAM needed to load this ONNX model?

2-bit diffusion model? Does it actually produce images?

Guess you could try packing sixteen 2-bit values into an int32.

ThisisBillhe commented 1 year ago

> 2-bit diffusion model? Does it actually produce images?
>
> Guess you could try packing sixteen 2-bit values into an int32.

The work is in progress. I guess you make a good point; I will give it a try.

dfiru commented 9 months ago

Are there any branches or forks with the 2 x 4-bit packing?

josephrocca commented 9 months ago

I noticed this point in the v1.16.0 release notes (3 weeks ago):

> Support 4-bit quantization on CPU

I haven't tried it yet. @xenova I'm curious if you've tried this yet with the Web Wasm backend?

dfiru commented 9 months ago

QuantType still doesn't include it: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quant_utils.py#L71
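For anyone who wants to experiment: the 4-bit support in 1.16 appears to be exposed through a dedicated weight-only MatMul quantizer rather than a new QuantType entry. A minimal sketch, assuming onnxruntime >= 1.16 (the file paths and block size are placeholders, and the constructor options may have changed since):

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Block-wise weight-only quantization of MatMul weights to 4-bit.
model = onnx.load("model_fp32.onnx")
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("model_int4.onnx", use_external_data_format=True)
```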

Fritskee commented 5 months ago

Any updates on this issue?

ogencoglu commented 5 months ago

4 bit would indeed be great. Any updates?

ideasbyjin commented 1 month ago

Being able to convert a HF model for 4-bit quantization would be awesome!!

yufenglee commented 1 month ago

> Being able to convert a HF model for 4-bit quantization would be awesome!!

The QLLM tool can convert a 4-bit HF model to ONNX: https://github.com/wejoncy/QLLM. A tool from the ORT Generate API can also convert it with this PR: https://github.com/microsoft/onnxruntime-genai/pull/600

ideasbyjin commented 1 month ago

Thanks, I might be missing something, but for my models (which are encoder-only models) I'm not sure how to get it to work. I was able to 4-bit quantize them using BitsAndBytes on HF, but not export them to ONNX.

elephantpanda commented 5 days ago

Hi, I see ONNX now supports a 4-bit data type. Is there any more information on how to make use of it and quantize models down to 4 bits?