qwopqwop200 closed this issue 1 year ago
That's very interesting and promising @qwopqwop200. Do you think that this can be generalized to any model through some wrapper like this?
model = AutoModelForCausalLM.from_pretrained(...)
model = convert_to_4bit(model)
output_ids = model.generate(input_ids)
I think it's difficult if the implementation of the model is not constant. For example, OPT and BLOOM are mostly similar, but the architecture differs in some parts: for positional embeddings, OPT uses LearnedPositionalEmbedding, while BLOOM uses ALiBi. Due to these differences, some parts of the code may differ. However, most of the code is the same. If you handle these differences, I think you can be compatible with most (not all) Transformer architectures.
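Conceptually, the wrapper from the question above would just walk the module tree and swap every nn.Linear for a quantized equivalent, leaving the architecture-specific pieces (positional embeddings, ALiBi, etc.) alone. A minimal sketch, assuming a hypothetical quant_linear_cls drop-in layer (not something provided by this repo):

import torch.nn as nn

def convert_to_4bit(module, quant_linear_cls):
    # quant_linear_cls is a hypothetical drop-in replacement for nn.Linear
    # (e.g. a GPTQ-style layer holding packed 4-bit weights plus scales/zeros).
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlayer = quant_linear_cls(child.in_features, child.out_features,
                                      bias=child.bias is not None)
            setattr(module, name, qlayer)
        else:
            # Recurse; embeddings, norms and attention wiring are left as-is,
            # which is exactly where the OPT/BLOOM-style differences remain.
            convert_to_4bit(child, quant_linear_cls)
    return module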
Thanks for the clarifications. If my 2 brain cells did the math right, 4-bit would allow llama-30b to be loaded with about 20GB VRAM. Having that in the web UI would be very nice.
I would love to see this... imagine the possibilities. Also, does this work on Windows?
Kept getting this error.
I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
I would love to see this... imagine the possibilities. Also, does this work on Windows?
Kept getting this error.
I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
I am currently experimenting on Windows 11 and have installed the cuda kernel. If you can't install it on Windows, you can also use WSL2.
Another question: I see no mention of temperature, top_p, top_k, etc in the code. Is it possible to use those parameters somehow?
My code is based on GPTQ, and GPTQ only provides benchmark code for simplicity. Therefore, you need to write separate code for inference, like this code.
Already writing implementations for 4-bit, love it. How fast is the inference time when running llama 30B 4-bit on a 3090?
To be honest, it is not clear to me how to implement this because there is no inference code with some examples to follow. Also, without temperature, repetition_penalty, top_p, and top_k (specifically those 4 parameters), the results would not be good. Maybe someone can help?
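For reference, once a quantized model is loaded as an ordinary PyTorch module, those four parameters are just the standard Transformers generate() arguments; a rough sketch, assuming `model` (the quantized LLaMA) and `tokenizer` are already loaded and the parameter values are only illustrative:

# Assumes `model` and a matching `tokenizer` are already loaded.
input_ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids.cuda()
output_ids = model.generate(
    input_ids,
    do_sample=True,            # required for the sampling parameters below to take effect
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.15,
    max_new_tokens=200,
)
print(tokenizer.decode(output_ids[0]))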
It seems like bitsandbytes will have int4 support soon https://github.com/huggingface/transformers/pull/21955#issuecomment-1455235281, but that will probably not be equivalent to GPTQ. Figure 1 in the paper shows a comparison between naive 4-bit quantization (which they call RTN, "round-to-nearest") and their approach, and it is clear that the difference is huge: https://arxiv.org/pdf/2210.17323.pdf
I'm working on converting all the llama variants to 3-bit, keep an eye on the decapoda-research. I'll update here when they're available.
Super, @zoidbb!
I would love to see this... imagine the possibilities. Also, does this work on Windows?
Kept getting this error.
I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
@MetaIX I received this error a while ago, and according to Google, it happens when you don't have nccl installed.
@qwopqwop200 are you aware of any 3-bit or 4-bit inference methods? I can't find anything beyond some theoretical proposal that never got implemented. Without an implementation of 3- or 4-bit inference, there's no way to go forward.
bitsandbytes will have 4-bit inference soon, at which point we should be able to load a 4-bit model quantized via GPTQ and use the bitsandbytes 4-bit inference function against it.
https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147 "Our analysis is extensive, spanning 5 models (BLOOM, BLOOM, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision."
"The case for 4-bit precision: k-bit Inference Scaling Laws" https://arxiv.org/abs/2212.09720
3-bit inference results were not too promising across these models in that paper. Their conclusion was that 4-bit is the sweet spot. I expect 4-bit will be superior quality. I would love to be surprised though.
@xNul Thanks for the info. I had some weird stuff going on in the env lol.
@qwopqwop200 So this should be relatively easy to implement, since you already did most of the heavy lifting.
https://huggingface.co/decapoda-research/llama-smallint-pt
Quantized checkpoints for 7b/13b/30b are available in both 3-bit and 4-bit. The 3-bit files are the same size as the 4-bit files, amusingly -- likely due to how they're packed. These are not wrapped with Transformers magic, so good luck. Also not sure how to use them for actual inference yet. Will work that out later this week if no one else gets to it. There seem to be some clues in the OPT and BLOOM code inside the GPTQ repository.
65b is almost done quantizing, should have those up within the next couple hours in the same repo.
Something seems off. LLaMA-30B is ~60GB in fp16. I would expect it to be around 1/4 of that size in 4-bit, i.e. 15GB. 12GB is considerably smaller, and about the size I would expect 3-bit to be if it were stored efficiently.
If LLaMA-30B fits on a 16GB card in 4-bit with room to spare I'll be very very surprised.
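For reference, the back-of-the-envelope math (assuming ~32.5B parameters for LLaMA-30B, and ignoring embeddings and the scales/zeros metadata that packed checkpoints carry):

params = 32.5e9  # approximate parameter count of LLaMA-30B
for bits in (16, 4, 3):
    print(f"{bits}-bit: ~{params * bits / 8 / 2**30:.1f} GiB")
# 16-bit: ~60.5 GiB, 4-bit: ~15.1 GiB, 3-bit: ~11.3 GiB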
Good work, either way! We're getting somewhere.
Agreed, it's quite odd that the 4-bit output is this small. Once I better understand how this works (I haven't had a chance to dig in deep) I might know better why this is happening, and whether this result is incorrect.
It's probably this small because it's 3-bit quantization. As of now, the code does not support 4-bit quantization.
https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147 "Our analysis is extensive, spanning 5 models (BLOOM, BLOOM, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision."
"The case for 4-bit precision: k-bit Inference Scaling Laws" https://arxiv.org/abs/2212.09720
3-bit inference results were not too promising across these models in that paper. Their conclusion was that 4-bit is the sweet spot. I expect 4-bit will be superior quality. I would love to be surprised though.
https://arxiv.org/abs/2212.09720
That paper uses zero-shot quantization, and according to the GPTQ paper, GPTQ achieves more robust results at lower bits. This can be seen in Table 1 and Figure 5 of the paper.
So I don't think 3-bit is worth the effort. To gain real benefits, we would need a working, well-maintained 3-bit CUDA kernel. The CUDA kernel provided by the original GPTQ authors is extremely specialized and pretty much unmaintained by them or any community.
The benefits of GPTQ for 4-bit quantization are negligible vs RTN, so GPTQ really only has a place in 2/3-bit quant. Eventually it would be nice to have this, but given the lack of a robust 3-bit CUDA kernel this is a non-starter for any real project.
Lastly, the engineering behind the original GPTQ codebase is suspect. There are bugs all over the place, it's poorly organized, and poorly documented. It would take more work to turn this into a useful library and maintain it than it is currently worth.
bitsandbytes will be releasing 4-bit support at some point relatively soon. I think it would be best to wait for that, as integration into the existing Transformers library should be straightforward from that point given the existing 8-bit quantization support.
My two cents: hold off on implementation until we see 4-bit from bitsandbytes.
Taking a closer look at the plot, it seems like the difference between GPTQ and RTN at the ranges we are (or I am) most interested in (10-30b parameters) is indeed not that significant:
The idea of lightly re-optimizing the weights to make up for the loss in accuracy is very appealing though. I hope that it will become a standard in the future.
@zoidbb I am confused, forgetting about 3-bit, will your converted GPTQ 4-bit weights be usable in transformers when the 4-bit bitsandbytes implementation is complete and integrated into transformers or not?
My code is just for experimentation. Therefore, it may be better to use bitsandbytes.
@zoidbb I am confused, forgetting about 3-bit, will your converted GPTQ 4-bit weights be usable in transformers when the 4-bit bitsandbytes implementation is complete and integrated into transformers or not?
Yes, they should work. That said, I think it would be best to re-quantize them with bitsandbytes once int4 support is out in that library. I'll keep an eye out for that release and publish int4 quantizations when possible.
Although experimental, I implemented a 4-bit CUDA kernel. I am currently testing it.
Model | Bits | group-size | memory(MB) | ppl |
---|---|---|---|---|
LLaMa-7B with FP16 | 16 | - | 12980 | 10.90 |
LLaMa-7B with GPTQ | 4 | - | 3780 | 16.63 |
As a result of the current experiment, the memory is greatly reduced.
LLaMA-7B with 3780 MB VRAM
I am impressed
For a more realistic experiment, we tested by setting seqlen to 2048.
Model | Bits | memory(MiB) | benchmark(ppl) | Wikitext2 | PTB | C4 | checkpoint size(GB) |
---|---|---|---|---|---|---|---|
LLaMa-7B with FP16 | 16 | 13940 | 5.23 | 5.67 | 8.79 | 7.05 | 12.5 |
LLaMa-13B with FP16 | 16 | OOM | - | 5.08 | 8.06 | 6.58 | 24.2 |
LLaMa-7B with GPTQ | 4 | 4740 | 6.23 | 6.79 | 10.67 | 8.28 | 3.5 |
LLaMa-13B with GPTQ | 4 | 8410 | 5.14 | 5.35 | 8.40 | 6.82 | 6.5 |
So will this work on Pascal and above, or is it Ampere only?
I don't have any GPU other than the RTX 3090 so I can't test if it works.
How does LLaMA-30B compare in performance and memory at 3- and 4-bit to the smaller 7B and 13B models? Of course benchmarks will likely be better in all cases, but it would be interesting to see how much.
I don't have enough memory to quantize LLaMa-33B. But according to this tweet, it seems to run with 17 GB of memory.
Not sure if it's useful here, but I wrote some int4-fp16 matmul kernels a while ago for a SparseGPT+GPTQ impl that I never finished (joint sparsification hated me); code here: https://github.com/mstnegate/int4matmul_kernels
There's plenty of room for speedup, but I found them fast enough for usable inference at max context length for OPT models (haven't looked at how LLaMa models are laid out.) Hopefully this could tide us over until BNB int4 arrives?
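For anyone curious what such a kernel computes, here is a rough pure-PyTorch reference of the common int4 packing scheme (eight 4-bit values per int32 word, with a per-output-column scale and zero point); the exact layout in that repo may differ:

import torch

def dequant_int4(qweight, scales, zeros):
    # qweight: (in_features // 8, out_features) int32, 8 nibbles packed per word
    # scales, zeros: (out_features,) fp16 quantization parameters per column
    shifts = torch.arange(0, 32, 4, device=qweight.device)       # 0, 4, ..., 28
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF    # extract nibbles
    w = w.reshape(-1, qweight.shape[1]).half()                   # (in_features, out_features)
    return (w - zeros) * scales                                  # affine dequantization

# The matmul then runs in fp16: y = x @ dequant_int4(qweight, scales, zeros);
# a fused kernel does the unpacking on the fly instead of materializing w.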
Not sure if it's useful here, but I wrote some int4-fp16 matmul kernels a while ago for a SparseGPT+GPTQ impl that I never finished (joint sparsification hated me); code here: https://github.com/mstnegate/int4matmul_kernels
There's plenty of room for speedup, but I found them fast enough for usable inference at max context length for OPT models (haven't looked at how LLaMa models are laid out.) Hopefully this could tide us over until BNB int4 arrives?
Using SparseGPT+GPTQ together is quite interesting. Considering the current implementation of my cuda kernel, there seems to be no memory advantage. However, I think it can be effective in terms of speed improvement.
Updated the cuda kernel to add support for 2, 3, and 8 bits.
I don't have enough memory to quantize LLaMa-33B. But according to this tweet, it seems to run with 17 GB of memory.
Ran the benchmark myself on a 4090. 4bit 30B file size is 15.8 GiB
Median: 0.05140483379364014
PPL: 4.606910705566406
max memory(MiB): 19499.111328125
Sadly, the custom kernel only supports 1-token input, but generation time is reasonable. I used the token "balls".
Generated in 8.1260 seconds
balls-and the 100-year-old tradition of the game.
The game is played on a 100-foot-long, 50-foot-wide field of grass. The object is to get the ball into the goal, which is a 10-foot-high, 20-foot-wide net. The ball is a 3-pound rubber ball. The sticks are 5-foot-long wooden sticks with a net on the end. The game is played with 10 players on each team.
The game is played in two 30-minute halves. The game is played in a clockwise direction. The ball is moved by passing it to other players or by hitting it with the stick. The ball cannot be touched with the hands.
The game is played in 70 countries. The sport is most popular in India, where it is the national sport.
Changed the cuda kernel to support more than 2 tokens.
Changed the cuda kernel to support more than 2 tokens.
Thank you very much. I implemented a super hacky solution to load 30B in 4bit on the webui and I get good speeds on a 4090
Changed the cuda kernel to support more than 2 tokens.
Thank you very much. I implemented a super hacky solution to load 30B in 4bit on the webui and I get good speeds on a 4090
Care to share instructions? I tried getting it working, but the kernel installation errors out.
Attempting to quantize returns this error:
Traceback (most recent call last):
  File "C:\Syn\txtAI\GPTQ-for-LLaMa\llama.py", line 399, in <module>
    quantizers = llama_sequential(model, dataloader, DEV)
  File "C:\Users\User\miniconda3\envs\textgen\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Syn\txtAI\GPTQ-for-LLaMa\llama.py", line 29, in llama_sequential
    layers = model.model.layers
  File "C:\Users\User\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LLaMAModel' object has no attribute 'layers'
I'm not entirely certain of the intricacies of Python. I assume conversions need to happen from the "hf" files. I am likely missing something obvious as I've not delved into ML coding.
Care to share instructions? I tried getting it working, but the kernel installation errors out.
I don't have anything detailed; honestly, everything just worked for me.
I created a venv.
I installed the requirements for both this webui and the GPTQ repo into the venv.
I installed the kernel from the GPTQ repo into the venv.
I then just dumped the model loading code from GPTQ (with hardcoded paths because I'm lazy) into models.py.
The 4-bit model now loads and I can run inference with a good UI.
Care to share instructions? I tried getting it working, but the kernel installation errors out.
I don't have anything detailed; honestly, everything just worked for me. I created a venv. I installed the requirements for both this webui and the GPTQ repo into the venv. I installed the kernel from the GPTQ repo into the venv. I then just dumped the model loading code from GPTQ (with hardcoded paths because I'm lazy) into models.py. The 4-bit model now loads and I can run inference with a good UI.
Weird, must be something in my setup. Are you on Windows or Linux?
It appears 4chan has come through with pre-converted models. I've not used them yet and can't attest to their safety - https://boards.4channel.org/g/thread/91979249/4bit-llama-is-finally-here
So I am also a little confused; maybe someone can ELI5 this for me. My aim here is to stand up the 13B model in the webui using 4-bit on my Linux box.
Weird, must be something in my setup. Are you on Windows or Linux?
Linux
4. What do I need to do now to tie all this together in the webui? What are the required modifications to models.py? Can anyone show me a diff or something?
I dumped the entirety of llama.py (other than main) into models.py because I didn't care about code that didn't do anything.
I copied all of the .py files from the GPTQ repo into the modules folder for the same reason.
I added the --load-in-4bit launch arg to shared.py.
I then added this code to the top of the if statements in the model loading function in models.py and changed the if on line 44 to elif:
if shared.args.load_in_4bit:
    model = load_quant("/home/draff/AI-Stuff/Text/GPTQ-for-LLaMa/models/llama-30b/", "models/LLaMA-30b/llama30b-4bit.pt", 4)
    model = model.to(torch.device('cuda:0'))
This is a bad solution, and tbh I would just wait for oobabooga to implement something that isn't stupid. I just wanted it to work quickly; I didn't care about it working well.
@ItsLogic Thanks for the solution. I can't seem to get past ModuleNotFoundError: No module named 'gptq' though, and I've run setup_cuda.py and everything...
@ItsLogic Thanks for the solution. I can't seem to get past ModuleNotFoundError: No module named 'gptq' though, and I've run setup_cuda.py and everything... Another question - what's the path in the 1st string parameter in load_quant meant to point to?
You need to copy the .py files from the GPTQ-for-LLaMa/ folder to the text-generation-webui/modules/ folder. You can also get away with not merging llama.py into models.py if you call llama.load_quant instead.
Or alternatively, wait for a more production-ready merge.
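For anyone following along, a rough sketch of what that glue might look like (paths are placeholders; load_quant's argument order is taken from the usage shown earlier in this thread, and the import assumes llama.py from GPTQ-for-LLaMa has been copied somewhere on the Python path, e.g. next to models.py):

import torch
import llama  # llama.py copied from GPTQ-for-LLaMa

def load_llama_4bit(config_dir, checkpoint_path, wbits=4):
    # load_quant rebuilds the LLaMA graph with quantized linear layers and
    # then loads the packed low-bit weights from the .pt checkpoint.
    model = llama.load_quant(config_dir, checkpoint_path, wbits)
    return model.to(torch.device('cuda:0'))

# e.g. model = load_llama_4bit("models/llama-30b", "models/llama30b-4bit.pt", 4)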
@qwopqwop200 great job on the cuda kernel! I'm going to go ahead and work on packaging some of this stuff up into something a bit more user-friendly, since there's so much demand for it :) Do you think you might be able to add 8-bit quant support to the quant code? What I'd like to do is try creating two things from this base code you've put together:
1) a tool to quantize any supported huggingface model to 8-bit or 4-bit from fp16/fp32, starting with llama models
2) a library that text-generation-webui can use to load the customized CUDA kernel to achieve higher performance with those quantized models
I'll be pushing trustworthy int4 conversions to the hub today.
A library would be nice.
I have tried loading the model without success so far. Here is what I did:
pip uninstall transformers
pip install git+https://github.com/zphang/transformers@llama_push
Re-convert LLaMA-7b using the updated convert_llama_weights_to_hf.py and put that into the models/llama-7b-new folder.
Put this file into the models folder: https://huggingface.co/decapoda-research/llama-smallint-pt/resolve/main/llama-7b-4bit.pt
Load the model with
model = load_quant("models/llama-7b-new", "models/llama-7b-4bit.pt", 4)
model = model.to(torch.device('cuda:0'))
I got this error:
Unexpected key(s) in state_dict: "model.decoder.embed_tokens.weight", "model.decoder.layers.0.self_attn.q_proj.zeros", "model.decoder.layers.0.self_attn.q_proj.scales", "model.decoder.layers.0.self_attn.q_proj.bias", "model.decoder.layers.0.self_attn.q_proj.qweight", "model.decoder.layers.0.self_attn.k_proj.zeros", "model.decoder.layers.0.self_attn.k_proj.scales", "model.decoder.layers.0.self_attn.k_proj.bias", "model.decoder.layers.0.self_attn.k_proj.qweight (...)
Any idea what I am doing wrong?
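A quick way to narrow this down is to look at the checkpoint's keys directly: the model.decoder.* names above suggest the .pt file was quantized against an older LLaMA conversion whose submodule was still named decoder, while the updated conversion (and the traceback earlier in this thread) uses model.layers / model.embed_tokens. A small diagnostic sketch, reusing the path from the steps above:

import torch

ckpt = torch.load("models/llama-7b-4bit.pt", map_location="cpu")
print(list(ckpt.keys())[:5])
# Keys like "model.decoder.layers.0.self_attn.q_proj.qweight" point at the old
# naming; the re-converted model expects "model.layers.0.self_attn.q_proj...".
# If that is the only difference, re-quantizing against the new conversion
# (or remapping the key prefixes before load_state_dict) should resolve it.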
GPTQ is currently the SOTA one-shot quantization method for LLMs. GPTQ supports amazingly low 3-bit and 4-bit weight quantization. And it can be applied to LLaMa. I've actually confirmed that this works well in LLaMa 7b. I haven't tested the memory usage (n-bit cuda kernel), but I think it should work.
code: https://github.com/qwopqwop200/GPTQ-for-LLaMa