qwopqwop200 closed this issue 1 year ago
That's very interesting and promising @qwopqwop200. Do you think that this can be generalized to any model through some wrapper like this?
model = AutoModelForCausalLM.from_pretrained(...)
model = convert_to_4bit(model)
output_ids = model.generate(input_ids)
I think it's difficult if the implementation of the model is not constant. For example, OPT and BLOOM are mostly similar, but the architecture differs in some parts: for positional embeddings, OPT uses LearnedPositionalEmbedding, while BLOOM uses ALiBi. Due to these differences, some parts of the code may differ. However, most of the code is the same. If you handle these differences, I think you can be compatible with most (not all) Transformer architectures.
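Conceptually, the wrapper from the question above would just walk the module tree and swap every nn.Linear for a quantized equivalent, leaving the architecture-specific pieces (positional embeddings, ALiBi, etc.) alone. A minimal sketch, assuming a hypothetical quant_linear_cls drop-in layer (not something provided by this repo):

import torch.nn as nn

def convert_to_4bit(module, quant_linear_cls):
    # quant_linear_cls is a hypothetical drop-in replacement for nn.Linear
    # (e.g. a GPTQ-style layer holding packed 4-bit weights plus scales/zeros).
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlayer = quant_linear_cls(child.in_features, child.out_features,
                                      bias=child.bias is not None)
            setattr(module, name, qlayer)
        else:
            # Recurse; embeddings, norms and attention wiring are left as-is,
            # which is exactly where the OPT/BLOOM-style differences remain.
            convert_to_4bit(child, quant_linear_cls)
    return module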
Thanks for the clarifications. If my 2 brain cells did the math right, 4-bit would allow llama-30b to be loaded with about 20GB VRAM. Having that in the web UI would be very nice.
I would love to see this... imagine the possibilities. Also, does this work on Windows?
Kept getting this error.
I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
I would love to see this... imagine the possibilities. Also, does this work on Windows?
Kept getting this error.
I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
I am currently experimenting on Windows 11 and have installed the cuda kernel. If you can't install it on Windows, you can also use WSL2.
Another question: I see no mention of temperature, top_p, top_k, etc in the code. Is it possible to use those parameters somehow?
My code is based on GPTQ, and GPTQ only provides benchmark code for simplicity. Therefore, you need to write separate code for inference, like this code.
Already writing implementations for 4-bit, love it. How fast is the inference time when running llama 30B 4-bit on a 3090?
To be honest, it is not clear to me how to implement this because there is no inference code with some examples to follow. Also, without temperature, repetition_penalty, top_p, and top_k (specifically those 4 parameters), the results would not be good. Maybe someone can help?
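For reference, once a quantized model is loaded as an ordinary PyTorch module, those four parameters are just the standard Transformers generate() arguments; a rough sketch, assuming `model` (the quantized LLaMA) and `tokenizer` are already loaded and the parameter values are only illustrative:

# Assumes `model` and a matching `tokenizer` are already loaded.
input_ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids.cuda()
output_ids = model.generate(
    input_ids,
    do_sample=True,            # required for the sampling parameters below to take effect
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.15,
    max_new_tokens=200,
)
print(tokenizer.decode(output_ids[0]))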
It seems like bitsandbytes will have int4 support soon https://github.com/huggingface/transformers/pull/21955#issuecomment-1455235281, but that will probably not be equivalent to GPTQ. Figure 1 in the paper shows a comparison between naive 4-bit quantization (which they call RTN, "round-to-nearest") and their approach, and it is clear that the difference is huge: https://arxiv.org/pdf/2210.17323.pdf
I'm working on converting all the llama variants to 3-bit, keep an eye on the decapoda-research. I'll update here when they're available.
Super, @zoidbb!
I would love to see this... imagine the possibilities. Also, does this work on Windows?
Kept getting this error.
I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
@MetaIX I received this error a while ago, and according to Google, it happens when you don't have nccl installed.
@qwopqwop200 are you aware of any 3-bit or 4-bit inference methods? I can't find anything beyond some theoretical proposal that never got implemented. Without an implementation of 3- or 4-bit inference, there's no way to go forward.
bitsandbytes will have 4-bit inference soon, at which point we should be able to load a 4-bit model quantized via GPTQ and use the bitsandbytes 4-bit inference function against it.
https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147 "Our analysis is extensive, spanning 5 models (BLOOM, BLOOM, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision."
"The case for 4-bit precision: k-bit Inference Scaling Laws" https://arxiv.org/abs/2212.09720
3-bit inference results were not too promising across these models in that paper. Their conclusion was that 4-bit is the sweet spot. I expect 4-bit will be superior quality. I would love to be surprised though.
@xNul Thanks for the info. I had some weird stuff going on in the env lol.
@qwopqwop200 So this should be relatively easy to implement, since you already did most of the heavy lifting.
https://huggingface.co/decapoda-research/llama-smallint-pt
Quantized checkpoints for 7b/13b/30b are available in both 3-bit and 4-bit. The 3-bit files are the same size as the 4-bit files, amusingly -- likely due to how they're packed. These are not wrapped with Transformers magic, so good luck. Also not sure how to use them for actual inference yet. Will work that out later this week if no one else gets to it. There seem to be some clues in the OPT and BLOOM code inside the GPTQ repository.
65b is almost done quantizing, should have those up within the next couple hours in the same repo.
Something seems off. LLaMA-30B is ~60GB in fp16. I would expect it to be around 1/4 of that size in 4-bit, i.e. 15GB. 12GB is considerably smaller, and about the size I would expect 3-bit to be if it were stored efficiently.
If LLaMA-30B fits on a 16GB card in 4-bit with room to spare I'll be very very surprised.
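For reference, the back-of-the-envelope math (assuming ~32.5B parameters for LLaMA-30B, and ignoring embeddings and the scales/zeros metadata that packed checkpoints carry):

params = 32.5e9  # approximate parameter count of LLaMA-30B
for bits in (16, 4, 3):
    print(f"{bits}-bit: ~{params * bits / 8 / 2**30:.1f} GiB")
# 16-bit: ~60.5 GiB, 4-bit: ~15.1 GiB, 3-bit: ~11.3 GiB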
Good work, either way! We're getting somewhere.
Agreed, it's quite odd that the 4-bit output is this small. Once I better understand how this works (I haven't had a chance to dig in deep) I might know better why this is happening, and whether this result is incorrect.
It's probably this small because it's 3-bit quantization. As of now, the code does not support 4-bit quantization.
https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147 "Our analysis is extensive, spanning 5 models (BLOOM, BLOOM, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision."
"The case for 4-bit precision: k-bit Inference Scaling Laws" https://arxiv.org/abs/2212.09720
3-bit inference results were not too promising across these models in that paper. Their conclusion was that 4-bit is the sweet spot. I expect 4-bit will be superior quality. I would love to be surprised though.
https://arxiv.org/abs/2212.09720
That paper uses zero-shot quantization, and according to the GPTQ paper, GPTQ achieves more robust results at lower bits. This can be seen in Table 1 and Figure 5 of the paper.
So I don't think 3-bit is worth the effort. To gain real benefits, we would need a working, well-maintained 3-bit CUDA kernel. The CUDA kernel provided by the original GPTQ authors is extremely specialized and pretty much unmaintained by them or any community.
The benefits of GPTQ for 4-bit quantization are negligible vs RTN, so GPTQ really only has a place in 2/3-bit quant. Eventually it would be nice to have this, but given the lack of a robust 3-bit CUDA kernel this is a non-starter for any real project.
Lastly, the engineering behind the original GPTQ codebase is suspect. There are bugs all over the place, it's poorly organized, and poorly documented. It would take more work to turn this into a useful library and maintain it than it is currently worth.
bitsandbytes will be releasing 4-bit support at some point relatively soon. I think it would be best to wait for that, as integration into the existing Transformers library should be straightforward from that point given the existing 8-bit quantization support.
My two cents: hold off on implementation until we see 4-bit from bitsandbytes.
Taking a closer look at the plot, it seems like the difference between GPTQ and RTN at the ranges we are (or I am) most interested in (10-30b parameters) is indeed not that significant:
The idea of lightly re-optimizing the weights to make up for the loss in accuracy is very appealing though. I hope that it will become a standard in the future.
@zoidbb I am confused, forgetting about 3-bit, will your converted GPTQ 4-bit weights be usable in transformers when the 4-bit bitsandbytes implementation is complete and integrated into transformers or not?
My code is just for experimentation. Therefore, it may be better to use bitsandbytes.
@zoidbb I am confused, forgetting about 3-bit, will your converted GPTQ 4-bit weights be usable in transformers when the 4-bit bitsandbytes implementation is complete and integrated into transformers or not?
Yes, they should work. That said, I think it would be best to re-quantize them with bitsandbytes once int4 support is out in that library. I'll keep an eye out for that release and publish int4 quantizations when possible.
Although experimental, I implemented a 4-bit CUDA kernel. I am currently testing it.
Model | Bits | group-size | memory(MB) | ppl |
---|---|---|---|---|
LLaMa-7B with FP16 | 16 | - | 12980 | 10.90 |
LLaMa-7B with GPTQ | 4 | - | 3780 | 16.63 |
As a result of the current experiment, the memory is greatly reduced.
LLaMA-7B with 3780 MB VRAM
I am impressed
For a more realistic experiment, we tested by setting seqlen to 2048.
Model | Bits | memory(MiB) | benchmark(ppl) | Wikitext2 | PTB | C4 | checkpoint size(GB) |
---|---|---|---|---|---|---|---|
LLaMa-7B with FP16 | 16 | 13940 | 5.23 | 5.67 | 8.79 | 7.05 | 12.5 |
LLaMa-13B with FP16 | 16 | OOM | - | 5.08 | 8.06 | 6.58 | 24.2 |
LLaMa-7B with GPTQ | 4 | 4740 | 6.23 | 6.79 | 10.67 | 8.28 | 3.5 |
LLaMa-13B with GPTQ | 4 | 8410 | 5.14 | 5.35 | 8.40 | 6.82 | 6.5 |
So will this work on Pascal and above, or is it Ampere only?
I don't have any GPU other than the RTX 3090 so I can't test if it works.
How does LLaMA-30B compare in performance and memory at 3- and 4-bit to the smaller 7B and 13B models? Of course benchmarks will likely be better in all cases, but it would be interesting to see how much.
I don't have enough memory to quantize LLaMa-33B. But according to this tweet, it seems to run with 17 GB of memory.
Not sure if it's useful here, but I wrote some int4-fp16 matmul kernels a while ago for a SparseGPT+GPTQ impl that I never finished (joint sparsification hated me); code here: https://github.com/mstnegate/int4matmul_kernels
There's plenty of room for speedup, but I found them fast enough for usable inference at max context length for OPT models (haven't looked at how LLaMa models are laid out.) Hopefully this could tide us over until BNB int4 arrives?
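For anyone curious what such a kernel computes, here is a rough pure-PyTorch reference of the common int4 packing scheme (eight 4-bit values per int32 word, with a per-output-column scale and zero point); the exact layout in that repo may differ:

import torch

def dequant_int4(qweight, scales, zeros):
    # qweight: (in_features // 8, out_features) int32, 8 nibbles packed per word
    # scales, zeros: (out_features,) fp16 quantization parameters per column
    shifts = torch.arange(0, 32, 4, device=qweight.device)       # 0, 4, ..., 28
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF    # extract nibbles
    w = w.reshape(-1, qweight.shape[1]).half()                   # (in_features, out_features)
    return (w - zeros) * scales                                  # affine dequantization

# The matmul then runs in fp16: y = x @ dequant_int4(qweight, scales, zeros);
# a fused kernel does the unpacking on the fly instead of materializing w.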
Not sure if it's useful here, but I wrote some int4-fp16 matmul kernels a while ago for a SparseGPT+GPTQ impl that I never finished (joint sparsification hated me); code here: https://github.com/mstnegate/int4matmul_kernels
There's plenty of room for speedup, but I found them fast enough for usable inference at max context length for OPT models (haven't looked at how LLaMa models are laid out.) Hopefully this could tide us over until BNB int4 arrives?
Using SparseGPT+GPTQ together is quite interesting. Considering the current implementation of my cuda kernel, there seems to be no memory advantage. However, I think it can be effective in terms of speed improvement.
Updated the cuda kernel to add support for 2, 3, and 8 bits.
I don't have enough memory to quantize LLaMa-33B. But according to this tweet, it seems to run with 17 GB of memory.
Ran the benchmark myself on a 4090. 4bit 30B file size is 15.8 GiB
Median: 0.05140483379364014
PPL: 4.606910705566406
max memory(MiB): 19499.111328125
Sadly, the custom kernel only supports 1-token input, but generation time is reasonable. I used the token "balls".
Generated in 8.1260 seconds
balls-and the 100-year-old tradition of the game.
The game is played on a 100-foot-long, 50-foot-wide field of grass. The object is to get the ball into the goal, which is a 10-foot-high, 20-foot-wide net. The ball is a 3-pound rubber ball. The sticks are 5-foot-long wooden sticks with a net on the end. The game is played with 10 players on each team.
The game is played in two 30-minute halves. The game is played in a clockwise direction. The ball is moved by passing it to other players or by hitting it with the stick. The ball cannot be touched with the hands.
The game is played in 70 countries. The sport is most popular in India, where it is the national sport.
Changed the cuda kernel to support more than 2 tokens.
Changed the cuda kernel to support more than 2 tokens.
Thank you very much. I implemented a super hacky solution to load 30B in 4bit on the webui and I get good speeds on a 4090
Changed the cuda kernel to support more than 2 tokens.
Thank you very much. I implemented a super hacky solution to load 30B in 4bit on the webui and I get good speeds on a 4090
Care to share instructions? I tried getting it working, but the kernel installation errors out.
Attempting to quantize returns this error:
Traceback (most recent call last):
  File "C:\Syn\txtAI\GPTQ-for-LLaMa\llama.py", line 399, in <module>
    quantizers = llama_sequential(model, dataloader, DEV)
  File "C:\Users\User\miniconda3\envs\textgen\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Syn\txtAI\GPTQ-for-LLaMa\llama.py", line 29, in llama_sequential
    layers = model.model.layers
  File "C:\Users\User\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LLaMAModel' object has no attribute 'layers'
I'm not entirely certain of the intricacies of Python. I assume conversions need to happen from the "hf" files. I am likely missing something obvious as I've not delved into ML coding.
Care to share instructions? I tried getting it working, but the kernel installation errors out.
I don't have anything detailed; honestly, everything just worked for me.
I created a venv.
I installed the requirements for both this webui and the GPTQ repo into the venv.
I installed the kernel from the GPTQ repo into the venv.
I then just dumped the model loading code from GPTQ (with hardcoded paths because I'm lazy) into models.py.
The 4-bit model now loads and I can run inference with a good UI.
Care to share instructions? I tried getting it working, but the kernel installation errors out.
I don't have anything detailed; honestly, everything just worked for me. I created a venv. I installed the requirements for both this webui and the GPTQ repo into the venv. I installed the kernel from the GPTQ repo into the venv. I then just dumped the model loading code from GPTQ (with hardcoded paths because I'm lazy) into models.py. The 4-bit model now loads and I can run inference with a good UI.
Weird, must be something in my setup. Are you on Windows or Linux?
It appears 4chan has come through with pre-converted models. I've not used them yet and can't attest to their safety - https://boards.4channel.org/g/thread/91979249/4bit-llama-is-finally-here
So I am also a little confused; maybe someone can ELI5 this for me. My aim here is to stand up the 13B model in the webui using 4-bit on my Linux box.
Weird, must be something in my setup. Are you on Windows or Linux?
Linux
4. What do I need to do now to tie all this together in the webui? What are the required modifications to models.py? Can anyone show me a diff or something?
I dumped the entirety of llama.py (other than main) into models.py because I didn't care about code that didn't do anything.
I copied all of the .py files from the GPTQ repo into the modules folder for the same reason.
I added the --load-in-4bit launch arg to shared.py.
I then added this code to the top of the if statements in the model loading function in models.py and changed the if on line 44 to elif:
if shared.args.load_in_4bit:
    model = load_quant("/home/draff/AI-Stuff/Text/GPTQ-for-LLaMa/models/llama-30b/", "models/LLaMA-30b/llama30b-4bit.pt", 4)
    model = model.to(torch.device('cuda:0'))
This is a bad solution, and tbh I would just wait for oobabooga to implement something that isn't stupid. I just wanted it to work quickly; I didn't care about it working well.
@ItsLogic Thanks for the solution. I can't seem to get past ModuleNotFoundError: No module named 'gptq' though, and I've run setup_cuda.py and everything...
@ItsLogic Thanks for the solution. I can't seem to get past ModuleNotFoundError: No module named 'gptq' though, and I've run setup_cuda.py and everything... Another question - what's the path in the 1st string parameter in load_quant meant to point to?
You need to copy the .py files from the GPTQ-for-LLaMa/ folder to the text-generation-webui/modules/ folder. You can also get away with not merging llama.py into models.py if you call llama.load_quant instead.
Or alternatively, wait for a more production-ready merge.
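For anyone following along, a rough sketch of what that glue might look like (paths are placeholders; load_quant's argument order is taken from the usage shown earlier in this thread, and the import assumes llama.py from GPTQ-for-LLaMa has been copied somewhere on the Python path, e.g. next to models.py):

import torch
import llama  # llama.py copied from GPTQ-for-LLaMa

def load_llama_4bit(config_dir, checkpoint_path, wbits=4):
    # load_quant rebuilds the LLaMA graph with quantized linear layers and
    # then loads the packed low-bit weights from the .pt checkpoint.
    model = llama.load_quant(config_dir, checkpoint_path, wbits)
    return model.to(torch.device('cuda:0'))

# e.g. model = load_llama_4bit("models/llama-30b", "models/llama30b-4bit.pt", 4)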
@qwopqwop200 great job on the cuda kernel! I'm going to go ahead and work on packaging some of this stuff up into something a bit more user-friendly, since there's so much demand for it :) Do you think you might be able to add 8-bit quant support to the quant code? What I'd like to do is try creating two things from this base code you've put together:
1) a tool to quantize any supported huggingface model to 8-bit or 4-bit from fp16/fp32, starting with llama models
2) a library that text-generation-webui can use to load the customized CUDA kernel to achieve higher performance with those quantized models
I'll be pushing trustworthy int4 conversions to the hub today.
A library would be nice.
I have tried loading the model without success so far. Here is what I did:
pip uninstall transformers
pip install git+https://github.com/zphang/transformers@llama_push
Re-convert LLaMA-7b using the updated convert_llama_weights_to_hf.py and put that into the models/llama-7b-new folder.
Put this file into the models folder: https://huggingface.co/decapoda-research/llama-smallint-pt/resolve/main/llama-7b-4bit.pt
Load the model with
model = load_quant("models/llama-7b-new", "models/llama-7b-4bit.pt", 4)
model = model.to(torch.device('cuda:0'))
I got this error:
Unexpected key(s) in state_dict: "model.decoder.embed_tokens.weight", "model.decoder.layers.0.self_attn.q_proj.zeros", "model.decoder.layers.0.self_attn.q_proj.scales", "model.decoder.layers.0.self_attn.q_proj.bias", "model.decoder.layers.0.self_attn.q_proj.qweight", "model.decoder.layers.0.self_attn.k_proj.zeros", "model.decoder.layers.0.self_attn.k_proj.scales", "model.decoder.layers.0.self_attn.k_proj.bias", "model.decoder.layers.0.self_attn.k_proj.qweight (...)
Any idea what I am doing wrong?
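A quick way to narrow this down is to look at the checkpoint's keys directly: the model.decoder.* names above suggest the .pt file was quantized against an older LLaMA conversion whose submodule was still named decoder, while the updated conversion (and the traceback earlier in this thread) uses model.layers / model.embed_tokens. A small diagnostic sketch, reusing the path from the steps above:

import torch

ckpt = torch.load("models/llama-7b-4bit.pt", map_location="cpu")
print(list(ckpt.keys())[:5])
# Keys like "model.decoder.layers.0.self_attn.q_proj.qweight" point at the old
# naming; the re-converted model expects "model.layers.0.self_attn.q_proj...".
# If that is the only difference, re-quantizing against the new conversion
# (or remapping the key prefixes before load_state_dict) should resolve it.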
GPTQ is currently the SOTA one-shot quantization method for LLMs. GPTQ supports amazingly low 3-bit and 4-bit weight quantization. And it can be applied to LLaMa. I've actually confirmed that this works well in LLaMa 7b. I haven't tested the memory usage (n-bit cuda kernel), but I think it should work.
code: https://github.com/qwopqwop200/GPTQ-for-LLaMa