oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

GPTQ quantization (3 or 4 bit quantization) support for LLaMa #177

Closed qwopqwop200 closed 1 year ago

qwopqwop200 commented 1 year ago

GPTQ is currently the SOTA one-shot quantization method for LLMs. It supports amazingly low 3-bit and 4-bit weight quantization, and it can be applied to LLaMA. I've confirmed that this works well on LLaMA-7B. I haven't tested the memory usage (n-bit CUDA kernel), but I think it should work.

| Model (LLaMA-7B) | Bits | Group size | WikiText-2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |

code: https://github.com/qwopqwop200/GPTQ-for-LLaMa
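
For intuition, here is a minimal sketch of what plain group-wise round-to-nearest (RTN) quantization does to a weight matrix; GPTQ improves on this baseline by compensating for each rounding error, which is where the perplexity gap in the table above comes from. This is an illustration only, not code from the repo above:

    import torch

    def rtn_quantize(weight, bits=4, group_size=64):
        """Round-to-nearest quantization of a 2-D weight matrix, one scale/zero per group."""
        qmax = 2 ** bits - 1
        out_features, in_features = weight.shape
        assert in_features % group_size == 0
        w = weight.reshape(out_features, in_features // group_size, group_size)
        wmin = w.amin(dim=-1, keepdim=True)
        wmax = w.amax(dim=-1, keepdim=True)
        scale = (wmax - wmin).clamp(min=1e-8) / qmax      # step size per group
        zero = torch.round(-wmin / scale)                 # integer zero-point per group
        q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
        dequant = (q - zero) * scale                      # what the kernel reconstructs at runtime
        return q.to(torch.uint8), dequant.reshape(out_features, in_features)

    # Example: the reconstruction error below is what GPTQ minimizes more cleverly than RTN.
    w = torch.randn(4096, 4096)
    q, w_hat = rtn_quantize(w)
    print((w - w_hat).abs().max())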

sgsdxzy commented 1 year ago

@ItsLogic Thanks, following your steps I got 13B working on my 3080 Ti. However, I find the response time in chat mode very slow: it takes a long time before the GPU starts loading up and generating text. As a chatbot this is annoying. In notebook mode, on the other hand, it works like a charm. Wondering why.

rohvani commented 1 year ago

A library would be nice.

I have tried loading the model without success so far. Here is what I did:

1. Install the updated pull request code:

        pip uninstall transformers
        pip install git+https://github.com/zphang/transformers@llama_push
2. Re-convert LLaMA-7b using the updated [convert_llama_weights_to_hf.py](https://github.com/zphang/transformers/blob/llama_push/src/transformers/models/llama/convert_llama_weights_to_hf.py) and put that into the `models/llama-7b-new` folder.

3. Put this file into the `models` folder: https://huggingface.co/decapoda-research/llama-smallint-pt/resolve/main/llama-7b-4bit.pt

4. Load the model with
        model = load_quant("models/llama-7b-new", "models/llama-7b-4bit.pt", 4)
        model = model.to(torch.device('cuda:0'))

I got this error:

Unexpected key(s) in state_dict: "model.decoder.embed_tokens.weight", "model.decoder.layers.0.self_attn.q_proj.zeros", "model.decoder.layers.0.self_attn.q_proj.scales", "model.decoder.layers.0.self_attn.q_proj.bias", "model.decoder.layers.0.self_attn.q_proj.qweight", "model.decoder.layers.0.self_attn.k_proj.zeros", "model.decoder.layers.0.self_attn.k_proj.scales", "model.decoder.layers.0.self_attn.k_proj.bias", "model.decoder.layers.0.self_attn.k_proj.qweight (...)

Any idea what I am doing wrong?

I ran into this issue as well when I used the originally released PyTorch converted files from the Facebook leak.

I had to reconvert all the models from the original Facebook leak -- it seems that the 16-bit HF models that were published previously were converted without the transformers changes that changed (or added?) the handling of these additional layer properties.

jtang613 commented 1 year ago

@sgsdxzy You might not be using the GPTQ loader. I ran into that also and had to "if True" the line 45. Put a print() above your load_quant() call to see whether it's being reached or not.

dustydecapod commented 1 year ago

@rohvani i just pushed fresh conversions to the hub, under decapoda-research. The transformers lib's changes have been constantly evolving and it takes a lot of time to keep up -- especially since I'm publishing all 4 model sizes.

rohvani commented 1 year ago

Awesome, thanks @zoidbb 😄 your HF repos have been a huge help! Is there a discord/chat for decapoda-research?

dustydecapod commented 1 year ago

Not currently. If I end up having time to dedicate to this more fully, I might put something together. Right now I'm doing this as a fun little hobby during a sabbatical.

devilismyfriend commented 1 year ago

I'm getting this when trying to setup the kernel:

The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).

sgsdxzy commented 1 year ago

@sgsdxzy You might not be using the GPTQ loader. I ran into that also and had to "if True" the line 45. Put a print() above your load_quant() call to see whether it's being reached or not.

I can confirm that model = llama.load_quant(...) is executed. Otherwise the original 13B model couldn't fit on a 3080 Ti.

David-337 commented 1 year ago

I'm getting this when trying to setup the kernel:

The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).

One way to get past that is to update your CUDA version to 11.6 or 11.7

I am still however stuck on the ModuleNotFoundError: No module named 'gptq'

Feels like trying to get various early optimizations of Stable Diffusion running back in the day haha, good times.

dustydecapod commented 1 year ago

I'm getting this when trying to setup the kernel:

The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).

you need a newer version of CUDA. i'm using 11.8. i wouldn't suggest going beyond 11.8 as torch doesn't support higher than that currently.

rohvani commented 1 year ago

I'm getting this when trying to setup the kernel: The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).

One way to get past that is to update your CUDA version to 11.6 or 11.7

I am still however stuck on the ModuleNotFoundError: No module named 'gptq'

Feels like trying to get various early optimizations of Stable Diffusion running back in the day haha, good times.

Here is the change I made to my models.py; note the sys.path.insert call.

    if shared.args.load_in_4bit:
        print('loading 4 bit')
        # Make the GPTQ-for-LLaMa code (cloned into ./gptqllama) importable
        import sys
        sys.path.insert(1, './gptqllama')
        from gptqllama import llama
        # load_quant(model_dir, quantized_checkpoint_path, bits)
        model = llama.load_quant('/root/text-generation-webui/models/LLaMA-13B/', '/root/text-generation-webui/models/LLaMA-13B/llama13b-4bit.pt', 4)
        model = model.to(torch.device('cuda:0'))
        tokenizer = LLaMATokenizer.from_pretrained("/root/text-generation-webui/models/LLaMA-13B/", device_map='auto')
        return model, tokenizer
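
As an aside, the `--load-in-4bit` launch flag used above (and mentioned again later in this thread) is presumably just an argparse `store_true` option on shared.py's parser. A self-contained sketch of the idea, not the exact upstream code:

    import argparse

    # Hypothetical sketch of how shared.args.load_in_4bit gets its value.
    parser = argparse.ArgumentParser()
    parser.add_argument('--load-in-4bit', action='store_true',
                        help='Load the model with 4-bit GPTQ quantized weights.')

    args = parser.parse_args(['--load-in-4bit'])
    print(args.load_in_4bit)  # True when the flag is passed, False otherwise
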
oobabooga commented 1 year ago

I had success using the 4bit models from this magnet: magnet:?

I don't think it's safe to download random pytorch models by torrent.

i just pushed fresh conversions to the hub, under decapoda-research

I can't find a new conversion for llama-7b @zoidbb. Will it appear here? https://huggingface.co/decapoda-research/llama-7b-hf-int4/tree/main

dustydecapod commented 1 year ago

@oobabooga you're a few minutes ahead of me, it's still uploading :P I'm pre-populating the repos while the conversions finish.

devilismyfriend commented 1 year ago

you need a newer version of CUDA. i'm using 11.8. i wouldn't suggest going beyond 11.8 as torch doesn't support higher than that currently.

thanks :)

dustydecapod commented 1 year ago

@oobabooga 7b is done, 13b and larger are still baking.

jtang613 commented 1 year ago

@David-337 I've uploaded my "GPTQ Janky AF" (but working) code to my fork at https://github.com/jtang613/text-generation-webui if you want to take a peek. It's entirely based on other people's hard work. I just put it in one place.

oobabooga commented 1 year ago

7b is done, 13b and larger are still baking.

Thanks!! I have tested your new llama-7b-4bit.pt and it worked. I have generated the reply below using 4979MiB VRAM:

[screenshot of the 4-bit generation]

devilismyfriend commented 1 year ago

@David-337 I've uploaded my "GPTQ Janky AF" (but working) code to my fork at https://github.com/jtang613/text-generation-webui if you want to take a peek. It's entirely based on other people's hard work. I just put it in one place.

also seems like you hard coded the model path:

    if(shared.args.load_in_4bit):
        print("Loading GPTQ ...")
        model = llama.load_quant("/mnt/data/ml/oobabooga/llama-13b/", "/mnt/data/ml/oobabooga/llama-13b/llama13b-4bit.pt", 4)
        model = model.to(torch.device('cuda:0'))
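
For anyone adapting that snippet, a sketch of deriving the paths from the selected model folder under `models/` instead of hard-coding them (illustrative only: `find_quantized_checkpoint` is a made-up helper, and it assumes the 4-bit `.pt` file sits inside the model folder):

    from pathlib import Path

    def find_quantized_checkpoint(model_name, bits=4):
        """Locate a model folder under models/ and the matching n-bit .pt checkpoint inside it."""
        model_dir = Path("models") / model_name
        pt_files = sorted(model_dir.glob(f"*{bits}bit*.pt"))
        if not pt_files:
            raise FileNotFoundError(f"No {bits}-bit .pt checkpoint found in {model_dir}")
        return str(model_dir), str(pt_files[0])

    # e.g. inside the 4-bit branch shown earlier (assuming the selected model name is available):
    # model_dir, pt_path = find_quantized_checkpoint("llama-13b")
    # model = llama.load_quant(model_dir, pt_path, 4)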

Titaniumtown commented 1 year ago

Seems this branch was just made, exciting! https://github.com/oobabooga/text-generation-webui/tree/llama-4bit

oobabooga commented 1 year ago

Here is a cleaned up PR with instructions:

https://github.com/oobabooga/text-generation-webui/pull/206

I will wait for this to be resolved before merging:

https://github.com/huggingface/transformers/pull/21955#issuecomment-1462540212

jtang613 commented 1 year ago

@David-337 I've uploaded my "GPTQ Janky AF" (but working) code to my fork at https://github.com/jtang613/text-generation-webui if you want to take a peek. It's entirely based on other people's hard work. I just put it in one place.

also seems like you hard coded the model path:

    if(shared.args.load_in_4bit):
        print("Loading GPTQ ...")
        model = llama.load_quant("/mnt/data/ml/oobabooga/llama-13b/", "/mnt/data/ml/oobabooga/llama-13b/llama13b-4bit.pt", 4)
        model = model.to(torch.device('cuda:0'))

You missed the "Janky AF" and "for reference only" part. Best to wait for official support.

Titaniumtown commented 1 year ago

@jtang613 #206

MarkSchmidty commented 1 year ago

dumped the entirety of llama.py (other than main) into models.py because I didn't care about code that didn't do anything. I copied all of the .py files from the GPTQ repo into the modules folder for the same reason. I added the --load-in-4bit launch arg to shared.py.

For reproducibility, can you post your conversion process? Is there a script you used to produce the 4bit models?

generic-username0718 commented 1 year ago

dumped the entirety of llama.py (other than main) into models.py because I didn't care about code that didn't do anything. I copied all of the .py files from the GPTQ repo into the modules folder for the same reason. I added the --load-in-4bit launch arg to shared.py.

For reproducibility, can you post your conversion process? Is there a script you used to produce the 4bit models?

Pretty sure the conversation has moved here https://github.com/oobabooga/text-generation-webui/pull/206

ItsLogic commented 1 year ago

dumped the entirety of llama.py (other than main) into models.py because I didn't care about code that didn't do anything. I copied all of the .py files from the GPTQ repo into the modules folder for the same reason. I added the --load-in-4bit launch arg to shared.py.

For reproducibility, can you post your conversion process? Is there a script you used to produce the 4bit models?

Just like generic said, you should now be using #206 for 4-bit quant. The instructions in the GPTQ repo were used to convert the models to 4-bit .pt files.

oobabooga commented 1 year ago

I'm eagerly waiting for @zoidbb to upload the 4-bit version of the 30B model 😃

generic-username0718 commented 1 year ago

I'm eagerly waiting for @zoidbb to upload the 4-bit version of the 30B model 😃

plz sir I'd like some more multi-gpu support

[Oliver Twist "please, sir, I want some more" meme image]

qwopqwop200 commented 1 year ago

I'm eagerly waiting for @zoidbb to upload the 4-bit version of the 30B model 😃

Someone uploaded a magnet link with all of 7B, 13B, 33B, and 65B. I've confirmed that 33B works.

https://rentry.org/llama-tard-v2#bonus-3-convert-the-weights-yourself-optional-recommended

Titaniumtown commented 1 year ago

@qwopqwop200 is that magnet link based off the new or the old huggingface weights? (idk the terminology, but there was a change in the format used in the LLaMA PR)

qwopqwop200 commented 1 year ago

@qwopqwop200 is that magnet link based off the new or the old huggingface weights? (idk the terminology, but there was a change in the format used in the LLaMA PR)

It seems to work just fine. It's probably based on the new PR. It shows an impressive result of 4.59 ppl at 4-bit for 33B.
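
For reference, perplexity figures like the ones above are typically computed by feeding fixed-length token windows of the eval set through the model and exponentiating the average per-token loss. A rough sketch assuming a Hugging Face causal LM and tokenizer are already loaded (not the exact evaluation script behind these numbers):

    import torch

    def perplexity(model, tokenizer, text, ctx_len=2048, device="cuda"):
        """Perplexity over non-overlapping ctx_len-token windows of `text`."""
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        nlls, n_tokens = [], 0
        for i in range(0, ids.shape[1] - 1, ctx_len):
            chunk = ids[:, i : i + ctx_len]
            with torch.no_grad():
                out = model(chunk, labels=chunk)          # HF returns the mean cross-entropy loss
            nlls.append(out.loss * (chunk.shape[1] - 1))  # total NLL for this window
            n_tokens += chunk.shape[1] - 1
        return torch.exp(torch.stack(nlls).sum() / n_tokens).item()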

IdiotSandwichTheThird commented 1 year ago

@qwopqwop200 I'm seeing higher than expected memory usage with this at full context. For 30B, it starts at 20GB of VRAM when generation starts, then slowly climbs to nearly 24GB before a CUDA OOM. Can anyone confirm this weird behavior? The python process shows 23937MiB in nvidia-smi just before the OOM crash.

MarcusRobbins commented 1 year ago

4bit multi-gpu support? I wanna run 65B!!! (I've looked at your commits, my god you are working hard, thanks for everything :) )

deepdiffuser commented 1 year ago

multigpu support here, tested on two 3090s

https://github.com/oobabooga/text-generation-webui/pull/219

BugReporterZ commented 1 year ago

Does the 4-bit quantization here use floating point or integer values? It appears that optimal accuracy is achieved with FP4 rather than INT4.

[figure from https://arxiv.org/pdf/2212.09720.pdf]

qwopqwop200 commented 1 year ago

Looking at Figure 5 in the paper, GPTQ beats float.

qwopqwop200 commented 1 year ago

@qwopqwop200 I'm seeing higher than expected memory usage with this at full context. For 30B, it starts at 20GB of VRAM when generation starts, then slowly climbs to nearly 24GB before a CUDA OOM. Can anyone confirm this weird behavior? The python process shows 23937MiB in nvidia-smi just before the OOM crash.

I had a similar experience, but it doesn't seem to be a problem, since OOM does not occur for me at 2048 tokens.

ItsLogic commented 1 year ago

@qwopqwop200 I'm seeing higher than expected memory usage with this at full context. For 30B, it starts at 20GB of VRAM when generation starts, then slowly climbs to nearly 24GB before a CUDA OOM. Can anyone confirm this weird behavior? The python process shows 23937MiB in nvidia-smi just before the OOM crash.

I can confirm it too. I OOM when I try to generate text with an input longer than 1500 tokens. It seems that for 24GB GPUs we might need to reduce the max context to around 1500 instead of 2048.

MetaIX commented 1 year ago

That's a bit strange, 4-bit takes about half a GB per B. So the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM left over for context (deducted 2GB due to extra processes). I wonder if the context is really causing the OOM or if it's something else, assuming you're on Windows and you have 24GB of VRAM.
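
To put rough numbers on that estimate (my own back-of-envelope math, assuming LLaMA-33B has about 32.5B parameters, 60 layers and hidden size 6656, with an fp16 KV cache):

    n_params = 32.5e9                      # approximate LLaMA-33B parameter count
    weights_gb = n_params * 0.5 / 1e9      # 4-bit weights: ~0.5 bytes per parameter -> ~16 GB

    n_layers, hidden, seq_len = 60, 6656, 2048
    kv_cache_gb = 2 * n_layers * hidden * seq_len * 2 / 1e9   # K and V, fp16 (2 bytes each) -> ~3.3 GB

    print(f"weights ~= {weights_gb:.1f} GB, KV cache at full context ~= {kv_cache_gb:.1f} GB")
    # Activations and the attention score matrix (O(n^2) with naive attention) come on top,
    # which is plausibly where the extra VRAM at long contexts goes.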

IdiotSandwichTheThird commented 1 year ago

That’s a bit strange, 4-bit takes about half a GB per B. So the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM leftover for context (deducted 2GB due to extra processes.) I wonder if the context is really causing the OOM or if it’s something else, assuming you’re on windows and you have 24GB VRAM

Nah, it's the same on linux as well.

dustydecapod commented 1 year ago

I'm getting some very odd behaviors on 7b 4-bit from time to time; it seems to exacerbate LLaMA's tendency to go off the rails. I'm looking into this more deeply, but I think a more comprehensive calibration technique might be necessary to really make this performant in the capacity we'd like.

Titaniumtown commented 1 year ago

@qwopqwop200 the weights are faulty: https://rentry.org/llama-tard-v2

dustydecapod commented 1 year ago

@Titaniumtown which ones? the torrent ones? or the ones I posted through decapoda-research?

IdiotSandwichTheThird commented 1 year ago

@Titaniumtown which ones? the torrent ones? or the ones I posted through decapoda-research?

Link talks specifically about the torrent ones. 30B when btw ;-)

dustydecapod commented 1 year ago

@Titaniumtown which ones? the torrent ones? or the ones I posted through decapoda-research?

Link talks specifically about the torrent ones. 30B when btw ;-)

Link was long and too poorly formatted for my mildly hungover eyes ;)

This afternoon I'm looking into revising the conversion methodology to fix some issues I'm seeing with generation, and also working on a more thorough, objective method to evaluate the quality of conversions. The current evaluation method is limited and doesn't cover the sort of text generation this tool aims to provide.

Once I have that done I'll be re-converting everything from 7b up.

ItsLogic commented 1 year ago

That’s a bit strange, 4-bit takes about half a GB per B. So the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM leftover for context (deducted 2GB due to extra processes.) I wonder if the context is really causing the OOM or if it’s something else, assuming you’re on windows and you have 24GB VRAM

I'm on Arch Linux. I'll run through some numbers. On boot I have 1.2GiB used. After loading the model I have 17.9GiB used. A generation using the word "This" uses 18.5GiB. A generation with about 1500 tokens as input takes me up to 23.2GiB. A generation with just over 1600 tokens as input takes me up to 23.6GiB. Finally, a generation with about 1800 tokens gives me OOM:

CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.63 GiB total capacity; 21.58 GiB already allocated; 36.00 MiB free; 22.10 GiB reserved in total by PyTorch)
dustydecapod commented 1 year ago

I also intend to look this weekend at introducing a more efficient attention algorithm; the one implemented in the huggingface PR is one of the most memory-inefficient options out there (naive dot-product attention, whose space complexity in memory is O(n^2) in sequence length).

There are some really nice options already implemented in the xformers library that could work better, but this is all new to me, so figuring out how to actually do the implementation will take me some time as I have to grok several papers.

If there's anyone here familiar with implementing attention, please poke your head in and lend a hand.
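
To illustrate the difference being discussed, here is a sketch of naive dot-product attention next to xformers' memory-efficient kernel (causal masking omitted for brevity; this is not what the huggingface PR or this repo does today, and it assumes xformers is installed):

    import torch
    import xformers.ops as xops

    def naive_attention(q, k, v):
        """Plain dot-product attention: materializes a [seq, seq] score matrix per head,
        so memory grows as O(n^2) with sequence length."""
        # q, k, v: [batch, seq, heads, head_dim]
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))            # -> [batch, heads, seq, head_dim]
        scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5     # [batch, heads, seq, seq]
        return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2)  # back to [batch, seq, heads, head_dim]

    def efficient_attention(q, k, v):
        """Same result without ever holding the full score matrix in memory."""
        return xops.memory_efficient_attention(q, k, v)             # expects [batch, seq, heads, head_dim]

    # e.g. LLaMA-7B-sized heads at full context:
    # q = k = v = torch.randn(1, 2048, 32, 128, device="cuda", dtype=torch.float16)
    # out = efficient_attention(q, k, v)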

dustydecapod commented 1 year ago

That’s a bit strange, 4-bit takes about half a GB per B. So the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM leftover for context (deducted 2GB due to extra processes.) I wonder if the context is really causing the OOM or if it’s something else, assuming you’re on windows and you have 24GB VRAM

I'm on Arch Linux. I'll run through some numbers. On boot I have 1.2GiB used. After loading the model I have 17.9GiB used. A generation using the word "This" uses 18.5GiB. A generation with about 1500 tokens as input takes me up to 23.2GiB. A generation with just over 1600 tokens as input takes me up to 23.6GiB. Finally, a generation with about 1800 tokens gives me OOM:

CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.63 GiB total capacity; 21.58 GiB already allocated; 36.00 MiB free; 22.10 GiB reserved in total by PyTorch)

Which model? 7b? 7B should take no more than 4GB on a fresh load with no space allocated for attention. Can you share how you're starting server.py?

1800 tokens is OOMing because dot product attention blows chunks. Literally nobody is using dot product in production for any LLM model...

ItsLogic commented 1 year ago

Which model? 7b? 7B should take no more than 4GB on a fresh load with no space allocated for attention. Can you share how you're starting server.py?

1800 tokens is OOMing because dot product attention blows chunks. Literally nobody is using dot product in production for any LLM model...

30B

dustydecapod commented 1 year ago

30B

Ah ya, ok that sounds like the right amount of usage then.

jtang613 commented 1 year ago

Link was long and too poorly formatted for my mildly hungover eyes ;)

FWIW, looks like @Titaniumtown accidentally posted the edit link; the 'view' link is much more readable: https://rentry.org/llama-tard-v2