qwopqwop200 closed this issue 1 year ago
@ItsLogic Thanks, following your steps I got 13B working on my 3080Ti. However, I find the response time in chat mode very slow: it takes a long time before the GPU starts loading up and generating text, which is annoying for a chatbot. On the other hand, notebook mode works like a charm. Wondering why.
A library would be nice.
I have tried loading the model without success so far. Here is what I did:
1. Install the updated pull request code:

   ```
   pip uninstall transformers
   pip install git+https://github.com/zphang/transformers@llama_push
   ```

2. Re-convert LLaMA-7b using the updated [convert_llama_weights_to_hf.py](https://github.com/zphang/transformers/blob/llama_push/src/transformers/models/llama/convert_llama_weights_to_hf.py) and put the result into the `models/llama-7b-new` folder.

3. Put this file into the `models` folder: https://huggingface.co/decapoda-research/llama-smallint-pt/resolve/main/llama-7b-4bit.pt

4. Load the model with:

   ```python
   model = load_quant("models/llama-7b-new", "models/llama-7b-4bit.pt", 4)
   model = model.to(torch.device('cuda:0'))
   ```
I got this error:
```
Unexpected key(s) in state_dict: "model.decoder.embed_tokens.weight", "model.decoder.layers.0.self_attn.q_proj.zeros", "model.decoder.layers.0.self_attn.q_proj.scales", "model.decoder.layers.0.self_attn.q_proj.bias", "model.decoder.layers.0.self_attn.q_proj.qweight", "model.decoder.layers.0.self_attn.k_proj.zeros", "model.decoder.layers.0.self_attn.k_proj.scales", "model.decoder.layers.0.self_attn.k_proj.bias", "model.decoder.layers.0.self_attn.k_proj.qweight (...)
```
Any idea what I am doing wrong?
I ran into this issue as well when I used the originally released PyTorch-converted files from the Facebook leak. I had to reconvert all the models from the original leak -- it seems that the 16-bit HF models that were published previously were converted without the transformers changes that changed (or added?) handling of these additional layer properties.
@sgsdxzy You might not be using the GPTQ loader. I ran into that also and had to "if True" line 45. Put a print() above your load_quant() to see if it's being reached or not.
@rohvani I just pushed fresh conversions to the hub, under decapoda-research. The transformers lib's changes have been constantly evolving, and it takes a lot of time to keep up -- especially since I'm publishing all 4 model sizes.
Awesome, thanks @zoidbb 😄 your HF repos have been a huge help! Is there a discord/chat for decapoda-research?
Not currently. If I end up having time to dedicate to this more fully, I might put something together. Right now I'm doing this as a fun little hobby during sabbatical.
I'm getting this when trying to set up the kernel:

```
The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).
```
> @sgsdxzy You might not be using the GPTQ loader. I ran into that also and had to "if True" line 45. Put a print() above your load_quant() to see if it's being reached or not.

I can make sure `model=llama.load_quant(...` is executed. Otherwise the original 13B model cannot fit into a 3080Ti.
> I'm getting this when trying to set up the kernel: The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).

One way to get past that is to update your CUDA version to 11.6 or 11.7. I am still, however, stuck on `ModuleNotFoundError: No module named 'gptq'`. Feels like trying to get various early optimizations of Stable Diffusion running back in the day haha, good times.
> I'm getting this when trying to set up the kernel: The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).

You need a newer version of CUDA. I'm using 11.8. I wouldn't suggest going beyond 11.8 as torch doesn't support higher than that currently.
> I'm getting this when trying to set up the kernel: The current installed version of g++ (11.3.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).
>
> One way to get past that is to update your CUDA version to 11.6 or 11.7. I am still, however, stuck on `ModuleNotFoundError: No module named 'gptq'`. Feels like trying to get various early optimizations of Stable Diffusion running back in the day haha, good times.

Here is the change I made to my `models.py`, note the `sys.path.insert` call:
```python
if(shared.args.load_in_4bit):
    print('loading 4 bit')
    # make the GPTQ-for-LLaMa code importable
    import sys
    sys.path.insert(1, './gptqllama')
    from gptqllama import llama
    # load the 4-bit checkpoint, then move the model to the GPU
    model = llama.load_quant('/root/text-generation-webui/models/LLaMA-13B/', '/root/text-generation-webui/models/LLaMA-13B/llama13b-4bit.pt', 4)
    model = model.to(torch.device('cuda:0'))
    tokenizer = LLaMATokenizer.from_pretrained("/root/text-generation-webui/models/LLaMA-13B/", device_map='auto')
    return model, tokenizer
```
I had success using the 4bit models from this magnet: magnet:?
I don't think it's safe to download random pytorch models by torrent.
> I just pushed fresh conversions to the hub, under decapoda-research
I can't find a new conversion for llama-7b @zoidbb. Will it appear here? https://huggingface.co/decapoda-research/llama-7b-hf-int4/tree/main
@oobabooga you're a few minutes ahead of me, it's still uploading :P I'm pre-populating the repos while the conversions finish.
> You need a newer version of CUDA. I'm using 11.8. I wouldn't suggest going beyond 11.8 as torch doesn't support higher than that currently.

thanks :)
@oobabooga 7b is done, 13b and larger are still baking.
@David-337 I've uploaded my "GPTQ Janky AF" (but working) code to my fork at https://github.com/jtang613/text-generation-webui if you want to take a peek. It's entirely based on other people's hard work. I just put it in one place.
> 7b is done, 13b and larger are still baking.

Thanks!! I have tested your new llama-7b-4bit.pt and it worked. I have generated the reply below using 4979MiB of VRAM:
> @David-337 I've uploaded my "GPTQ Janky AF" (but working) code to my fork at https://github.com/jtang613/text-generation-webui if you want to take a peek. It's entirely based on other people's hard work. I just put it in one place.

Also it seems like you hard-coded the model path:

```python
if(shared.args.load_in_4bit):
    print("Loading GPTQ ...")
    model = llama.load_quant("/mnt/data/ml/oobabooga/llama-13b/", "/mnt/data/ml/oobabooga/llama-13b/llama13b-4bit.pt", 4)
    model = model.to(torch.device('cuda:0'))
```
Seems this branch was just made, exciting! https://github.com/oobabooga/text-generation-webui/tree/llama-4bit
Here is a cleaned up PR with instructions:
https://github.com/oobabooga/text-generation-webui/pull/206
I will wait for this to be resolved before merging:
https://github.com/huggingface/transformers/pull/21955#issuecomment-1462540212
> @David-337 I've uploaded my "GPTQ Janky AF" (but working) code to my fork at https://github.com/jtang613/text-generation-webui if you want to take a peek. It's entirely based on other people's hard work. I just put it in one place.
>
> Also it seems like you hard-coded the model path: `if(shared.args.load_in_4bit): print("Loading GPTQ ...") model = llama.load_quant("/mnt/data/ml/oobabooga/llama-13b/", "/mnt/data/ml/oobabooga/llama-13b/llama13b-4bit.pt", 4) model = model.to(torch.device('cuda:0'))`

You missed the "Janky AF" and "for reference only" part. Best to wait for official support.
@jtang613 #206 dumped the entirety of `llama.py` (other than main) into `models.py` because I didn't care about code that didn't do anything. I copied all of the .py files from the GPTQ repo into the `modules` folder for the same reason. I added the `--load-in-4bit` launch arg to `shared.py`.

For reproducibility, can you post your conversion process? Is there a script you used to produce the 4bit models?
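(For anyone reading along: the `shared.py` change described above presumably amounts to one extra argparse flag. The sketch below is an illustration of that idea, not the actual PR code; the help text and wiring are assumptions.)

```python
# Sketch of modules/shared.py: the --load-in-4bit flag described above, roughly
# as it might be declared. The actual PR may name or wire things differently.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--load-in-4bit', action='store_true',
                    help='Load the model via the GPTQ 4-bit code in modules/.')
args = parser.parse_args()

# models.py can then branch on shared.args.load_in_4bit, as in the snippets above.
```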
> dumped the entirety of `llama.py` (other than main) into `models.py` because I didn't care about code that didn't do anything. I copied all of the .py files from the GPTQ repo into the `modules` folder for the same reason. I added the `--load-in-4bit` launch arg to `shared.py`.
>
> For reproducibility, can you post your conversion process? Is there a script you used to produce the 4bit models?
Pretty sure the conversation has moved here https://github.com/oobabooga/text-generation-webui/pull/206
> dumped the entirety of `llama.py` (other than main) into `models.py` because I didn't care about code that didn't do anything. I copied all of the .py files from the GPTQ repo into the `modules` folder for the same reason. I added the `--load-in-4bit` launch arg to `shared.py`.
>
> For reproducibility, can you post your conversion process? Is there a script you used to produce the 4bit models?

Just like generic said, you should now be using #206 for 4bit quant. The instructions in the GPTQ repo were used to convert the models to 4bit .pt files.
I'm eagerly waiting for @zoidbb to upload the 4-bit version of the 30B model 😃
> I'm eagerly waiting for @zoidbb to upload the 4-bit version of the 30B model 😃

plz sir I'd like some more multi-gpu support
> I'm eagerly waiting for @zoidbb to upload the 4-bit version of the 30B model 😃
Someone uploaded a magnet with all of 7B, 13B, 33B, and 65B. I've confirmed that 33B works.
https://rentry.org/llama-tard-v2#bonus-3-convert-the-weights-yourself-optional-recommended
@qwopqwop200 is that magnet link based off the new or the old huggingface weights (idk the terminology but there was a change in the format used in the LLaMA PR)
> @qwopqwop200 is that magnet link based off the new or the old huggingface weights (idk the terminology but there was a change in the format used in the LLaMA PR)

It seems to work just fine. Probably based on a new PR. It shows an impressive result of 4.59 ppl at 4-bit 33B.
@qwopqwop200 I'm seeing higher than expected memory usage with this at full context. For 30B, it starts at 20GB of VRAM when generation starts, then slowly climbs to nearly 24GB before CUDA OOM. Can anyone confirm this weird behavior?

`python 23937MiB` <- python process in nvidia-smi just before the OOM crash
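(Aside: for anyone trying to reproduce these numbers from inside the process rather than by watching nvidia-smi, PyTorch exposes its allocator statistics. A minimal sketch; the helper name is just illustrative.)

```python
import torch

def vram_report(tag: str) -> None:
    # PyTorch's allocator view of GPU memory; nvidia-smi shows a somewhat higher
    # per-process figure because of the CUDA context and fragmentation.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# e.g. call vram_report("after load") right after load_quant(), and again after
# each generation to watch the climb toward OOM.
```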
4bit multi-gpu support? I wanna run 65B!!! (I've looked at your commits, my god you are working hard, thanks for everything :) )
multigpu support here, tested on two 3090s
Does the 4-bit quantization here use floating point or integer values? It appears that optimal results in accuracy are achieved with FP4 rather than INT4.
Looking at Figure 5 in the paper, GPTQ beats float.
> @qwopqwop200 I'm seeing higher than expected memory usage with this at full context. For 30B, it starts at 20GB of VRAM when generation starts, then slowly climbs to nearly 24GB before CUDA OOM. Can anyone confirm this weird behavior?
>
> `python 23937MiB` <- python process in nvidia-smi just before the OOM crash

I had a similar experience, but it doesn't seem to be a problem because OOM does not occur at 2048 tokens.
> @qwopqwop200 I'm seeing higher than expected memory usage with this at full context. For 30B, it starts at 20GB of VRAM when generation starts, then slowly climbs to nearly 24GB before CUDA OOM. Can anyone confirm this weird behavior?
>
> `python 23937MiB` <- python process in nvidia-smi just before the OOM crash

I can confirm it too. I OOM when I try to generate text with an input longer than 1500 tokens. Seems for 24G GPUs we might need to reduce max context to around 1500 instead of 2048.
That's a bit strange: 4-bit takes about half a GB per B, so the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM leftover for context (2 GB deducted for other processes). I wonder if the context is really causing the OOM or if it's something else, assuming you're on Windows and you have 24GB of VRAM.
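(To make the arithmetic above explicit, a back-of-the-envelope sketch; real usage also includes the CUDA context, activations, and the KV cache, which is what grows with context length.)

```python
# Rough VRAM estimate for a 4-bit quantized model, following the reasoning above.
params_b = 33          # model size in billions of parameters
bytes_per_param = 0.5  # 4-bit weights -> half a byte per parameter
weights_gb = params_b * bytes_per_param   # ~16.5 GB just for the weights
budget_gb = 24 - 2                        # 24 GB card minus ~2 GB for other processes
headroom_gb = budget_gb - weights_gb      # ~5.5 GB left for activations / KV cache
print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```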
> That's a bit strange: 4-bit takes about half a GB per B, so the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM leftover for context (2 GB deducted for other processes). I wonder if the context is really causing the OOM or if it's something else, assuming you're on Windows and you have 24GB of VRAM.

Nah, it's the same on Linux as well.
I'm getting some very odd behaviors on 7b 4-bit from time to time; it seems to exacerbate llama's tendency to go off the rails. I'm looking into this more deeply, but I think a more comprehensive calibration technique might be necessary to really make this performant in the capacity we'd like.
@qwopqwop200 the weights are faulty: https://rentry.org/llama-tard-v2
@Titaniumtown which ones? the torrent ones? or the ones I posted through decapoda-research?
> @Titaniumtown which ones? the torrent ones? or the ones I posted through decapoda-research?

Link talks specifically about the torrent ones. 30B when btw ;-)
> @Titaniumtown which ones? the torrent ones? or the ones I posted through decapoda-research?
>
> Link talks specifically about the torrent ones. 30B when btw ;-)

Link was long and too poorly formatted for my mildly hungover eyes ;)
This afternoon I'm looking into revising the conversion methodology to fix some issues I'm seeing with generation, also working on a more thorough objective method to evaluate the quality of conversions. The current evaluation method is limited and doesn't cover the sort of text generation this tool aims to provide.
Once I have that done I'll be re-converting everything from 7b up.
> That's a bit strange: 4-bit takes about half a GB per B, so the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM leftover for context (2 GB deducted for other processes). I wonder if the context is really causing the OOM or if it's something else, assuming you're on Windows and you have 24GB of VRAM.

I'm on Arch Linux. I'll run through some numbers. On boot I have 1.2GiB used. After loading the model I have 17.9GiB used. A generation using the word "This" uses 18.5GiB. A generation with about 1500 tokens as input takes me up to 23.2GiB. A generation with just over 1600 tokens as input takes me up to 23.6GiB. Finally, a generation with about 1800 tokens gives me OOM:

```
CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.63 GiB total capacity; 21.58 GiB already allocated; 36.00 MiB free; 22.10 GiB reserved in total by PyTorch)
```
I also intend to look this weekend at introducing a more efficient attention algorithm; the algorithm implemented in the huggingface PR is one of the most memory-inefficient options out there (dot-product attention, whose space complexity is O(n^2) in the sequence length). There are some really nice options already implemented in the xformers library that could work better, but this is all new to me, so figuring out how to actually do the implementation will take me some time as I have to grok several papers. If there's anyone here familiar with implementing attention, please poke your head in and lend a hand.
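(To illustrate why plain dot-product attention is the memory problem here, a toy sketch; this is not the HF or xformers implementation, just the shape of the issue.)

```python
import torch

def naive_attention(q, k, v):
    # Materializes the full (n x n) attention matrix, so memory grows as O(n^2)
    # in the sequence length n; this is what blows up at long contexts.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Memory-efficient implementations (e.g. xformers' memory_efficient_attention)
# compute the same result in tiles without ever storing the full n x n matrix,
# which is the kind of drop-in replacement discussed above.
```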
> That's a bit strange: 4-bit takes about half a GB per B, so the whole model (which is 33B) should fit in about 16.5 GB of VRAM. You should have ~5 GB of VRAM leftover for context (2 GB deducted for other processes). I wonder if the context is really causing the OOM or if it's something else, assuming you're on Windows and you have 24GB of VRAM.
>
> I'm on Arch Linux. I'll run through some numbers. On boot I have 1.2GiB used. After loading the model I have 17.9GiB used. A generation using the word "This" uses 18.5GiB. A generation with about 1500 tokens as input takes me up to 23.2GiB. A generation with just over 1600 tokens as input takes me up to 23.6GiB. Finally, a generation with about 1800 tokens gives me OOM: `CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.63 GiB total capacity; 21.58 GiB already allocated; 36.00 MiB free; 22.10 GiB reserved in total by PyTorch)`
Which model? 7b? 7B should take no more than 4GB on a fresh load with no space allocated for attention. Can you share how you're starting server.py?
1800 tokens is OOMing because dot product attention blows chunks. Literally nobody is using dot product in production for any LLM model...
> Which model? 7b? 7B should take no more than 4GB on a fresh load with no space allocated for attention. Can you share how you're starting server.py?
>
> 1800 tokens is OOMing because dot product attention blows chunks. Literally nobody is using dot product in production for any LLM model...

30B

> 30B

Ah ya, ok that sounds like the right amount of usage then.
> Link was long and too poorly formatted for my mildly hungover eyes ;)

FWIW it looks like @Titaniumtown accidentally posted the edit link; the 'view' link is much more readable: https://rentry.org/llama-tard-v2
GPTQ is currently the SOTA one-shot quantization method for LLMs. GPTQ supports amazingly low 3-bit and 4-bit weight quantization, and it can be applied to LLaMa. I've actually confirmed that this works well on LLaMa 7b. I haven't tested the memory usage (n-bit CUDA kernel), but I think it should work.

code: https://github.com/qwopqwop200/GPTQ-for-LLaMa
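(For reference, the loading pattern used throughout this thread boils down to the sketch below. The paths and the `gptqllama` module name are taken from the snippets above; adapt them to wherever you cloned the repo and saved the checkpoint.)

```python
import sys
import torch

# Make the GPTQ-for-LLaMa code importable; the module path mirrors the earlier
# snippets in this thread and will differ depending on where the repo is cloned.
sys.path.insert(1, './gptqllama')
from gptqllama import llama

# load_quant(model_dir, quantized_checkpoint, wbits), as in the examples above.
model = llama.load_quant("models/llama-7b-new", "models/llama-7b-4bit.pt", 4)
model = model.to(torch.device("cuda:0"))
```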