I think it's because of 4-bit quantization loss, so they're different models, especially for the 4-bit model, considering that it is not RTN-based. The current SOTA 4-bit model reaches 5.85 perplexity on WikiText-2 while the FP16 version gets 5.68, so FP16 is still better than the quantized model.
LLaMA-7B | Bits | Group size | Memory (MiB) | WikiText-2 PPL | Checkpoint size (GB) |
---|---|---|---|---|---|
FP16 | 16 | - | 13940 | 5.68 | 12.5 |
RTN | 4 | - | - | 6.29 | - |
GPTQ | 4 | - | 4740 | 6.09 | 3.5 |
GPTQ | 4 | 128 | 4891 | 5.85 | 3.6 |
RTN | 3 | - | - | 25.54 | - |
GPTQ | 3 | - | 3852 | 8.07 | 2.7 |
GPTQ | 3 | 128 | 4116 | 6.61 | 3.0 |
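For context, RTN (round-to-nearest) is the naive baseline in the rows above: each weight group is independently scaled to the integer grid and rounded, with no calibration data. Below is a minimal per-group RTN sketch in PyTorch, illustrative only; it is not the actual GPTQ-for-LLaMa kernels, and the group size of 128 simply mirrors the table's convention:

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest: scale each group to the integer grid and round.

    Unlike GPTQ, there is no calibration data and no error compensation,
    which is why RTN degrades sharply at 3 bits in the table above.
    """
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, group_size)                # one scale/zero per group
    wmin = groups.min(dim=1, keepdim=True).values
    wmax = groups.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax      # per-group step size
    zero = torch.round(-wmin / scale)                 # asymmetric zero point
    q = torch.clamp(torch.round(groups / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(w.shape)      # dequantized approximation

# Reconstruction error roughly quadruples per bit removed
# (cf. the jump from the 4-bit to the 3-bit RTN row):
w = torch.randn(4096, 4096)
for bits in (4, 3):
    mse = (rtn_quantize(w, bits) - w).pow(2).mean().item()
    print(f"{bits}-bit RTN reconstruction MSE: {mse:.6f}")
```

GPTQ, by contrast, rounds weights one column at a time and updates the remaining weights against calibration activations to compensate the rounding error, which is how it recovers most of the 3-bit gap in the table.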
> they're different models, especially for the 4-bit model, considering that it is not RTN-based
That makes perfect sense, thanks for the reply. I was worried that I might have done something wrong while adapting the web UI to work with the code here. I'll do a final review and then approve/document my own PR.

4-bit + LoRA seems to be the holy grail of consumer-grade LLM inference at the moment, and it's nice to see it working. Thanks for your work.
Using this prompt:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write one paragraph

### Response:
```
I get these results for the `tloen/alpaca-lora-7b` LoRA applied on top of llama-7b:

Mode | Result |
---|---|
4-bit | The world is a vast and mysterious place, filled with wonders and surprises. We are all connected to each other and to the universe, and we have the power to shape our own destinies. We can choose to follow our own paths, or to follow the paths of others. We can choose to be brave and take risks, or to be cautious and play it safe. We can choose to be kind and compassionate, or to be selfish and cruel. We can choose to be creative and innovative, or to be complacent and unambitious. We can choose to be optimistic and hopeful, or to be pessimistic and hopeless. We can choose to be open-minded and accepting, or to be closed-minded and judgmental. We can choose to be generous and kind, or to be selfish and greedy. We can choose to be humble and respectful, or |
16-bit | The world is a beautiful place, full of wonders and surprises. Every day, we are presented with new opportunities to explore and discover. We can learn new things, make new friends, and experience new cultures. Life is a journey, and it is up to us to make the most of it. |
8-bit | The world is a beautiful place, full of wonders and surprises. From the majestic mountains to the deep blue oceans, there is so much to explore and discover. Nature is full of surprises, from the majestic beauty of a sunrise to the majestic beauty of a sunset. The world is full of surprises, and it is up to us to take advantage of them and make the most of our lives. |

In all cases, the generation uses `do_sample=False` for greedy sampling. The 4-bit model used is `llama-7b-4bit-128g`. The code that I am using is the one in this PR: oobabooga/text-generation-webui#1200
Is this difference something to worry about? In all my tests, the 4-bit results diverge a lot from the 16-bit/8-bit results.
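For the 16-bit and 8-bit rows, the setup described above corresponds to the standard transformers + peft path. Here is a minimal sketch under that assumption; the base repo id is a placeholder for whatever llama-7b checkpoint you have in HF format, and the 4-bit path goes through the GPTQ loader from the linked PR instead:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

# Placeholder: substitute your own llama-7b checkpoint in HF format.
base = "decapoda-research/llama-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(base)

# 8-bit: load_in_8bit=True routes linear layers through bitsandbytes.
# For the 16-bit run, drop load_in_8bit and pass torch_dtype=torch.float16.
model = LlamaForCausalLM.from_pretrained(base, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWrite one paragraph\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=False gives greedy decoding, matching the comparison above.
with torch.no_grad():
    out = model.generate(**inputs, do_sample=False, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With greedy decoding, the 16-bit and 8-bit runs should be reproducible; any remaining divergence between modes comes from the quantization itself.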
May I ask how to use 8-bit?