I think it's because of 4-bit quantization loss, so they're different models, especially for the 4-bit model, considering that it is not RTN-based. The current SOTA 4-bit model reaches 5.85 perplexity on WikiText-2 while the FP16 version gets 5.68, so FP16 is still better than the quantized model.
LLaMA-7B | Bits | Group size | Memory (MiB) | WikiText-2 PPL | Checkpoint size (GB) |
---|---|---|---|---|---|
FP16 | 16 | - | 13940 | 5.68 | 12.5 |
RTN | 4 | - | - | 6.29 | - |
GPTQ | 4 | - | 4740 | 6.09 | 3.5 |
GPTQ | 4 | 128 | 4891 | 5.85 | 3.6 |
RTN | 3 | - | - | 25.54 | - |
GPTQ | 3 | - | 3852 | 8.07 | 2.7 |
GPTQ | 3 | 128 | 4116 | 6.61 | 3.0 |
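For context, RTN (round-to-nearest) is the naive baseline in the rows above: each weight group is independently scaled to the integer grid and rounded, with no calibration data. Below is a minimal per-group RTN sketch in PyTorch, illustrative only; it is not the actual GPTQ-for-LLaMa kernels, and the group size of 128 simply mirrors the table's convention:

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest: scale each group to the integer grid and round.

    Unlike GPTQ, there is no calibration data and no error compensation,
    which is why RTN degrades sharply at 3 bits in the table above.
    """
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, group_size)                # one scale/zero per group
    wmin = groups.min(dim=1, keepdim=True).values
    wmax = groups.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax      # per-group step size
    zero = torch.round(-wmin / scale)                 # asymmetric zero point
    q = torch.clamp(torch.round(groups / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(w.shape)      # dequantized approximation

# Reconstruction error roughly quadruples per bit removed
# (cf. the jump from the 4-bit to the 3-bit RTN row):
w = torch.randn(4096, 4096)
for bits in (4, 3):
    mse = (rtn_quantize(w, bits) - w).pow(2).mean().item()
    print(f"{bits}-bit RTN reconstruction MSE: {mse:.6f}")
```

GPTQ, by contrast, rounds weights one column at a time and updates the remaining weights against calibration activations to compensate the rounding error, which is how it recovers most of the 3-bit gap in the table.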
> they're different models, especially for the 4-bit model, considering that it is not RTN-based
That makes perfect sense, thanks for the reply. I was worried that I might have done something wrong while adapting the web UI to work with the code here. I'll do a final review and then approve/document my own PR.

4-bit + LoRA seems to be the holy grail of consumer-grade LLM inference at the moment, and it's nice to see it working. Thanks for your work.
Using this prompt:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write one paragraph

### Response:
```
I get these results for the `tloen/alpaca-lora-7b` LoRA applied on top of llama-7b:

Mode | Result |
---|---|
4-bit | The world is a vast and mysterious place, filled with wonders and surprises. We are all connected to each other and to the universe, and we have the power to shape our own destinies. We can choose to follow our own paths, or to follow the paths of others. We can choose to be brave and take risks, or to be cautious and play it safe. We can choose to be kind and compassionate, or to be selfish and cruel. We can choose to be creative and innovative, or to be complacent and unambitious. We can choose to be optimistic and hopeful, or to be pessimistic and hopeless. We can choose to be open-minded and accepting, or to be closed-minded and judgmental. We can choose to be generous and kind, or to be selfish and greedy. We can choose to be humble and respectful, or |
16-bit | The world is a beautiful place, full of wonders and surprises. Every day, we are presented with new opportunities to explore and discover. We can learn new things, make new friends, and experience new cultures. Life is a journey, and it is up to us to make the most of it. |
8-bit | The world is a beautiful place, full of wonders and surprises. From the majestic mountains to the deep blue oceans, there is so much to explore and discover. Nature is full of surprises, from the majestic beauty of a sunrise to the majestic beauty of a sunset. The world is full of surprises, and it is up to us to take advantage of them and make the most of our lives. |

In all cases, the generation uses `do_sample=False` for greedy sampling. The 4-bit model used is `llama-7b-4bit-128g`. The code that I am using is the one in this PR: oobabooga/text-generation-webui#1200
Is this difference something to worry about? In all my tests, the 4-bit results diverge a lot from the 16-bit/8-bit results.
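For the 16-bit and 8-bit rows, the setup described above corresponds to the standard transformers + peft path. Here is a minimal sketch under that assumption; the base repo id is a placeholder for whatever llama-7b checkpoint you have in HF format, and the 4-bit path goes through the GPTQ loader from the linked PR instead:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

# Placeholder: substitute your own llama-7b checkpoint in HF format.
base = "decapoda-research/llama-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(base)

# 8-bit: load_in_8bit=True routes linear layers through bitsandbytes.
# For the 16-bit run, drop load_in_8bit and pass torch_dtype=torch.float16.
model = LlamaForCausalLM.from_pretrained(base, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWrite one paragraph\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=False gives greedy decoding, matching the comparison above.
with torch.no_grad():
    out = model.generate(**inputs, do_sample=False, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With greedy decoding, the 16-bit and 8-bit runs should be reproducible; any remaining divergence between modes comes from the quantization itself.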
May I ask how to use 8-bit?