turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Llama 70B model with 2.5bpw produced weird responses #123

Closed · tomleung1996 closed this issue 9 months ago

tomleung1996 commented 1 year ago

Hi, I am new to LLM quantization and wanted to use text-generation-webui to run a local LLM chatbot.

ExLlamaV2 worked great when I applied it to the 13B model with 8bpw, but when it came to the 70B model with 2.5bpw, I got strange results as below: [screenshot]

I think this was not the problem of the web UI because it also occurred while using examples/chat.py.

[screenshot]

fgdfgfthgr-fox commented 1 year ago

Are you using ROCm?

tomleung1996 commented 1 year ago

> Are you using ROCm?

No, I am using an RTX 4090 with CUDA 11.8.

turboderp commented 1 year ago

Which model is this, and are you sure you're using the right prompt format? From the screenshot it looks like you're running with the Llama format, which includes a BOS token at the beginning of each prompt. Some models go bonkers when they see that. TGW has the option to not include the BOS, and the CLI chatbot has a number of other prompt formats you can try (run with -modes to get a list of options.)
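
For example, something like this (the model path is a placeholder, and the flag names are from memory of examples/chat.py, so trust the script's own -h/-modes output over this):

# list the prompt formats chat.py knows about
python examples/chat.py -modes

# start a chat session using the llama prompt format
python examples/chat.py -m /path/to/Llama2-70B-chat-exl2 -mode llama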

tomleung1996 commented 1 year ago

> Which model is this, and are you sure you're using the right prompt format? From the screenshot it looks like you're running with the Llama format, which includes a BOS token at the beginning of each prompt. Some models go bonkers when they see that. TGW has the option to not include the BOS, and the CLI chatbot has a number of other prompt formats you can try (run with -modes to get a list of options.)

I am using the Llama2-70B-chat-hf model. I was using the ExLlamav2_HF model loader, with "Add the bos_token to the beginning of prompts" unchecked in the Parameters tab, and chatting in the Chat tab using chat-instruct mode. However, my problem still persists.

Which mode should I use for the quantized model?

turboderp commented 1 year ago

If it's Llama2-chat, you should use the llama mode. There are some conversions here of Llama2-70B-chat that you can compare against to see if the problem is with the quantized model or if something's going wrong in inference.

oobabooga commented 1 year ago

I have reproduced this issue with a Xwin-LM-70B-V0.1-EXL2-2.500b that I created, and the solution of not using the BOS token worked for me:

With BOS: [screenshot] / Without BOS: [screenshot]

In both cases, the correct prompt format is used (Vicuna-v1.1).

Interestingly, I also created a Llama-2-70b-chat-EXL2-2.500b and it generated coherent outputs without removing the BOS.

For both quants, I used this parquet file and the following conversion command:

python convert.py \
  -i ../Xwin-LM_Xwin-LM-70B-V0.1_safetensors \
  -o ~/working \
  -cf Xwin-LM-70B-V0.1-EXL2-2.500b \
  -c Evol-Instruct-Code-80k-v1.parquet \
  -b 2.500

turboderp commented 1 year ago

It makes sense for Llama2-chat that it would work with the BOS token, since the prompt format includes it on every round. For that Xwin model (and however many others) it'll be down to how it's been finetuned. There isn't really a "correct" way to format training examples, so as long as people publishing models don't include some examples of fully formatted training data (with the resulting encoding), we end up having to guess.

Here, I would speculate that the finetuning was somewhat aggressive and maybe used <s> in an unusual way. Maybe every training example had <s> followed by a newline token or some such, and now the model gets wildly confused whenever <s> is followed by anything else. Who can say.

To make matters worse, depending on how much the model has been finetuned (learning rate, number of epochs and so on) it may still output coherent text even if it isn't being prompted correctly, it just won't necessarily behave as intended. E.g. it might adhere to the prompt format most of the time, then suddenly deviate from it for no apparent reason.

oobabooga commented 1 year ago

> Here, I would speculate that the finetuning was somewhat aggressive and maybe used <s> in an unusual way. Maybe every training example had <s> followed by a newline token or some such, and now the model gets wildly confused whenever <s> is followed by anything else. Who can say.

That would make sense. What can be done then is to let people know about the need to remove the BOS token. I have uploaded the two quants that I mentioned to HF and added a small note to the README about this issue:

https://huggingface.co/oobabooga/Llama-2-70b-chat-EXL2-2.500b

https://huggingface.co/oobabooga/Xwin-LM-70B-V0.1-EXL2-2.500b

tomleung1996 commented 1 year ago

> If it's Llama2-chat, you should use the llama mode. There are some conversions here of Llama2-70B-chat that you can compare against to see if the problem is with the quantized model or if something's going wrong in inference.

The conversions you provided worked for me (without removing the BOS)! There could be something wrong in my quantization process, but I did manage to quantize a 13B model and run it with no problem.

I also noticed a difference that TGW can auto-detect the model loader for your conversions but not mine (even for the 13B model).

RandomInternetPreson commented 1 year ago

Hey all, just another data point to add to the discussion. I quantized Xwin-LM-70B-V0.1 at 8.000 bpw and it worked without needing to change anything in the default Ooba text-gen-webui settings (BOS is being used). I used the same model and calibration data as documented in oobabooga's post; the only change was setting the last-layer bit precision to 8 instead of the default 6.

I'll be working on a llama2-70B-chat conversion later today with the same 8.000 bpw setup.
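
A sketch of what that conversion command would look like (reusing oobabooga's paths from above as placeholders; -hb is, if I recall correctly, the convert.py flag that sets the output/head-layer precision, so verify with python convert.py -h):

python convert.py \
  -i ../Xwin-LM_Xwin-LM-70B-V0.1_safetensors \
  -o ~/working \
  -cf Xwin-LM-70B-V0.1-EXL2-8.000b \
  -c Evol-Instruct-Code-80k-v1.parquet \
  -b 8.000 \
  -hb 8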

greyfoxone commented 11 months ago

> I have reproduced this issue with a Xwin-LM-70B-V0.1-EXL2-2.500b that I created, and the solution of not using the BOS token worked for me [...]

I have the exact same issue. I downloaded turboderp_Llama2-70B-exl2 at 2.5 bpw, and I have a 3090 24GB. The LLM mostly gets hung up on the word "cord": [screenshot]

How do I fix this? I do not completely understand your comment. I am new to this, but my understanding is that the LLM gets confused by the BOS token and that there is an option to disable it? Where do I find that option in the oobabooga GUI?

turboderp commented 11 months ago

It's under parameters->generation, on the right-hand side of the page.

[screenshot]

turboderp commented 9 months ago

Closing this now. For reference, the new measurement procedure seems to have fixed the "cord cord stringbuilder" issue on low-bitrate models, regardless of BOS settings.