Are you using ROCm?
No, I am using an RTX4090 with CUDA 11.8.
Which model is this, and are you sure you're using the right prompt format? From the screenshot it looks like you're running with the Llama format, which includes a BOS token at the beginning of each prompt. Some models go bonkers when they see that. TGW has the option to not include the BOS, and the CLI chatbot has a number of other prompt formats you can try (run with -modes to get a list of options.)
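To make the BOS point concrete, here is a minimal sketch (using the generic Hugging Face tokenizer API rather than TGW's internals; the model path is just a placeholder) of how that toggle changes the token IDs the model actually sees:

# Minimal sketch: compare prompt token IDs with and without the BOS token.
# The path below is a placeholder for wherever the HF tokenizer files live.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/Llama-2-70b-chat-hf")
prompt = "[INST] Hello, who are you? [/INST]"

with_bos = tok.encode(prompt)                               # Llama tokenizers prepend <s> by default
without_bos = tok.encode(prompt, add_special_tokens=False)  # same prompt, no BOS

print(with_bos[:4], without_bos[:4])  # the first list starts with tok.bos_token_id, the second does not

Which of the two forms matches what a given model saw during finetuning is exactly what varies from model to model here.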
I am using the Llama2-70B-chat-hf model. I was using the ExLlamav2_HF model loader, with "Add the bos_token to the beginning of prompts" unchecked in the parameters tab, and chatting in the chat tab using chat-instruct mode. However, my problem still persists. Which mode should I use for the quantized model?
If it's Llama2-chat, you should use the llama mode. There are some conversions here of Llama2-70B-chat that you can compare against to see if the problem is with the quantized model or if something's going wrong in inference.
I have reproduced this issue with an Xwin-LM-70B-V0.1-EXL2-2.500b quant that I created, and the solution of not using the BOS token worked for me:
(Screenshots: output with BOS vs. without BOS.)
In both cases, the correct prompt format is used (Vicuna-v1.1).
Interestingly, I also created a Llama-2-70b-chat-EXL2-2.500b and it generated coherent outputs without removing the BOS.
For both quants, I used this parquet file and the following conversion command:
python convert.py \
-i ../Xwin-LM_Xwin-LM-70B-V0.1_safetensors \
-o ~/working \
-cf Xwin-LM-70B-V0.1-EXL2-2.500b \
-c Evol-Instruct-Code-80k-v1.parquet \
-b 2.500
It makes sense for Llama2-chat that it would work with the BOS token, since the prompt format includes it on every round. For that Xwin model (and however many others) it'll be down to how it's been finetuned. There isn't really a "correct" way to format training examples, so as long as people publishing models don't include some examples of fully formatted training data (with the resulting encoding), we end up having to guess.
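For reference, a rough sketch of how a multi-turn Llama2-chat prompt is assembled (the dialog strings are made up for illustration; in practice the <s>/</s> markers are usually added as BOS/EOS token IDs by the tokenizer rather than as literal text):

# Rough sketch of the Llama2-chat template: every round is wrapped in BOS/EOS,
# so a leading BOS is exactly what this model expects to see.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system = "You are a helpful assistant."                  # placeholder system prompt
turns = [("Hi, who are you?", "I'm a chat assistant."),  # completed round
         ("What can you do?", None)]                     # round to be generated

prompt = ""
for i, (user, answer) in enumerate(turns):
    user_part = (B_SYS + system + E_SYS + user) if i == 0 else user
    prompt += f"<s>{B_INST} {user_part} {E_INST}"
    if answer is not None:
        prompt += f" {answer} </s>"

print(prompt)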
As for the Xwin model, I would speculate that the finetuning was somewhat aggressive and maybe used <s> in an unusual way. Maybe every training example had <s> followed by a newline token or some such, and now the model gets wildly confused whenever <s> is followed by anything else. Who can say.
To make matters worse, depending on how much the model has been finetuned (learning rate, number of epochs and so on) it may still output coherent text even if it isn't being prompted correctly, it just won't necessarily behave as intended. E.g. it might adhere to the prompt format most of the time, then suddenly deviate from it for no apparent reason.
That would make sense. What can be done then is to let people know about the need to remove the BOS token. I have uploaded the two quants that I mentioned to HF and added a small note to the README about this issue:
https://huggingface.co/oobabooga/Llama-2-70b-chat-EXL2-2.500b
https://huggingface.co/oobabooga/Xwin-LM-70B-V0.1-EXL2-2.500b
The conversions you provided worked for me (without removing the BOS)! There could be something wrong in my quantization process, but I did manage to quantize a 13B model and ran it with no problem.
I also noticed a difference: TGW can auto-detect the model loader for your conversions but not for mine (even for the 13B model).
Hey all, just another data point to add to the discussion. I quantized Xwin-LM-70B-V0.1 at 8.000 bpw and it worked without needing to change any of the default text-generation-webui settings (BOS is being used). I used the same model and training data as documented in oobabooga's post, but I changed the last-layer bit precision to 8 instead of the default 6.
I'll be working on a llama2-70B-chat conversion later today with the same 8.000 bpw setup.
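For reference, that conversion would look roughly like the command above with the bitrate changed, assuming the head-layer (last layer) precision is controlled by convert.py's -hb/--head_bits option (the flag name and the output directory name here are assumptions; convert.py -h lists the exact arguments):

python convert.py \
-i ../Xwin-LM_Xwin-LM-70B-V0.1_safetensors \
-o ~/working \
-cf Xwin-LM-70B-V0.1-EXL2-8.000b \
-c Evol-Instruct-Code-80k-v1.parquet \
-b 8.000 \
-hb 8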
I have the exact same issue. I downloaded turboderp_Llama2-70B-exl2 at 2.5 bpw. I have a 3090 24GB. The LLM is mostly hung up on the word "cord". How do I fix this? I do not understand your comment completely. I am new to this, but my understanding is that the LLM gets confused by the BOS token and that there is an option to disable it? Where do I find that option in the oobabooga GUI?
It's under parameters->generation, on the right-hand side of the page.
Closing this now. For reference, the new measurement procedure seems to have fixed the "cord cord stringbuilder" issue on low-bitrate models, regardless of BOS settings.
Hi, I am new to LLM quantization and wanted to use text-generation-webui to run a local LLM chatbot.
ExLlamaV2 worked great when I applied it to a 13B model at 8 bpw, but when it came to the 70B model at 2.5 bpw, I got strange, repetitive results.
I don't think this was a problem with the web UI, because it also occurred while using examples/chat.py.
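For anyone reproducing this outside the web UI, a run through examples/chat.py looks roughly like this (the model path is a placeholder, and the flag names are my recollection of recent exllamav2 versions; run chat.py -h, or -modes as suggested above, to confirm the options):

python examples/chat.py \
-m /path/to/Llama2-70B-chat-exl2-2.5bpw \
-mode llama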