Closed — rjmehta1993 closed this issue 2 weeks ago
You need to quantize the model to exl2.
The problem is not the quantization. I have the 4-bit quantized model, but exllamav2 doesn't support the Qwen1.5 architecture yet.
After reading the code for some time, I could quantize the model to exl2.
Here's the trick:
I am not sure how to open a pull request because I have never done this before.
I'm working on Qwen support at the moment. There's a bit more to it than just enabling bias in the Torch linear modules, but not much more. It's pretty much done, and I just need to test it a bit, so expect an update soon.
Sure. I really look forward to it.
It's great and I'm really looking forward to it.
If there's any news, please @ me.
Smaug-72B and Qwen1.5 are supported now. Took some doing. :sleepy:
Smaug quants here (2.5bpw still uploading)
Note that because of the lack of GQA, the default context length of 32768 requires 80 GB of VRAM just for the cache, so you probably want to limit it.
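The 80 GB figure follows from the cache dimensions. A rough back-of-the-envelope sketch, assuming Qwen1.5-72B's 80 layers and 8192 hidden size, an FP16 cache, and no GQA (so keys and values are full hidden-size vectors):

```python
# Rough KV-cache size estimate for a model without GQA (full-width K/V).
# Figures assume Qwen1.5-72B: 80 layers, hidden size 8192, FP16 (2 bytes/elem).
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_elem=2):
    # 2x for keys and values: one full hidden-size vector each,
    # per token, per layer
    return 2 * num_layers * seq_len * hidden_size * bytes_per_elem

full = kv_cache_bytes(80, 8192, 32768)
print(full / 2**30)      # 80.0 GiB at the default 32768-token context

limited = kv_cache_bytes(80, 8192, 4096)
print(limited / 2**30)   # 10.0 GiB with the context capped at 4096
```

Since the cache scales linearly with context length, capping `max_seq_len` is the easiest way to fit the model on fewer GPUs.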
@Pevernow
So thanks @turboderp !
@turboderp First, thanks for the awesome inference library!
Tokenizers is not included in the requirements for this repo. Is that intentional? When attempting to load Smaug I'm receiving this error: "Attempting to load HF Tokenizer, but Tokenizers library is not installed". If I pip install tokenizers in my venv, the issue is resolved.
$ python test_inference.py -m /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw --gpu_split auto -p "Here is a funny joke about linux"
-- Model: /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw
-- Options: ['gpu_split: auto']
-- Loading tokenizer...
Traceback (most recent call last):
  File "/home/x0xxin/exllamav2/test_inference.py", line 83, in <module>
    model, tokenizer = model_init.init(args, allow_auto_split = True, skip_load = args.stream_layers, benchmark = True)
  File "/home/x0xxin/exllamav2/exllamav2/model_init.py", line 112, in init
    tokenizer = ExLlamaV2Tokenizer(config)
  File "/home/x0xxin/exllamav2/exllamav2/tokenizer.py", line 66, in __init__
    elif os.path.exists(path_hf): self.tokenizer = ExLlamaV2TokenizerHF(path_hf)
  File "/home/x0xxin/exllamav2/exllamav2/tokenizers/hf.py", line 21, in __init__
    assert self.is_supported(), "Attempting to load HF tokenizer, but Tokenizers library is not installed"
AssertionError: Attempting to load HF tokenizer, but Tokenizers library is not installed
After running pip install tokenizers:
Installing collected packages: tokenizers
Successfully installed tokenizers-0.15.2
(exui) x0xxin at llama in ~/exllamav2 on master*
$ python test_inference.py -m /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw --gpu_split auto -p "Here is a funny joke about linux"
-- Model: /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw
-- Options: ['gpu_split: auto']
-- Loading tokenizer...
-- Loading model...
-- Loaded model in 23.4381 seconds
-- Warmup...
-- Generating...
Here is a funny joke about linux and its command line. Enjoy this one!
A man walks into a bar and orders a drink. The bartender pours him a glass of beer and asks, "What's your name?"
The man replies, "grep."
"What kind of name is that?" the bartender asks.
"It's my username," explains the man. "I'm a Linux user."
"Oh, ok," says the bartender. "What do you do for a living?"
"I'm a programmer," the man replies. "I spend most of my time at the command line."
The bartender nods, then turns to another customer in the bar and shouts, "Hey,
-- Response generated in 11.16 seconds, 128 tokens, 11.47 tokens/second (includes prompt eval.)
@strikeoncmputrz It's intentional, yes. ExLlama works without the Tokenizers library, using just SentencePiece for models that provide a SentencePiece tokenizer (tokenizer.model). The Tokenizers library is there as a fallback for other models that don't, like Qwen, which uses Tiktoken.
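The fallback order described above can be sketched roughly like this (illustrative function and return values, not exllamav2's actual classes), including the guarded import that produces the assertion error from the traceback:

```python
import os

# Hedged sketch of the tokenizer selection logic: prefer a SentencePiece
# tokenizer.model if present, otherwise fall back to tokenizer.json, which
# requires the optional Tokenizers library.
def pick_tokenizer(model_dir):
    sp_path = os.path.join(model_dir, "tokenizer.model")
    hf_path = os.path.join(model_dir, "tokenizer.json")
    if os.path.exists(sp_path):
        return ("sentencepiece", sp_path)
    if os.path.exists(hf_path):
        try:
            import tokenizers  # optional dependency, only needed on this path
        except ImportError:
            raise RuntimeError(
                "Attempting to load HF tokenizer, but Tokenizers library is not installed"
            )
        return ("hf", hf_path)
    raise FileNotFoundError("No tokenizer found in " + model_dir)
```

This is why a pip install tokenizers fixes the error for Qwen-based models but was never needed for Llama-family models that ship a tokenizer.model.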
@turboderp Thank you for such great work. It seems to me that something went wrong. In the image below, the highlighted text in the response is where I expected the model to stop at. But for some reason, it continues to generate text that looks like its training data. Do you have any idea? (The model I used is Smaug-72B-exl2 4.0bpw)
@mymymy1303 Which prompt format are you using? And do you have stop conditions properly set up?
@turboderp I used PromptFormatLlama with default stop conditions (tokenizer.eos_token_id). Essentially, I used all default options with no specific parameters.
According to this thread the model was trained without a template, so there's no obvious way to know when a response has ended.
What looks like training data is just the model hallucinating the next question. All models do this, but some have a well-defined stop condition so you can cut the output stream off where you'd want to insert your actual next question and not just whatever the model thinks is a likely continuation of the pattern in the prompt.
This stop condition is usually the EOS token, but some models are confused, and Smaug seems to have been trained on a bunch of different sources that disagree on what that token should be, which I guess is why its tokenizer defines both an <|endoftext|> token (OpenChat prompt format) and an <|im_end|> token (ChatML prompt format).
Llama usually calls the EOS token </s>, but in the end it's just a number, and if the training was done on examples containing EOS tokens and they were correctly encoded during training (according to the config.json provided), it shouldn't be an issue. Still, you're essentially asking the model to guess what you expect the stop condition to look like, and it will base its guess on details of the prompt that you might not be considering. The double quotes after the response are suspicious in that regard, suggesting maybe there are some extra quotation marks in the prompt which the model then incorrectly interprets as part of your prompt format.
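When the EOS token is unreliable, the usual workaround on the client side is to scan the generated text for known stop strings and cut the stream there. A minimal sketch, using the candidate tokens mentioned above as stop strings (the function name is hypothetical):

```python
# Cut generated text at the first occurrence of any stop string, the way a
# chat frontend would when the model has no single reliable EOS token.
def truncate_at_stop(text, stop_strings=("<|im_end|>", "<|endoftext|>", "</s>")):
    cut = len(text)
    for s in stop_strings:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)   # earliest stop string wins
    return text[:cut]

print(truncate_at_stop("A joke.<|im_end|>Next question: ..."))  # -> A joke.
```

This doesn't fix the underlying training issue, but it prevents the hallucinated follow-up questions from reaching the user.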
In any case, it would help to know what the inference code looks like and how exactly you're applying the prompt format.
Abacus models seem to be unable to return their special tokens, including <|im_end|>, so their outputs go on and on.
This appears to be fine:
But this is not:
..because pad_token comes before the <|im_end|> token, and the code is also forcibly overriding their logits. By commenting these lines out, Qwen seems to behave properly, but I'll leave how to handle it up to you, since you probably had a reason to ban these tokens that I don't know of. Everything seems to be working fine without issues, however. https://github.com/turboderp/exllamav2/blob/master/exllamav2/model.py#L669
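For context, "overriding their logits" here means banning tokens by forcing their logits to negative infinity before sampling, which gives them zero probability after softmax. A minimal sketch of the mechanism (plain Python for illustration; the real code operates on tensors):

```python
import math

# Banning tokens during sampling: a logit of -inf becomes probability 0
# after softmax, so the token can never be emitted. If a stop token like
# <|im_end|> falls inside the banned id range, the model can never stop.
def ban_tokens(logits, banned_ids):
    out = list(logits)
    for i in banned_ids:
        out[i] = -math.inf
    return out

logits = [1.0, 2.5, 0.3, 4.0]
banned = ban_tokens(logits, [3])   # suppose id 3 is a special/stop token
print(banned)  # [1.0, 2.5, 0.3, -inf] -- token 3 can never be sampled
```

That is why unbanning (or not banning) the model's actual stop token matters: with it banned, generation runs until the token limit instead of stopping.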
In addition, I added functionality to my local copy so that single-token strings can be recognized.
tokenizer_config.json for reference: https://huggingface.co/Qwen/Qwen1.5-72B-Chat/blob/main/tokenizer_config.json
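The mapping from single-token strings to ids can be read out of the tokenizer_config.json linked above. A hedged sketch, assuming the Hugging Face added_tokens_decoder layout that the Qwen1.5 config uses (function name is hypothetical):

```python
import json

# Read special-token strings (e.g. "<|im_end|>") and their ids from a
# tokenizer_config.json "added_tokens_decoder" section, so those strings
# can be matched as single tokens / stop conditions.
def special_token_ids(config_text):
    cfg = json.loads(config_text)
    return {entry["content"]: int(tok_id)
            for tok_id, entry in cfg.get("added_tokens_decoder", {}).items()}

sample = ('{"added_tokens_decoder": {'
          '"151643": {"content": "<|endoftext|>"}, '
          '"151645": {"content": "<|im_end|>"}}}')
print(special_token_ids(sample))
# {'<|endoftext|>': 151643, '<|im_end|>': 151645}
```

With this table, a generator can check each sampled id against the known stop tokens directly, rather than scanning decoded text.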
Quanted here:
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw2.5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw3
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw3.5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw3.7
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.2
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.4
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.6
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.8
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw5.5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw6
Smaug, Qwen and Qwen2 should all be fine now.
Smaug 72B, the top model on the leaderboard, doesn't work. It uses Qwen as the base model, but the architecture is LlamaForCausalLM.
Is this unsupported yet, or am I using it incorrectly?
https://huggingface.co/abacusai/Smaug-72B-v0.1