turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License
3.2k stars 236 forks

Is Smaug 72B supported? #334

Closed rjmehta1993 closed 2 weeks ago

rjmehta1993 commented 4 months ago

Smaug 72B, currently at the top of the leaderboard, doesn't work. It uses Qwen as the base model, but the architecture is LlamaForCausalLM.

Is this not supported yet, or am I using it incorrectly?

https://huggingface.co/abacusai/Smaug-72B-v0.1

KaraKaraWitch commented 4 months ago

You need to quantize the model to exl2.
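(For reference, quantizing to exl2 is normally done with the repo's convert.py; the invocation looks roughly like the line below, with placeholder paths, and the exact flags may differ between versions.)

$ python convert.py -i /path/to/Smaug-72B-v0.1 -o /path/to/workdir -cf /path/to/Smaug-72B-exl2-4.0bpw -b 4.0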

rjmehta1993 commented 4 months ago

The problem is not the quantization. I have the 4-bit quantized model, but exllamav2 doesn't support the Qwen1.5 architecture yet.

mymymy1303 commented 4 months ago

After reading the code for a while, I was able to quantize the model to exl2.

Here's the trick:

image

I am not sure how to open a pull request because I have never done this before.

turboderp commented 4 months ago

I'm working on Qwen support at the moment. There's a bit more to it than just enabling bias in the Torch linear modules, but not much more. It's pretty much done, and I just need to test it a bit, so expect an update soon.
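(For illustration only, a minimal PyTorch sketch of what "enabling bias in the linear modules" means; this is not exllamav2's actual code, and the helper name is made up.)

import torch.nn as nn

# Qwen1.5 attention projections (q/k/v) carry a bias term, unlike Llama's, so
# the corresponding linear layers have to be constructed with bias enabled.
def make_attn_proj(hidden_size: int, out_features: int, use_bias: bool) -> nn.Linear:
    return nn.Linear(hidden_size, out_features, bias=use_bias)

q_proj = make_attn_proj(8192, 8192, use_bias=True)  # Qwen1.5-72B-like shape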

mymymy1303 commented 4 months ago

I'm working on Qwen support at the moment. There's a bit more to it than just enabling bias in the Torch linear modules, but not much more. It's pretty much done, and I just need to test it a bit, so expect an update soon.

Sure. I really look forward to it.

Pevernow commented 4 months ago

I'm working on Qwen support at the moment. There's a bit more to it than just enabling bias in the Torch linear modules, but not much more. It's pretty much done, and I just need to test it a bit, so expect an update soon.

It's great and I'm really looking forward to it.

If there is any news, please @ me.

turboderp commented 4 months ago

Smaug-72B and Qwen1.5 are supported now. Took some doing. :sleepy:

Smaug quants here (2.5bpw still uploading)

Note that because of the lack of GQA, the default context length of 32768 requires 80 GB of VRAM just for the cache, so you probably want to limit it.
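(Rough sketch, not a command from this thread: capping the context length through the Python API. Paths are placeholders and the loader calls may differ slightly by version. The 80 GB figure follows from the model's 80 layers and 8192-wide keys/values at FP16: 2 x 2 bytes x 8192 x 80 layers x 32768 tokens is roughly 86 GB.)

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/Smaug-72B-exl2"   # placeholder path
config.prepare()
config.max_seq_len = 4096                      # cap the context; the cache shrinks proportionally

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)     # allocated as layers are loaded
model.load_autosplit(cache)                    # split across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)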

@Pevernow

JackCloudman commented 4 months ago

Thanks so much, @turboderp!

strikeoncmputrz commented 4 months ago

@turboderp First, thanks for the awesome inference library!

Tokenizers is not included in the requirements for this repo. Is that intentional?

When attempting to load Smaug, I'm receiving this error: "Attempting to load HF Tokenizer, but Tokenizers library is not installed." If I pip install tokenizers in my venv, the issue is resolved.

$ python test_inference.py -m /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw --gpu_split auto -p "Here is a funny joke about linux"
 -- Model: /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw
 -- Options: ['gpu_split: auto']
 -- Loading tokenizer...
Traceback (most recent call last):
  File "/home/x0xxin/exllamav2/test_inference.py", line 83, in <module>
    model, tokenizer = model_init.init(args, allow_auto_split = True, skip_load = args.stream_layers, benchmark = True)
  File "/home/x0xxin/exllamav2/exllamav2/model_init.py", line 112, in init
    tokenizer = ExLlamaV2Tokenizer(config)
  File "/home/x0xxin/exllamav2/exllamav2/tokenizer.py", line 66, in __init__
    elif os.path.exists(path_hf): self.tokenizer = ExLlamaV2TokenizerHF(path_hf)
  File "/home/x0xxin/exllamav2/exllamav2/tokenizers/hf.py", line 21, in __init__
    assert self.is_supported(), "Attempting to load HF tokenizer, but Tokenizers library is not installed"
AssertionError: Attempting to load HF tokenizer, but Tokenizers library is not installed

After pip install tokenizers

Installing collected packages: tokenizers
Successfully installed tokenizers-0.15.2
(exui) x0xxin at llama in ~/exllamav2 on master*

$ python test_inference.py -m /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw --gpu_split auto -p "Here is a funny joke about linux"
 -- Model: /nvme/LLMs/turboderp_Smaug-72B-exl2_3.0bpw
 -- Options: ['gpu_split: auto']
 -- Loading tokenizer...
 -- Loading model...
 -- Loaded model in 23.4381 seconds
 -- Warmup...
 -- Generating...

Here is a funny joke about linux and its command line. Enjoy this one!
A man walks into a bar and orders a drink. The bartender pours him a glass of beer and asks, "What's your name?"
The man replies, "grep."
"What kind of name is that?" the bartender asks.
"It's my username," explains the man. "I'm a Linux user."
"Oh, ok," says the bartender. "What do you do for a living?"
"I'm a programmer," the man replies. "I spend most of my time at the command line."
The bartender nods, then turns to another customer in the bar and shouts, "Hey,

 -- Response generated in 11.16 seconds, 128 tokens, 11.47 tokens/second (includes prompt eval.)
turboderp commented 4 months ago

@strikeoncmputrz It's intentional, yes. ExLlama works without the Tokenizers library, using just SentencePiece for models that provide a SentencePiece tokenizer (tokenizer.model). The Tokenizers library is there as a fallback for other models that don't, like Qwen, which uses Tiktoken.
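(Simplified illustration of that fallback order; this is not the library's actual code.)

import os

def pick_tokenizer_backend(model_dir: str) -> str:
    # Prefer a SentencePiece model file; fall back to the HF Tokenizers library
    # for models such as Qwen that only ship a tokenizer.json.
    if os.path.exists(os.path.join(model_dir, "tokenizer.model")):
        return "sentencepiece"
    if os.path.exists(os.path.join(model_dir, "tokenizer.json")):
        return "hf-tokenizers"   # requires pip install tokenizers
    raise FileNotFoundError(f"No supported tokenizer files in {model_dir}")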

mymymy1303 commented 4 months ago

@turboderp Thank you for such great work. Something seems to have gone wrong, though. In the image below, the highlighted text in the response is where I expected the model to stop, but for some reason it continues generating text that looks like its training data. Do you have any idea why? (The model I used is Smaug-72B-exl2 4.0bpw)

image

turboderp commented 4 months ago

@mymymy1303 Which prompt format are you using? And do you have stop conditions properly set up?

mymymy1303 commented 4 months ago

@mymymy1303 Which prompt format are you using? And do you have stop conditions properly set up?

@turboderp I used PromptFormatLlama with default stop conditions (tokenizer.eos_token_id). Essentially, I used all default options with no specific parameters.

turboderp commented 4 months ago

According to this thread the model was trained without a template, so there's no obvious way to know when a response has ended.

What looks like training data is just the model hallucinating the next question. All models do this, but some have a well-defined stop condition so you can cut the output stream off where you'd want to insert your actual next question and not just whatever the model thinks is a likely continuation of the pattern in the prompt.

This stop condition is usually the EOS token, but some models are confused, and Smaug seems to have been trained on a bunch of different sources that disagree on what that token should be, which I guess is why its tokenizer has both <|endoftext|> (OpenChat prompt format) and <|im_end|> (ChatML prompt format) tokens defined.
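(A practical workaround is to register every candidate as a stop condition. This is a hedged sketch using the streaming generator API as it existed around this time, with model, cache and tokenizer loaded as in the earlier snippet; names may differ in later releases.)

from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2StreamingGenerator

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# Stop on the declared EOS id as well as both end-of-turn markers the tokenizer defines.
generator.set_stop_conditions([
    tokenizer.eos_token_id,
    "<|im_end|>",
    "<|endoftext|>",
])

generator.begin_stream(tokenizer.encode("Your prompt here"), settings)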

Llama usually calls the EOS token </s>, but in the end it's just a number, and if training was done on examples containing EOS tokens that were correctly encoded (according to the provided config.json), it shouldn't be an issue. Still, you're essentially asking the model to guess what you expect the stop condition to look like, and it will base its guess on details of the prompt that you might not be considering. The double quotes after the response are suspicious in that regard, suggesting there may be extra quotation marks in the prompt which the model then incorrectly interprets as part of your prompt format.

In any case, it would help to know what the inference code looks like and how exactly you're applying the prompt format.

Lyrcaxis commented 3 months ago

Abacus models seem to be unable to return their special tokens, including <|im_end|>, so their outputs go on and on. This appears to be fine: image But this is not: image

...because the pad_token comes before the <|im_end|> token, and the code is forcibly overriding their logits as well. image

By commenting these lines out, Qwen seems to behave properly, but I'll leave how to handle it up to you, because you probably had a reason to ban these tokens that I don't know of. Everything seems to be working fine without issues, however. https://github.com/turboderp/exllamav2/blob/master/exllamav2/model.py#L669
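(For context, a generic illustration of the kind of logit override being discussed; not the library's exact code. Forcing a token id's logit to negative infinity bans it from ever being sampled, so if the banned range accidentally covers <|im_end|>, the model can never emit its stop token.)

import torch

def ban_token(logits: torch.Tensor, token_id: int) -> torch.Tensor:
    # A logit of -inf gives the token zero probability after softmax.
    logits[..., token_id] = float("-inf")
    return logits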

In addition, I added functionality to my local copy so that single-token strings can be recognized: image

tokenizer_config.json for reference: https://huggingface.co/Qwen/Qwen1.5-72B-Chat/blob/main/tokenizer_config.json

Kerushii commented 3 months ago

https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw2.5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw3
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw3.5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw3.7
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.2
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.4
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.6
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw4.8
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw5.5
https://huggingface.co/blockblockblock/Smaug-72B-v0.1-bpw6

Quantized here.

turboderp commented 2 weeks ago

Smaug, Qwen and Qwen2 should all be fine now.