turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

self.rms_norm_eps = read_config["rms_norm_eps"] KeyError: 'rms_norm_eps' (Qwen model not supported) #160

Closed tutu329 closed 8 months ago

tutu329 commented 9 months ago

Qwen is the SOTA open-source LLM in China, and its 72B-Chat model will be released this month. Qwen-Int4 is supported by AutoGPTQ, but it becomes very slow when run across multiple GPUs. So if exllama supported models like Qwen-72b-chat-gptq, that would be so exciting!

I tested exllamav2 on a model like llama-2-70b-gptq using 2x 2080 Ti (22 GB VRAM) and a 13900K. Output speed was 13 tokens/s. That is amazing!

So please support Qwen models like Qwen-14B-Chat-GPTQ (https://huggingface.co/TheBloke/Qwen-14B-Chat-GPTQ)! Thanks a lot!
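
For context on the KeyError in the title: ExLlamaV2 reads a Llama-style config.json, and Qwen's config names its norm epsilon differently (layer_norm_epsilon, at least in the Qwen-1 configs). A hypothetical sketch of the kind of fallback a loader would need, not actual ExLlamaV2 code:

```python
# Hypothetical sketch of a config fallback; not actual ExLlamaV2 code.
# Qwen's config.json (as of Qwen-1) names the RMSNorm epsilon
# "layer_norm_epsilon" instead of Llama's "rms_norm_eps".
import json

def read_norm_eps(config_path: str, default: float = 1e-6) -> float:
    with open(config_path, encoding="utf-8") as f:
        cfg = json.load(f)
    # Try the Llama key first, then the Qwen key, then fall back to a default.
    for key in ("rms_norm_eps", "layer_norm_epsilon"):
        if key in cfg:
            return cfg[key]
    return default
```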

Ph0rk0z commented 9 months ago

Can't it work in the textgen UI if trust_remote_code is set and you're using exllamav2_hf?
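
For anyone unfamiliar with the trust_remote_code part: Qwen ships its own modeling code in the HF repo, so plain Transformers needs that flag to load it at all. A rough, standalone sketch (the model id is just an example, and this is the vanilla Transformers path, not the exllamav2_hf loader itself):

```python
# Rough sketch: loading Qwen through plain HF Transformers.
# trust_remote_code is required because Qwen ships custom modeling code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-14B-Chat"  # example model id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # requires the accelerate package
    trust_remote_code=True,
)
```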

tutu329 commented 9 months ago

> Can't it work in the textgen UI if trust_remote_code is set and you're using exllamav2_hf?

It can't work in the exllamav2 UI.

lhl commented 9 months ago

I've been poking at Qwen recently. Just for reference, in case anyone wants to dig into the details, the best documentation of the Qwen architecture is this file in the HF repo: https://huggingface.co/Qwen/Qwen-14B/blob/main/modeling_qwen.py

I also found this script to "llamify" Qwen models, btw, which should make quants compatible with Llama inferencers (you still need the QwenTokenizer): https://github.com/hiyouga/LLaMA-Factory/blob/main/tests/llamafy_qwen.py
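
For a sense of what that "llamify" step involves: Qwen stores attention as a single fused c_attn projection under transformer.h.N, while Llama expects separate q/k/v projections under model.layers.N. A simplified sketch of the renaming and splitting (see the linked llamafy_qwen.py for the real, complete script; llamafy_layer here is a made-up helper):

```python
# Simplified sketch of the key renaming a "llamify" conversion performs.
# Tensor names follow Qwen's modeling_qwen.py and Llama's HF layout.
import torch

def llamafy_layer(state_dict: dict, i: int) -> dict:
    out = {}
    prefix = f"transformer.h.{i}"
    # Qwen fuses Q, K and V into one c_attn projection of shape
    # (3 * hidden, hidden); Llama wants three separate projections.
    w = state_dict[f"{prefix}.attn.c_attn.weight"]
    q, k, v = torch.chunk(w, 3, dim=0)
    out[f"model.layers.{i}.self_attn.q_proj.weight"] = q
    out[f"model.layers.{i}.self_attn.k_proj.weight"] = k
    out[f"model.layers.{i}.self_attn.v_proj.weight"] = v
    out[f"model.layers.{i}.self_attn.o_proj.weight"] = \
        state_dict[f"{prefix}.attn.c_proj.weight"]
    # Note: Qwen's c_attn also carries a bias, which Llama's projections
    # lack; that mismatch is one of the wrinkles a real conversion handles.
    return out
```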

turboderp commented 9 months ago

The issue with Qwen is that it's a huge departure from the Llama architecture. If someone wants to submit the code I'll happily entertain a PR, but if not, I'm doing this on my own in my spare time for free, and there are only so many hours in a day.

If the model can be Llamafied, that's a place to start, I guess. I would imagine it's also possible to convert the Tiktoken model to SentencePiece? Somehow?
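
On the Tiktoken side: Qwen ships its vocabulary as qwen.tiktoken, which (per Qwen's tokenization notes) is a plain-text file of base64-encoded token bytes paired with their BPE ranks, so the vocab can at least be dumped for inspection. A rough sketch assuming that format:

```python
# Rough sketch: dump Qwen's tiktoken vocab for inspection. Assumes the
# qwen.tiktoken format from Qwen's tokenization notes:
# one "<base64 token bytes> <rank>" pair per line.
import base64

def load_tiktoken_vocab(path: str) -> dict[bytes, int]:
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token_b64, rank = line.split()
            vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab

vocab = load_tiktoken_vocab("qwen.tiktoken")
print(len(vocab), "tokens")  # BPE merge order is implicit in the ranks
```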

lhl commented 9 months ago

Qwen's tokenizer is also pretty crazy and custom, so I'm doubtful about an easy conversion, but for anyone interested, some pointers:

The tokenizer has actually been historically buggy, btw. I mean, Qwen is a bit of a wild ride even using the full model with HF Transformers (it has very specific version requirements, and I'm honestly still not sure which side its padding should be on - I've seen sample code with both, lol). For anyone looking to dive in, be sure to check out the project's issue history (it's a very popular and very active project, uh, mostly in Chinese, so have your translator handy): https://github.com/QwenLM/Qwen/issues
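
On the padding-side question specifically: with HF Transformers you can at least pin it explicitly instead of trusting the checkpoint default. A trivial sketch (left padding shown, the usual choice for batched decoder-only generation, but check Qwen's own examples):

```python
# Trivial sketch: pin the padding side explicitly rather than relying
# on whatever the checkpoint's tokenizer config defaults to.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True)
# Decoder-only models are usually left-padded for batched generation,
# but Qwen's sample code has shown both; set it explicitly either way.
tok.padding_side = "left"
```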

Personally, I'd recommend that anyone who wants something Qwen-like look at https://huggingface.co/CausalLM/14B, which not only llamifies the layers but also updates the attention calculations to Llama 2 MHA/RoPE and moves over to a GPT2Tokenizer (which I don't think ExLlama handles, but which at least is llama.cpp compatible). Or look towards Yi (if you don't care about licensing) or the next Mistral. There's no end of amazing models coming down the pipe, so I think everyone's going to need to pick and choose their battles. :)

(Just leaving these notes here in case anyone's looking to dig in, since all this stuff is really fresh for me - not to encourage or discourage anyone from contributing Qwen support.)

CyberTimon commented 9 months ago

+1, would also love to see Qwen 72B support when it finally releases next week! Does anyone know how to llamafy Qwen? I would love to help, but I don't think I have enough knowledge to do so.

CyberTimon commented 9 months ago

Maybe this can be used for the tokenizer? https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee Here is some more info: https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md
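
If that gist emits a standard tokenizer.json in the HF tokenizers format (which is what it appears to target), loading it would look roughly like this - treat the filename as an assumption:

```python
# Assumes the gist produces an HF-tokenizers tokenizer.json; the
# filename here is a guess, not something the gist guarantees.
from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
print(tok.tokenize("hello world"))  # quick sanity check of the conversion
```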

Edit: Maybe we can rename this issue to something like "Qwen support"? @tutu329