Closed nivibilla closed 7 months ago
So far I've heard it requires 2x 80 GB GPUs to run, or 4x 40 GB, which is just shy of the 140 GB needed for a 70B model in fp16. So if it's possible to run a 70B on 24 GB, I hope it should be possible to run the MoE on a single 24 GB GPU.
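For reference, the back-of-the-envelope arithmetic behind these figures can be sketched as follows (assuming the commonly reported ~46.7B total parameters for the 8x7B model; the experts share the attention layers, so it's less than 8 * 7B):

```python
# Rough VRAM estimate for Mixtral 8x7B weights at various precisions.
# Assumption: ~46.7B total parameters (reported figure, not official).
PARAMS = 46.7e9

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed just for the weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"fp16   : {weight_gb(16):.1f} GB")   # ~93 GB -> 4x 24 GB cards
print(f"8-bit  : {weight_gb(8):.1f} GB")    # ~47 GB -> 2x 24 GB cards
print(f"4-bit  : {weight_gb(4):.2f} GB")    # ~23 GB -> borderline on one 24 GB card
print(f"2.5-bit: {weight_gb(2.5):.1f} GB")  # roughly exl2 2-bit territory
```

These numbers cover weights only; the KV cache and activations add several more GB on top, which is why even 4-bit is tight on a single 24 GB card.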
I don't doubt that it's possible.
The main challenges are:
I look forward to some sort of announcement from Mistral. Maybe even some reference code.
It runs on four 3090s, taking under 92 GB in fp16; not sure where those other figures are from. I got it running yesterday.
No, it can run on 2x 3090s with 8-bit or 4-bit quantization using bitsandbytes, but it runs extremely slowly. The only way to make it practical is with exllama or similar. Not even GPTQ works right now.
https://twitter.com/4evaBehindSOTA/status/1733551103105720601?t=SiKV8qH1IIoKiQlRBiIR5w&s=19
Tim from bitsandbytes says he's done MoE quantization before, and that it's actually easier than for vanilla transformers.
I really hope it gets exllama support. The model is multilingual and seems very powerful. Hope it can be implemented.
The official inference code for vLLM etc. dropped hours ago.
The mixtral pr got merged into transformers, so you can look at their implementation: https://github.com/huggingface/transformers/pull/27942
Edit: Llama.cpp is now also adding mixtral: https://github.com/ggerganov/llama.cpp/pull/4406
When you do implement it, please don't forget the ROCm implementation 🙇
NICE, glad to see this is being talked about 👍
When attempting to use Mixtral-8x7B-v0.1-GPTQ from https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ with chat.py, this error occurs:
python examples/chat.py -m C:\Users\danu0\Downloads\Artificial-Intelligence\Mixtral-8x7B-v0.1-GPTQ -mode raw
-- Model: C:\Users\danu0\Downloads\Artificial-Intelligence\Mixtral-8x7B-v0.1-GPTQ
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
Traceback (most recent call last):
File "C:\Users\danu0\Downloads\Artificial-Intelligence\exllamav2\examples\chat.py", line 81, in <module>
model, tokenizer = model_init.init(args)
File "C:\Users\danu0\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\model_init.py", line 64, in init
config.prepare()
File "C:\Users\danu0\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\config.py", line 133, in prepare
raise ValueError(f" ## Could not find {prefix}.* in model")
ValueError: ## Could not find model.layers.0.mlp.down_proj.* in model
It's a different architecture. It won't work out of the box. Pls wait for Turbo lol.
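Concretely, the loader is looking for Llama-style `model.layers.N.mlp.down_proj` weights, but in the merged transformers implementation Mixtral replaces each MLP with a routed block whose weights live under `model.layers.N.block_sparse_moe` (a gate plus eight experts), so the expected prefix simply doesn't exist in the checkpoint. A rough sketch of the top-2 routing such a block performs, with plain numpy linear maps standing in for the real SwiGLU experts:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Top-k MoE routing sketch. x: (hidden,), gate_w: (n_experts, hidden)."""
    logits = gate_w @ x                       # one router score per expert
    top = np.argsort(logits)[-top_k:]         # select the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected experts
    # Only the chosen experts run -- this is why the model holds ~8 experts'
    # worth of weights but spends roughly 2 experts' worth of compute per token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
hidden, n_experts = 8, 8
gate_w = rng.standard_normal((n_experts, hidden))
# Stand-in "experts": simple linear maps instead of full SwiGLU MLPs.
expert_mats = [rng.standard_normal((hidden, hidden)) for _ in range(n_experts)]
experts = [(lambda m: (lambda x: m @ x))(m) for m in expert_mats]

y = moe_forward(rng.standard_normal(hidden), gate_w, experts)
print(y.shape)  # (8,)
```

An MoE-aware loader has to map these per-expert matrices (plus the gate) instead of the single dense MLP, which is the work being done in the experimental branch.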
He is working on it. In the experimental branch there is already a preview, which works but is unoptimized.
All done. For now.
Not an issue, but seeing that exl2 2-bit quants of a 70B model can fit on a single 24 GB GPU, I'm wondering if it's possible to run a quantized version of Mixtral 7b*8 on a single 24 GB GPU, and whether that's something exllamav2 could support or a completely different project?
Mistral MoE 7b*8 model https://twitter.com/MistralAI/status/1733150512395038967?t=6jDOugc19MUNyOV1KK6Ing&s=19