turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Mixtral #223

Closed · nivibilla closed this issue 7 months ago

nivibilla commented 7 months ago

Not an issue, but seeing that EXL2 2-bit quants of a 70B model can fit on a single 24GB GPU, I'm wondering if it's possible to run a quantized version of Mixtral 8x7B on a single 24GB GPU, and whether that's something exllamav2 could support or a completely different project.

Mistral MoE 8x7B model: https://twitter.com/MistralAI/status/1733150512395038967?t=6jDOugc19MUNyOV1KK6Ing&s=19

nivibilla commented 7 months ago

So far I've heard it requires 2x 80GB GPUs to run, or 4x 40GB, which is comparable to the ~140GB needed for a 70B in fp16. So if it's possible to run a 70B on 24GB, then I hope it should be possible to run the MoE on a single 24GB GPU.

turboderp commented 7 months ago

I don't doubt that it's possible.

The main challenges are:

  1. quantization; specifically, ensuring that the calibration data triggers all of the experts often enough, so that you don't end up with only a very small sample for an expert that triggers rarely, and
  2. fast batching (and prompt ingestion), since each token in the batch routes to its own pair of experts (see the sketch below).

I look forward to some sort of announcement from Mistral. Maybe even some reference code..
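
For readers unfamiliar with point 2, here is a minimal sketch of Mixtral-style top-2 routing in plain PyTorch. This is not exllamav2 code; names and shapes are illustrative, and the Linear "experts" in the toy usage stand in for the real feed-forward blocks.

```python
import torch
import torch.nn.functional as F


def moe_forward(x, gate, experts, top_k=2):
    """Route each token to its top_k experts and mix the results."""
    # x: (num_tokens, hidden_dim); gate: Linear(hidden_dim, num_experts); experts: list of modules
    logits = gate(x)                                     # (num_tokens, num_experts)
    weights, chosen = torch.topk(logits, top_k, dim=-1)  # each token picks its own pair of experts
    weights = F.softmax(weights, dim=-1)                 # renormalize over the chosen experts only
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = torch.where(chosen == e)       # tokens that routed to expert e
        if token_idx.numel() == 0:
            continue                                     # a rarely chosen expert may see no tokens at all
        out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
    return out


# Toy usage: 8 experts, top-2 routing.
hidden, n_experts = 64, 8
gate = torch.nn.Linear(hidden, n_experts, bias=False)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(n_experts)]
print(moe_forward(torch.randn(5, hidden), gate, experts).shape)  # torch.Size([5, 64])
```

The `if token_idx.numel() == 0` branch is the calibration concern in point 1: during quantization, an expert that is rarely selected only sees whatever few tokens happen to route to it.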

MarkMakers commented 7 months ago

> So far I've heard it requires 2x 80GB GPUs to run, or 4x 40GB, which is comparable to the ~140GB needed for a 70B in fp16. So if it's possible to run a 70B on 24GB, then I hope it should be possible to run the MoE on a single 24GB GPU.

It runs on 4x 3090s, taking under 92GB in fp16; not sure where these other figures are from. I got it running yesterday.
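
As a rough sanity check on these figures (assuming the commonly cited ~46.7B total parameters for Mixtral 8x7B; only ~13B are active per token, but all weights still have to be resident):

```python
# Back-of-the-envelope VRAM for the weights alone (KV cache and activations come on top).
total_params = 46.7e9          # assumed total parameter count for Mixtral 8x7B
for label, bits in [("fp16", 16), ("4-bit", 4), ("2-bit", 2)]:
    gb = total_params * bits / 8 / 1e9
    print(f"{label}: ~{gb:.0f} GB")
# fp16:  ~93 GB  (consistent with the "under 92GB on 4x 3090" report above)
# 4-bit: ~23 GB
# 2-bit: ~12 GB
```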

ortegaalfredo commented 7 months ago

No, it can run on 2x 3090 with 8-bit or 4-bit quantization using bitsandbytes, but it runs extremely slowly. The only way to make it practical is with exllama or something similar. Not even GPTQ works right now.
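
For reference, a minimal sketch of the kind of setup being described here: loading Mixtral through Transformers with bitsandbytes 4-bit quantization. The model id is an assumption, it needs a transformers build that includes the Mixtral support mentioned further down, and it will be slow compared to dedicated kernels.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"   # assumed HF repo id
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # shard the ~23 GB of 4-bit weights across available GPUs
)

inputs = tokenizer("Mixtral is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```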

nivibilla commented 7 months ago

https://twitter.com/4evaBehindSOTA/status/1733551103105720601?t=SiKV8qH1IIoKiQlRBiIR5w&s=19

Tim from bitsandbytes says he's done MoE quantisation before, and that it's actually easier than for vanilla transformers.

CyberTimon commented 7 months ago

I really hope it gets exllama support. The model is multilingual and seems very powerful. Hope it can be implemented.

The official inference code for vLLM etc. dropped a few hours ago.

CyberTimon commented 7 months ago

The mixtral pr got merged into transformers, so you can look at their implementation: https://github.com/huggingface/transformers/pull/27942

Edit: llama.cpp is now also adding Mixtral: https://github.com/ggerganov/llama.cpp/pull/4406

DutchEllie commented 7 months ago

When you do implement it, please don't forget the ROCm implementation 🙇

neutrino84 commented 7 months ago

NICE, glad to see this is being talked about 👍

DogeLord081 commented 7 months ago

When attempting to use Mixtral-8x7B-v0.1-GPTQ from https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ with chat.py, this error occurs:

python examples/chat.py -m C:\Users\danu0\Downloads\Artificial-Intelligence\Mixtral-8x7B-v0.1-GPTQ -mode raw
 -- Model: C:\Users\danu0\Downloads\Artificial-Intelligence\Mixtral-8x7B-v0.1-GPTQ
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
Traceback (most recent call last):
  File "C:\Users\danu0\Downloads\Artificial-Intelligence\exllamav2\examples\chat.py", line 81, in <module>
    model, tokenizer = model_init.init(args)
  File "C:\Users\danu0\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\model_init.py", line 64, in init
    config.prepare()
  File "C:\Users\danu0\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\config.py", line 133, in prepare
    raise ValueError(f" ## Could not find {prefix}.* in model")
ValueError:  ## Could not find model.layers.0.mlp.down_proj.* in model

nivibilla commented 7 months ago

It's a different architecture. It won't work out of the box. Pls wait for Turbo lol.
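
Concretely, Mixtral's feed-forward blocks are stored under block_sparse_moe (a router gate plus eight experts with w1/w2/w3 weights) rather than the Llama-style mlp.gate_proj/up_proj/down_proj layout the loader is looking for. A quick way to confirm this against a checkpoint, assuming safetensors shards (the filename below is illustrative):

```python
from safetensors import safe_open

# Point this at one of the model's .safetensors shards.
with safe_open("model-00001-of-00019.safetensors", framework="pt") as f:
    for key in sorted(f.keys()):
        if ".layers.0." in key:
            print(key)
# Prints names under model.layers.0.block_sparse_moe.* (the gate and the experts'
# w1/w2/w3 tensors) -- there is no model.layers.0.mlp.down_proj, which is exactly
# the key the error above complains about.
```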

CyberTimon commented 7 months ago

He is working on it. There is already a preview in the experimental branch (which works but is unoptimized).

turboderp commented 7 months ago

All done. For now.