turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Yi-Yi 2x34b+ merges generate very slowly. #293

Open · Ph0rk0z opened 8 months ago

Ph0rk0z commented 8 months ago

There have been a few people stacking Yi models, and the results are rather good. Unfortunately, they run slower than a 70B, especially when using their extended context.

Can anything be done? Is it because of how the sizes of the matrices end up? I don't hear much about this issue from GGUF users, so correct me if I'm wrong.

Some examples: https://huggingface.co/cloudyu/Mixtral_34Bx2_MoE_60B

https://huggingface.co/Weyaxi/Bagel-Hermes-2x34b

https://huggingface.co/Weyaxi/Cosmosis-3x34B

https://huggingface.co/Weyaxi/Astralis-4x34B

They do quantize correctly, at least the 2x: https://huggingface.co/LoneStriker/Bagel-Hermes-2x34b-6.0bpw-h6-exl2

turboderp commented 8 months ago

I looked at the Bagel-Hermes EXL2 model and I'm getting speeds roughly equivalent to a 70B model at the same bitrate.

This is not unexpected since the number of experts per token is set to two. That means it has almost as many parameters as a 70B model and some extra operations during inference to make up the difference. You can run it with -ept 1 or change the num_experts_per_tok value in the config.json to limit it to one expert. Then it runs faster but it's anyone's guess if it works any better than either of the 34B models it was made from.
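For reference, a minimal sketch of patching the expert count directly in the model's config.json (standard library only; the model path is hypothetical, and `num_experts_per_tok` is the field mentioned above):

```python
import json
from pathlib import Path

# Hypothetical path to a local EXL2 model directory.
config_path = Path("models/Bagel-Hermes-2x34b-6.0bpw-h6-exl2/config.json")

config = json.loads(config_path.read_text())
print("experts per token:", config.get("num_experts_per_tok"))

# Limit routing to a single expert per token (quality impact untested).
config["num_experts_per_tok"] = 1
config_path.write_text(json.dumps(config, indent=2))
```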

The 3x34B and 4x34B models at least should have almost the same per-token latency as 2x34B, so there's that.

As for the speed dropping with longer context, that's just how transformers work. GGUF isn't going to be any better in that respect, and (at least if you have it installed) ExLlama will use flash-attn which is still SOTA for exact attention on long contexts (i.e. not counting various context compression and sliding window methods.)
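(A quick way to confirm flash-attn is actually available in the environment; this only checks that the package imports, and assumes ExLlamaV2 falls back to its default attention path when it's missing:)

```python
# Minimal check that the flash-attn package is importable in this environment.
try:
    import flash_attn
    print("flash-attn found, version", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed")
```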

Ph0rk0z commented 8 months ago

I did perplexity tests on this model. It has to be run with all experts to be of any benefit, I think even on wikitext, since the router isn't trained.

As for speeds: on a 5-bit 70B with 8k max_seq_len I get 14-15 t/s without any serious context piled on, and roughly 22 t/s on the prompt. On Bagel-Hermes I only get 7-8. It is half as fast on the exact same prompt when outputting 512 tokens: 30s vs 60s total reply time. If I were getting equivalent speeds I wouldn't have brought it up.

turboderp commented 8 months ago

There is some overhead from the routing. It does roughly the same amount of processing as a 70B model, but the work is split into smaller portions, so you might be seeing extra per-launch overhead. What CPU are you running it on?
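Very roughly, the per-token routing looks like the sketch below (illustrative PyTorch, not ExLlamaV2's actual implementation; the shapes are simplified), which is where the extra kernel launches and CPU-side bookkeeping come from:

```python
import torch

# Illustrative sketch of top-k MoE routing at decode time (batch of one token).
# 7168 is Yi-34B's hidden size; the expert layers are simplified to one square
# matmul each rather than a full gate/up/down MLP.
hidden, n_experts, top_k = 7168, 2, 2

x = torch.randn(1, hidden)                              # one decoded token
router = torch.randn(hidden, n_experts)                 # routing weights
experts = [torch.randn(hidden, hidden) for _ in range(n_experts)]

probs = torch.softmax(x @ router, dim=-1)
weights, idx = torch.topk(probs, top_k)                 # pick top-k experts

# Each selected expert is a separate, smaller matmul plus a weighted sum,
# so a 2x34B step launches more kernels than one dense 70B-style layer.
out = torch.zeros_like(x)
for w, i in zip(weights[0], idx[0]):
    out = out + w * (x @ experts[int(i)])
```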

Ph0rk0z commented 8 months ago

Dual 3090 and xeon v4 for the CPU.

turboderp commented 8 months ago

That could be part of the reason at least. I'll have to do some profiling to see how different the CPU load is between 70B and 2x34B, but even the fastest Xeon v4 has fairly limited single-core performance.

yamosin commented 8 months ago

I have almost the same configuration as you: a Xeon E5-2676 v3 + 3x3090 (only using two). I only get 3.5 t/s on 2x34B, the same speed whether I use 4.65bpw or 6bpw, but I get 10~12 t/s on Goliath 3bpw. Although I can see the CPU core usage, a simple test shows that the number of cores does not affect t/s. This is Goliath running with usage limited to a single core, and the t/s doesn't change: (screenshot)

This is the CPU footprint when running 2x34B limited to 4 cores / 1 core: (screenshots)

INFO: Metrics: 50 tokens generated in 14.53 seconds (3.44 T/s, context 2467 tokens)
INFO:     127.0.0.1:53707 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: Metrics: 50 tokens generated in 14.25 seconds (3.51 T/s, context 2467 tokens)

I hope this provides some relevant information

Ph0rk0z commented 8 months ago

I'm also getting about 12-13 t/s on a 103B @ 3.5bpw. I loaded it with 8192 context. Maybe it's something about this architecture?

sat0r1r1 commented 5 months ago

Same question: on a 120B I can get about 12 t/s, but Yi-34Bx2-MoE-60B only gets about 3 t/s. I've tried exl2 at 3, 4, and 5 bits and the result is the same.