turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Qwen 2 inference problem #493

Closed · Sadeghi85 closed this 2 weeks ago

Sadeghi85 commented 3 weeks ago

Discussed in https://github.com/turboderp/exllamav2/discussions/492

Originally posted by **Sadeghi85**, June 7, 2024:

> I tried the [ExLlamaV2 quant of Qwen2-7B](https://huggingface.co/LoneStriker/Qwen2-7B-Instruct-8.0bpw-h8-exl2), but the output is gibberish. There is a discussion over at llama.cpp: https://github.com/ggerganov/llama.cpp/issues/7805
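
For reference, a minimal load-and-generate sketch of the path that shows the problem (the model path and prompt are placeholders; the standard autosplit loading flow from the exllamav2 examples is assumed):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/models/Qwen2-7B-Instruct-8.0bpw-h8-exl2"  # local copy of the HF quant

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# Reported symptom: the completion comes back as gibberish rather than coherent text.
print(generator.generate_simple("Write a short haiku about mountains.", settings, 100))
```
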
Sadeghi85 commented 3 weeks ago

There is also an issue on Qwen repo:

https://github.com/QwenLM/Qwen2/issues/485

turboderp commented 3 weeks ago

So I've narrowed it down to the attention function, and I've committed a possible solution to the dev branch. I say possible because it comes down to some internal switching logic in PyTorch that I'm not entirely sure about. But basically, since Torch 2.3.0 now supports lower-right causal masking (finally!), ExLlama can use SDPA instead of matmul attention. SDPA upcasts inside the fused attention kernel, which prevents the overflow, and at least Qwen2-7B seems to be working without flash-attn.
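
For reference, the relevant PyTorch 2.3 feature is the lower-right-aligned causal bias in `torch.nn.attention.bias`, which lets SDPA mask correctly when cached keys make the key sequence longer than the query. A minimal sketch of the idea (toy shapes, CUDA device assumed; not ExLlama's actual attention code):

```python
import torch
from torch.nn.attention.bias import causal_lower_right

# Toy decoding step: 4 new query positions attending over 16 cached
# key/value positions; batch 1, 8 heads, head_dim 128.
q = torch.randn(1, 8, 4, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 16, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 16, 128, dtype=torch.float16, device="cuda")

# Lower-right alignment: the last query row attends to every key,
# which is the alignment needed when q_len < kv_len.
mask = causal_lower_right(q.shape[-2], k.shape[-2])

# The fused SDPA kernels accumulate the softmax in higher precision,
# avoiding the FP16 overflow seen with explicit matmul attention.
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```
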

I'm not able to test xformers since I can't find a prebuilt wheel and the dependencies are broken at the moment.

There still seem to be some issues with the Q4 cache, working on those.

bartowski1182 commented 3 weeks ago

@turboderp any chance that that upcasting would benefit P40 performance by using FP32? 👀

What do you mean specifically by the xformers dependencies being broken? I'm using torch 2.3.0 and xformers with "no" issues, but maybe I'm missing something.

turboderp commented 3 weeks ago

xformers was working, but I currently don't have it installed, and I can't install it because Arch updated me to CUDA 12.5 and gcc13. I can't downgrade because earlier CUDA versions need gcc12, which I can't install alongside gcc13, and xformers refuses to compile because of incompatibilities with CUDA 12.5. So I'm kinda stuck unless I want to spend the next however many hours trying to get all the right versions of everything synced up. I guess with timeshift I'm pretty sure I won't completely brick my desktop, but it's still not a very appealing thought.

As for upcasting: no. Now that it will default to SDPA on Torch 2.3.0, attention should run smoother, and there are other places I could switch over to FP32 compute, but the matmul kernels specifically need some special attention to work in FP32. Perhaps it could be done... eh... so much else on the list too, though.

bartowski1182 commented 3 weeks ago

Have you considered Docker? I run CUDA 12.2 in Docker with torch 2.3.0 and xformers; I can walk you through it. It probably wouldn't be your endgame solution, but it would help you figure this out.

If the P40 performance isn't basically free, I wouldn't bother. GGUF performance is good enough for that specific card; exllamav2 should just stay the SOTA for SOTA cards rather than bend over backwards for tiny gains on ancient cards lol. Was just curious.

Ph0rk0z commented 2 weeks ago

I compile it in a conda environment to avoid this issue. For the P40, just xformers may be enough. I only have one Pascal card left in use at the moment, so I should try it and see what happens. For SD it automatically sped up inference regardless of my compute setting, and lots of other non-"SOTA" cards benefit from xformers too.

turboderp commented 2 weeks ago

Added Q8 cache mode now which seems to work great with Qwen2-7B.

bartowski1182 commented 2 weeks ago

Oh hell yes, been looking forward to Q8

turboderp commented 2 weeks ago

Q6 also works well with this model, available in v0.1.5 now

waterangel91 commented 2 weeks ago

So at the moment, Q4 is not working but Q6 and Q8 are?

turboderp commented 2 weeks ago

Correct. Though it's worth noting that Qwen2-7B already has a very small cache: at FP16 precision it's 56 kB per token, vs. 128 kB per token for Llama3-8B, or 512 kB per token for Llama2-7B (!).
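
Those per-token figures follow from each model's attention layout: 2 tensors (K and V) × layers × KV heads × head dim × 2 bytes at FP16. A quick back-of-the-envelope check (layer and head counts taken from the public model configs, so treat them as assumptions):

```python
def fp16_cache_per_token_kb(layers: int, kv_heads: int, head_dim: int) -> int:
    # K and V tensors, 2 bytes per element at FP16
    return 2 * layers * kv_heads * head_dim * 2 // 1024

print(fp16_cache_per_token_kb(28, 4, 128))   # Qwen2-7B  (GQA, 4 KV heads) -> 56
print(fp16_cache_per_token_kb(32, 8, 128))   # Llama3-8B (GQA, 8 KV heads) -> 128
print(fp16_cache_per_token_kb(32, 32, 128))  # Llama2-7B (full MHA)        -> 512
```
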

So overall, Qwen2-7B with Q6 cache still uses about 30% less VRAM per token than Llama3-8B with Q4 cache. For precision, I did some quick HumanEval tests, and it's within the margin of error from Q6 and up:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|---|---|---|---|---|---|
| Qwen2-7B | FP16 | Q4 | 19.74% | 46.34% | 40.72 |
| Qwen2-7B | FP16 | Q6 | 61.65% | 81.70% | 15.20 |
| Qwen2-7B | FP16 | Q8 | 62.37% | 81.09% | 15.18 |
| Qwen2-7B | FP16 | FP16 | 61.16% | 82.31% | 15.16 |

Sadeghi85 commented 2 weeks ago

I tested v0.1.5 and it's working, thanks.

What is the difference between the 8-bit cache and the Q caches? In v0.1.5, only the 8-bit cache doesn't work for me; all the Q caches are working.

turboderp commented 2 weeks ago

The 8-bit mode is FP8, and it's deprecated. It performs worse than Q4 in every respect. But Q4 is very unreliable for Qwen2-7B.
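
For anyone mapping the modes to code: the cache flavor is selected by which cache class you construct before loading the model. A minimal sketch (class names assumed from the v0.1.5 release; the model path is a placeholder):

```python
from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config,
    ExLlamaV2Cache,        # FP16 cache
    ExLlamaV2Cache_8bit,   # legacy FP8 cache (the deprecated "8-bit" mode)
    ExLlamaV2Cache_Q4,     # quantized caches
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
)

config = ExLlamaV2Config("/models/Qwen2-7B-Instruct-8.0bpw-h8-exl2")  # placeholder path
model = ExLlamaV2(config)

# Pick the cache implementation; per this thread, Q6/Q8 hold up for
# Qwen2-7B while Q4 and the old FP8 mode do not.
cache = ExLlamaV2Cache_Q8(model, lazy=True)
model.load_autosplit(cache)
```
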

waterangel91 commented 2 weeks ago

Can I ask if Qwen2 72B is working with the Q4 cache? I just tried it and it seemed to generate non-stop. It could also be an issue on my end.

turboderp commented 2 weeks ago

The 72B version seems to work fine with Q4:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|---|---|---|---|---|---|
| Qwen2-72B | 6.0bpw | Q4 | 70.36 | 87.19 | 10.31 |
| Qwen2-72B | 6.0bpw | Q6 | 69.32 | 85.36 | 10.26 |
| Qwen2-72B | 6.0bpw | Q8 | 71.28 | 85.36 | 10.23 |
| Qwen2-72B | 6.0bpw | FP16 | 70.8 | 83.5 | 10.17 |