turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Phi-3 Support #425

Closed · candre23 closed 2 months ago

candre23 commented 2 months ago

With many claiming that Phi-3-mini is uncannily good for its size, and with larger, actually useful Phi-3 models on the way, adding support for this architecture is almost certainly worthwhile.

nooobodynose commented 2 months ago

Phi-3-mini can be used as the draft model for Phi-3-medium in speculative decoding. Combined with the optimized kernels exllama has, the inference speed is going to be pretty crazy! I can't wait to see it :)
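
For anyone unfamiliar with the idea, here is a rough, illustrative sketch of the greedy draft-model loop. This is plain toy Python to show the technique, not exllamav2's actual API; target_next and draft_next are hypothetical stand-ins for the two models.

```python
# Illustrative sketch of draft-model speculative decoding (greedy variant).
# A cheap draft model proposes k tokens; the expensive target model checks them
# and accepts the longest agreeing prefix. A real implementation scores all k
# proposals in a single batched forward pass of the target model.

from typing import Callable, List

def speculative_step(
    target_next: Callable[[List[int]], int],   # target model: context -> next token (greedy)
    draft_next: Callable[[List[int]], int],    # draft model:  context -> next token (greedy)
    context: List[int],
    k: int = 4,
) -> List[int]:
    # 1. Draft k tokens autoregressively with the cheap model.
    proposal = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. Verify with the target model: keep draft tokens while the target agrees,
    #    then emit the target's own token at the first disagreement.
    accepted = []
    ctx = list(context)
    for t in proposal:
        t_target = target_next(ctx)
        if t_target == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(t_target)
            break
    else:
        accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted
```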

gabinguo commented 2 months ago

bump : )

turboderp commented 2 months ago

It's in the dev branch now. Model here. A bit more testing, then I'll release a new version.

CyberTimon commented 2 months ago

How much VRAM does a 32k / 128k context need, given that the 3.8B model doesn't have GQA?

turboderp commented 2 months ago

It works out to 384 kB per token if I'm not mistaken. So a 128k context would need 48 GB for the cache. It drops to 108 kB with Q4 cache, or 13.5 GB for the full context.

Here's hoping the larger Phi-3 models use GQA.
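
As a quick sanity check of those figures, here is the back-of-the-envelope arithmetic, assuming Phi-3-mini's published config (32 layers, hidden size 3072) and FP16 cache elements; the 108 kB/token Q4 figure is taken directly from the comment above.

```python
# Rough KV-cache size estimate for Phi-3-mini (no GQA), matching the figures above.
# Assumed config: 32 layers, hidden size 3072 (32 heads x 96 dims), FP16 = 2 bytes.

layers = 32
hidden = 3072
bytes_per_elem = 2                       # FP16
ctx = 128 * 1024                         # 128k context

kv_per_token = 2 * layers * hidden * bytes_per_elem    # keys + values
print(kv_per_token / 1024)               # 384.0 kB per token
print(kv_per_token * ctx / 1024**3)      # 48.0 GiB for the full 128k cache

q4_per_token = 108 * 1024                # per-token figure quoted above for Q4 cache
print(q4_per_token * ctx / 1024**3)      # 13.5 GiB with Q4 cache
```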

CyberTimon commented 2 months ago

I hope the larger Phi-3 models will at least be released. 13.5 GB for the full context with Q4 cache sounds great, though. Thanks for your work.

arbi-dev commented 2 months ago

Just tried the 6.0bpw model from turboderp's Phi-3 HF repo. With Q4 cache it loads nicely into 19 GB of VRAM. Chat completions with short context are fine, but unfortunately longer contexts, beyond a few hundred tokens, produce garbage. Possibly I am doing something wrong!
PS: using the latest exllamav2 (which I understand was just updated for Phi-3) and TabbyAPI (which now allows setting Q4 cache via the admin API).

turboderp commented 2 months ago

You need the dev version. I'm currently working on releasing 0.0.20, which will have the Phi stuff included.

I'm not seeing any issues with long context myself. Are you sure you're using the right formatting? Can you run it in the chatbot example perhaps?

python examples/chat.py -m /path/to/phi3-mini-128k-instruct-exl2/6.0bpw/ -mode phi3 -cq4
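
If you're prompting the model directly rather than through the chat example, it's worth checking the prompt against Phi-3's instruct template. As a rough sketch based on the published Phi-3 chat format (the -mode phi3 flag in chat.py applies formatting along these lines; this helper is just an illustration, not part of exllamav2):

```python
# Phi-3 instruct prompt layout (per the model card):
#   <|user|>\n{message}<|end|>\n<|assistant|>\n
def phi3_prompt(user_message: str) -> str:
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"

print(phi3_prompt("Explain GQA in one sentence."))
```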