Closed candre23 closed 2 months ago
phi3-mini can be used as the draft model for phi3-medium in speculative decoding... and with exllama's optimized kernels, both the output quality and the inference speed are going to be pretty crazy! I can't wait to see it : )
bump : )
It's in dev branch now. Model here. Bit more testing then I'll release a new version.
How much VRAM does 32k / 128k context need as 3.8b doesn't have GQA?
It works out to 384 kB per token if I'm not mistaken. So a 128k context would need 48 GB for the cache. It drops to 108 kB with Q4 cache, or 13.5 GB for the full context.
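The per-token figures above can be sanity-checked with a quick sketch. This assumes Phi-3-mini's published architecture (32 layers, 32 KV heads with no GQA, head dim 96, FP16 cache = 2 bytes/element) and treats the Q4 cache as roughly 4.5 bits/element including quantization overhead — the exact overhead in exllamav2 may differ:

```python
# Hedged sketch: KV-cache size for Phi-3-mini (3.8B, no GQA).
# Assumed architecture numbers: 32 layers, 32 KV heads, head_dim 96.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 96

def kv_bytes_per_token(bytes_per_element):
    # K and V each store (kv_heads * head_dim) values per layer.
    return int(2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element)

ctx = 128 * 1024  # 128k context

fp16 = kv_bytes_per_token(2.0)      # FP16: 2 bytes/element
q4   = kv_bytes_per_token(0.5625)   # assumed ~4.5 bits/element with overhead

print(fp16 // 1024, "KiB/token FP16")          # 384 KiB/token
print(fp16 * ctx / 1024**3, "GiB at 128k")     # 48.0 GiB
print(q4 // 1024, "KiB/token Q4")              # 108 KiB/token
print(q4 * ctx / 1024**3, "GiB at 128k")       # 13.5 GiB
```

Both results line up with the 48 GB / 13.5 GB figures quoted above. With GQA (fewer KV heads than attention heads) the `KV_HEADS` factor shrinks, which is why the larger models having GQA would matter so much here.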
Here's hoping the larger Phi-3 models use GQA.
I hope the larger Phi-3 models will at least be released. 13.5 GB for full context with Q4 cache sounds great, though. Thanks for your work.
Just tried the 6.0bpw model from the turboderp Phi3 repo on HF. With Q4 cache it loads nicely into 19 GB of VRAM. Chat completions for short contexts are fine, but unfortunately longer contexts (beyond a few hundred tokens) produce garbage. Possibly I am doing something wrong!
PS: using the latest exllamav2 (which I understand was just updated for Phi3) and TabbyAPI (which now allows setting Q4 cache via the admin API).
You need the dev version. Currently working on releasing 0.0.20 which will have the Phi stuff included.
I'm not seeing any issues with long context myself. Are you sure you're using the right formatting? Can you run it in the chatbot example perhaps?
python examples/chat.py -m /path/to/phi3-mini-128k-instruct-exl2/6.0bpw/ -mode phi3 -cq4
With many claiming that phi3-mini is uncannily good for its size, and with larger, actually-useful phi3 models on the way, adding support for this arch is almost certainly worthwhile.