huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Planned/Potential significant work #1819

Open · Narsil opened this issue 3 months ago

Narsil commented 3 months ago
josephrocca commented 1 month ago

FWIW, I've experienced basically no degradation in my internal evals with a Llama2-70B AWQ model + 4-bit KV cache using LMDeploy. They have a little public eval data here:

This becomes a big advantage with KV-cache prefix caching, since you can store twice as many concurrent chats' caches in VRAM, which substantially increases the cache hit rate for me.
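For anyone who wants to sanity-check the memory side of that, here's a rough back-of-the-envelope sketch using Llama-2-70B's published shape (80 layers, 8 KV heads with GQA, head_dim 128). Exact savings depend on the engine's cache layout, any per-block scales/zero-points for the quantized cache, and which baseline precision you compare against:

```python
# Back-of-the-envelope KV-cache size per token for a Llama-2-70B-shaped model.
# Illustrative only: real engines add per-block scales/zeros for quantized caches
# and allocate in fixed-size blocks, so treat these as ballpark numbers.
num_layers, num_kv_heads, head_dim = 80, 8, 128  # Llama-2-70B (GQA)

def kv_bytes_per_token(bits_per_element: int) -> float:
    # Factor of 2 covers keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_element / 8

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{kv_bytes_per_token(bits) / 1024:.0f} KiB of KV cache per token")

# fp16 ≈ 320 KiB/token, int8 ≈ 160 KiB/token, int4 ≈ 80 KiB/token.
# Every halving of KV-cache bytes roughly doubles how many concurrent
# sequences fit in a fixed VRAM budget, which is what pushes the
# prefix-cache hit rate up.
```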

Overall, I got a ~4x real-world reduction in cost via the 4-bit KV cache + prefix caching. I wasn't expecting anything close to that before trying it.
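In case it's useful as a reference point, this is roughly the shape of the LMDeploy config I mean. The parameter names (`quant_policy`, `enable_prefix_caching`, `cache_max_entry_count`) and the model ID are from my setup and may differ across LMDeploy versions, so treat this as a sketch rather than a drop-in config:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch of the engine config (check your LMDeploy version's TurbomindEngineConfig
# for the exact fields):
#   quant_policy=4         -> online 4-bit KV-cache quantization (8 would be int8)
#   enable_prefix_caching  -> reuse cached KV blocks for shared prompt prefixes
engine_config = TurbomindEngineConfig(
    quant_policy=4,
    enable_prefix_caching=True,
    cache_max_entry_count=0.9,  # fraction of free VRAM given to the KV cache
)

# Illustrative AWQ checkpoint; substitute whatever AWQ-quantized model you serve.
pipe = pipeline("TheBloke/Llama-2-70B-AWQ", backend_config=engine_config)
print(pipe(["Hello, world"]))
```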


Wondering if there's any rough expected timeline on FP8 KV-cache + prefix caching/reuse? I.e., is it planned for soon, or still likely at least a few months away?

github-actions[bot] commented 2 days ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.