Planned/Potential of significant work

Narsil commented 3 months ago

[ ] Fp8 kv-cache
[ ] Kv-cache prefix reuse
[ ] Grammar constrained speedup
[ ] torch.compile like speedups
[ ] Simple one-liner pip install
[ ] Multi lora support (lorax kind of)
[x] Marlin quantization
[x] Exl2 quants (non gptq 2, 3, 4.5 bpw)
[ ] Adding more documentation/help regarding getting production grafana dashboards easily setup.

josephrocca commented 1 month ago

FWIW, I've experienced basically no degradation in my internal evals with a Llama2-70B AWQ model + 4 bit kv-cache using LMDeploy. They have a little public eval data here:

https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/kv_quant.md

This becomes a big advantage with KV-cache prefix caching, since you can store twice as many concurrent chats in VRAM, which increases the cache hit rate for me substantially.

Overall, I got a ~4x real-world reduction in cost via 4-bit KV-cache + prefix caching. I was not expecting anything close to that before trying this.

Wondering if there's any super rough expected timeline on FP8 KV-cache + prefix caching/reuse? I.e. if planned for soon, or still likely a few months away at least.

github-actions[bot] commented 2 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

huggingface / text-generation-inference

Planned/Potential of significant work #1819