Narsil opened this issue 3 months ago (status: Open)
FWIW, I've experienced basically no degradation in my internal evals with a Llama2-70B AWQ model + 4-bit KV-cache using LMDeploy. They have a little public eval data here:
This becomes a big advantage with KV-cache prefix caching, since you can store twice as many concurrent chats in VRAM, which increases the cache hit rate for me substantially.
Overall, I got a ~4x real-world reduction in cost via 4-bit KV-cache + prefix caching. I was not expecting anything close to that before trying this.
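To make the capacity argument above concrete, here is a back-of-envelope sketch of KV-cache size per token at different precisions. The model dimensions (80 layers, 8 KV heads via GQA, head dim 128) are my assumptions for Llama2-70B and should be checked against the actual model config; the point is just that halving the cache precision doubles how many concurrent sequences fit in the same VRAM.

```python
# Rough KV-cache sizing for a Llama2-70B-like model.
# Assumed dimensions (verify against the real config.json):
layers, kv_heads, head_dim = 80, 8, 128

def kv_bytes_per_token(bits: int) -> float:
    # 2 tensors (K and V) per layer, one head_dim vector per KV head.
    return 2 * layers * kv_heads * head_dim * bits / 8

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {kv_bytes_per_token(bits) / 1024:.0f} KiB/token")
# 16-bit: 320 KiB/token, 8-bit: 160 KiB/token, 4-bit: 80 KiB/token
```

So relative to an FP16 cache, a 4-bit cache holds 4x the tokens, and relative to an 8-bit one, 2x; with prefix caching, that extra headroom translates directly into a higher cache hit rate.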
Is there any rough expected timeline for FP8 KV-cache + prefix caching/reuse? That is, is it planned for soon, or still likely at least a few months away?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.