jy-yuan / KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
https://arxiv.org/abs/2402.02750
MIT License

Integrate KIVI into inference frameworks? #4

Closed: darrenglow closed this 4 days ago

darrenglow commented 1 month ago

This work is amazing. I want to know if you have any plans to integrate it into other inference frameworks, like vLLM?

zirui-ray-liu commented 1 month ago

Awesome idea! We are currently learning the low-level code of vLLM. One potential challenge is managing per-channel quantization under continuous batching. Per-channel quantization compresses the key cache in groups of 32 or 64 tokens at a time. With continuous batching, however, as soon as one request finishes, another is immediately queued in its place. So we need to handle the corner case where a quantization group would otherwise span two requests: tokens from separate requests must be quantized independently, with no mix-ups. We do not want to mix requests, because mixing tokens from different requests could leak information from one request into another.
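
For illustration, here is a minimal sketch (not the KIVI kernels or the vLLM cache manager) of per-channel asymmetric quantization of a key cache in token groups, with groups formed per request so that a group never crosses a request boundary. The function names, tensor layout, and the handling of the leftover tail are assumptions made for this sketch.

```python
# Minimal sketch: per-channel asymmetric quantization of key caches in token
# groups, grouped per request so continuous batching never mixes requests.
# Names and layout are hypothetical, not the KIVI / vLLM implementation.
import torch

def quantize_per_channel(group: torch.Tensor, n_bits: int = 2):
    """Asymmetric quantization of one token group along the channel axis.

    group: [group_size, head_dim] key-cache slice from a single request.
    Returns integer codes plus a per-channel scale and zero-point.
    """
    qmax = 2 ** n_bits - 1
    # Per-channel statistics: one (min, max) pair per channel, taken over the
    # tokens of this group (dim 0 = tokens).
    mn = group.min(dim=0, keepdim=True).values
    mx = group.max(dim=0, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-6) / qmax
    zero_point = mn
    codes = ((group - zero_point) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, zero_point

def quantize_key_cache(keys: torch.Tensor, seq_lens: list[int], group_size: int = 32):
    """Quantize a batch of key caches one request at a time.

    keys: [total_tokens, head_dim], tokens of all requests concatenated.
    seq_lens: number of tokens belonging to each request, in order.
    Groups are formed inside each request only, so when a finished request is
    replaced by a new one (continuous batching), the two never share a group.
    """
    out, offset = [], 0
    for n in seq_lens:
        req_keys = keys[offset:offset + n]
        offset += n
        # Only full groups are quantized in this sketch; the leftover tail of
        # each request is simply left in full precision here.
        n_full = (n // group_size) * group_size
        for start in range(0, n_full, group_size):
            out.append(quantize_per_channel(req_keys[start:start + group_size]))
    return out
```

The per-request loop is the point of the sketch: request boundaries come from `seq_lens`, so a quantization group can only contain tokens from one request, which is the independence property described above.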