jy-yuan / KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
https://arxiv.org/abs/2402.02750
MIT License

Integrate KIVI into inference frameworks? #4

Closed: darrenglow closed this 4 days ago

darrenglow commented 1 month ago

This work is amazing. I want to know if you have any plans to integrate it into other inference frameworks, like vLLM?

zirui-ray-liu commented 1 month ago

Awesome idea! We are currently learning the low-level code of vLLM. One potential challenge is managing per-channel quantization under continuous batching. Per-channel quantization compresses the key cache in groups of 32 or 64 tokens at a time. With continuous batching, however, as soon as one request finishes, another is immediately queued in its place. So we need to handle the corner case where a quantization group would otherwise span two requests: tokens from separate requests must be quantized independently, with no mix-ups. We do not want to mix requests, because mixing tokens from different requests could leak information from one request into another.
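
For illustration, here is a minimal sketch (not the KIVI kernels or the vLLM cache manager) of per-channel asymmetric quantization of a key cache in token groups, with groups formed per request so that a group never crosses a request boundary. The function names, tensor layout, and the handling of the leftover tail are assumptions made for this sketch.

```python
# Minimal sketch: per-channel asymmetric quantization of key caches in token
# groups, grouped per request so continuous batching never mixes requests.
# Names and layout are hypothetical, not the KIVI / vLLM implementation.
import torch

def quantize_per_channel(group: torch.Tensor, n_bits: int = 2):
    """Asymmetric quantization of one token group along the channel axis.

    group: [group_size, head_dim] key-cache slice from a single request.
    Returns integer codes plus a per-channel scale and zero-point.
    """
    qmax = 2 ** n_bits - 1
    # Per-channel statistics: one (min, max) pair per channel, taken over the
    # tokens of this group (dim 0 = tokens).
    mn = group.min(dim=0, keepdim=True).values
    mx = group.max(dim=0, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-6) / qmax
    zero_point = mn
    codes = ((group - zero_point) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, zero_point

def quantize_key_cache(keys: torch.Tensor, seq_lens: list[int], group_size: int = 32):
    """Quantize a batch of key caches one request at a time.

    keys: [total_tokens, head_dim], tokens of all requests concatenated.
    seq_lens: number of tokens belonging to each request, in order.
    Groups are formed inside each request only, so when a finished request is
    replaced by a new one (continuous batching), the two never share a group.
    """
    out, offset = [], 0
    for n in seq_lens:
        req_keys = keys[offset:offset + n]
        offset += n
        # Only full groups are quantized in this sketch; the leftover tail of
        # each request is simply left in full precision here.
        n_full = (n // group_size) * group_size
        for start in range(0, n_full, group_size):
            out.append(quantize_per_channel(req_keys[start:start + group_size]))
    return out
```

The per-request loop is the point of the sketch: request boundaries come from `seq_lens`, so a quantization group can only contain tokens from one request, which is the independence property described above.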