Closed · yukavio closed this issue 6 months ago
Any update on this new feature? I am looking forward to it. In my scenario I mainly use A10 cards, and an int4 KV cache would effectively increase my KV-cache capacity.
I am currently focused primarily on another issue; I plan to start working on the int4 KV cache in May.
I'm very sorry, but for various reasons I need to put development of this issue on hold. If others are interested in this feature, we can reopen the issue.
The problem is that vLLM doesn't support exl2, which would have given us many more options, so we are pretty much stuck with AWQ quantisation. Currently vLLM with Llama 3 70B doesn't fit properly on a 48 GB GPU despite 4-bit AWQ quantisation. I even enabled enforce_eager, but it still runs out of memory sometimes; it is not stable enough.
Having this Q4 Cache will reduce the VRAM usage a bit more, which would be very helpful.
Of course the best solution would be supporting exl2, in which case this feature could be de-prioritised. But right now it's difficult to justify vLLM when Aphrodite supports exl2 out of the box and fits properly on a 48 GB GPU.
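For context, a back-of-the-envelope calculation (a sketch only, assuming Llama 3 70B's published configuration of 80 layers, 8 KV heads under GQA, and head dimension 128) shows why the KV cache matters here on top of the ~35 GB of 4-bit weights:

```python
# Rough KV-cache footprint per token for Llama 3 70B.
# Assumed model config (from the published architecture): 80 layers,
# 8 KV heads (GQA), head_dim 128; K and V each stored per layer.
layers, kv_heads, head_dim = 80, 8, 128
bytes_fp16 = 2

per_token_fp16 = layers * 2 * kv_heads * head_dim * bytes_fp16  # K and V
per_token_int4 = per_token_fp16 // 4  # 4 bits vs 16 bits, ignoring scale overhead

print(per_token_fp16)  # 327680 bytes = 320 KiB per token in fp16
print(per_token_int4)  # 81920 bytes = 80 KiB per token in int4
```

So a 32k-token context costs roughly 10 GB of KV cache in fp16 but only ~2.5 GB in int4, which is the difference between fitting and OOM on a 48 GB card.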
Sorry, forgot to tag you @yukavio. Thanks!
Hi, is there any plan to support 4-bit KV Cache recently?
🚀 The feature, motivation and pitch
Summary
We would like to support a 4-bit KV cache for the decoding phase. The goal of this feature is to reduce the GPU memory used by the KV cache when processing long texts: a 4-bit KV cache would let us handle more, and longer, texts when GPU memory is limited. Although vLLM already has an fp8 implementation, int4 can reduce GPU memory usage further and also works on devices that do not support the fp8 data format, such as the A100.
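The core idea can be sketched in NumPy (an illustrative sketch only; the actual feature would be CUDA kernels operating on vLLM's paged KV-cache blocks, and the grouping/zero-point scheme here is an assumption, not the proposed design). Each row is quantized to integers in [0, 15] with a per-row scale and zero-point, and two 4-bit values are packed per byte:

```python
import numpy as np

def quantize_int4(x):
    # Per-row asymmetric quantization to [0, 15].
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid div-by-zero on constant rows
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    # Pack two 4-bit codes per byte: 4x smaller than fp16 storage.
    packed = (q[..., 0::2] << 4) | q[..., 1::2]
    return packed, scale, lo

def dequantize_int4(packed, scale, lo):
    q = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    q[..., 0::2] = (packed >> 4) & 0x0F
    q[..., 1::2] = packed & 0x0F
    return q.astype(np.float32) * scale + lo

# A toy "KV block": 4 tokens x 16 channels.
x = np.random.randn(4, 16).astype(np.float32)
packed, scale, lo = quantize_int4(x)
x_hat = dequantize_int4(packed, scale, lo)
# packed is 32 bytes vs 128 bytes for fp16, at the cost of
# a per-row quantization error of at most scale / 2.
```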
Methods
Regarding the specific implementation, we propose developing three operations:
Alternatives
No response
Additional context
No response