vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support for 4-bit KV Cache in paged-attention op #4025

Closed · yukavio closed this issue 6 months ago

yukavio commented 7 months ago

🚀 The feature, motivation and pitch

Summary

We would like to support a 4-bit KV cache for the decoding phase. The goal of this feature is to reduce the GPU memory used by the KV cache when processing long texts. A 4-bit KV cache would allow us to handle more and longer sequences when GPU memory is limited. Although vLLM already has an fp8 implementation, int4 can reduce GPU memory usage further and can be used on devices that do not support the fp8 data format, such as the A100.

Methods

Regarding the specific implementation, we propose developing three operations (a rough sketch of the quantization/dequantization logic follows the list):

  1. Develop an operation to calculate the scale and zero point required for quantizing the KV cache and convert the existing fp16/bf16 KV cache to the int4 format.
  2. Provide support for storing the 4-bit KV cache in the `write_to_paged_cache` operation.
  3. Enhance the paged-attention operation to support calculations with int4 KV cache:
    • Add optional inputs `k_scale`, `k_zeropoint`, `v_scale`, and `v_zeropoint` to the paged-attention operation.
    • In the paged-attention kernel, if quantization-related parameters are detected, read the int4 KV cache stored in the GPU's global memory, convert it to fp16/bf16 representation, and perform subsequent calculations.
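
To make the proposal concrete, here is a minimal PyTorch sketch of an asymmetric int4 quantize/dequantize pair over the head_size dimension. The function names, the per-vector grouping, and the packing layout below are only illustrative assumptions, not the kernel interface we would actually implement:

```python
# Illustrative only: asymmetric int4 quantization of a KV tensor with two
# values packed per byte, plus the dequantization the paged-attention kernel
# would perform on the fly. Grouping granularity (per head vector) is assumed.
import torch

def quantize_kv_int4(kv: torch.Tensor):
    """kv: fp16/bf16 tensor of shape [..., head_size] (head_size even)."""
    lo = kv.amin(dim=-1, keepdim=True)
    hi = kv.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-5) / 15.0               # 4-bit range is 0..15
    zero_point = (-lo / scale).round().clamp(0, 15)
    q = (kv / scale + zero_point).round().clamp(0, 15).to(torch.uint8)
    packed = q[..., 0::2] | (q[..., 1::2] << 4)            # two int4 values per byte
    return packed, scale, zero_point

def dequantize_kv_int4(packed, scale, zero_point, dtype=torch.float16):
    """Unpack the int4 nibbles and map them back to fp16/bf16 before attention."""
    lo_nib = (packed & 0x0F).to(dtype)
    hi_nib = (packed >> 4).to(dtype)
    q = torch.stack((lo_nib, hi_nib), dim=-1).flatten(-2)  # restore original order
    return (q - zero_point) * scale
```

A real implementation would fuse the dequantization into the CUDA paged-attention kernel (step 3) rather than materialising an fp16 tensor, but the arithmetic would be the same.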

Alternatives

No response

Additional context

No response

smallsunsun1 commented 6 months ago

Any update on this feature? I am looking forward to it. In my scenario I mainly use A10 cards, and an int4 KV cache would effectively increase the amount of KV cache I can hold.

yukavio commented 6 months ago

I am currently focused primarily on another issue, and I plan to start working on the int4 KV cache in May.

yukavio commented 6 months ago

I'm very sorry, but for various reasons I need to put the development of this issue on hold. If anyone else is interested in this feature, we can reopen the issue.

houmie commented 6 months ago

The problem is that vLLM doesn't support exl2, which would have given us many more options, so we are pretty much stuck with AWQ quantisation. Currently vLLM with Llama 3 70B doesn't fit properly on a 48 GB GPU despite 4-bit AWQ quantisation. I even enabled enforce_eager, but it still runs out of memory sometimes; it is not stable enough.
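
For reference, the setup I'm describing is roughly the following; the checkpoint path is just a placeholder for any 4-bit AWQ build of Llama 3 70B, and the memory-related values are only illustrative:

```python
# Rough illustration of the configuration discussed above. The model path is a
# placeholder for any 4-bit AWQ checkpoint of Llama 3 70B; the memory-related
# values are illustrative, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Meta-Llama-3-70B-Instruct-AWQ",  # placeholder checkpoint
    quantization="awq",
    enforce_eager=True,              # skip CUDA graph capture to save some VRAM
    gpu_memory_utilization=0.95,
    max_model_len=4096,              # a smaller context is the main remaining lever
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```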

Having this 4-bit KV cache would reduce VRAM usage a bit more, which would be very helpful.

Of course the best solution would be supporting exl2; then this feature could be de-prioritised. But right now it's difficult to justify vLLM when Aphrodite supports exl2 out of the box and fits properly on a 48 GB GPU.

houmie commented 6 months ago

Sorry, forgot to tag you @yukavio. Thanks.

SherrySwift commented 1 month ago

Hi, is there any plan to support the 4-bit KV cache any time soon?

> I am currently focused primarily on another issue, and I plan to start working on the int4 KV cache in May.