vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: Support sparse KV cache framework #5751

Open chizhang118 opened 4 weeks ago

chizhang118 commented 4 weeks ago

Motivation

In current large-model inference, the KV cache occupies a significant portion of GPU memory, so reducing its size is an important direction for improvement. Several recent papers have approached this problem from different angles (see the table for a detailed comparison), including:

When addressing the sparse KV cache issue, we previously considered three options: supporting quantization (vLLM has already implemented this); implementing quantization + outlier + residual as in GEAR (not widely applicable, because it requires generating the outlier and residual at each token-generation step, which is costly); and implementing KV cache accumulation + appendix (not widely applicable, because it requires models to be trained using the same method). In the end, the idea is to implement partial KV cache eviction, aiming primarily for generality and abstraction rather than being specific to one or two approaches. Since six of the sparse KV cache methods we found are based on evicting cache entries, this approach is also well suited to being built as a framework and integrated into vLLM.
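As a rough illustration of what eviction-based methods have in common, the sketch below picks which cached tokens to keep under a fixed budget. It uses a heavy-hitter style heuristic (keep the most recent tokens, then the older tokens with the highest accumulated attention scores). The function name, parameters, and scoring rule are assumptions made for illustration; this is not the interface proposed in the PR or the exact algorithm of any specific paper.

```python
from typing import List


def select_tokens_to_keep(
    accumulated_scores: List[float],
    budget: int,
    num_recent: int,
) -> List[int]:
    """Return the sorted indices of cached tokens to keep.

    Hypothetical heuristic for illustration only: always keep the
    `num_recent` most recent tokens, then fill the remaining budget
    with the highest-scoring older tokens.
    """
    n = len(accumulated_scores)
    if budget >= n:
        # Nothing needs to be evicted.
        return list(range(n))

    num_recent = min(num_recent, budget)
    recent = list(range(n - num_recent, n))
    older = range(n - num_recent)
    num_heavy = budget - num_recent

    # Rank older tokens by score (descending); ties go to the earlier token.
    heavy = sorted(older, key=lambda i: (-accumulated_scores[i], i))[:num_heavy]
    return sorted(heavy + recent)


# Toy accumulated attention scores for 8 cached tokens.
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.1, 0.4]
keep = select_tokens_to_keep(scores, budget=4, num_recent=2)
print(keep)  # [0, 3, 6, 7]
```

A framework built around eviction would mostly need to vary the scoring rule and the budget, which is one way to read the "generality and abstraction" goal above.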

Sparse KV Cache Workflow

First, let's clarify the required parameters, including:

The entire workflow includes:

Proposed Change

Modified files mainly include

PR

PR link: https://github.com/vllm-project/vllm/pull/5752

Design doc

https://docs.google.com/document/d/13_cpb31P9VOmPGa_tZ70s7z1vXGP_UenXf1WVuIppCk/

Feedback Period.

No response

CC List.

@simon-mo @youkaichao @zhuohan123 @cadedaniel @ywang96 @WoosukKwon @LiuXiaoxuanPKU

Any Other Things.

No response

robertgshaw2-neuralmagic commented 4 weeks ago

Very exciting!

thesues commented 4 weeks ago

How much GPU memory can be saved? Do you have any benchmark data?

chizhang118 commented 4 weeks ago

This depends on the sparse KV cache compression ratio. Based on current papers, a compression ratio of 20% (the fraction of the cache that is kept) is a rough number, which means an 80% reduction. We are still waiting for feedback from the community, so there is no benchmark data yet.
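For a sense of scale, here is a back-of-the-envelope calculation of what a 20% keep ratio would mean for one sequence. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 4096-token context are assumptions chosen for illustration, not numbers from this RFC, and the estimate ignores block-allocation overhead and any metadata an eviction policy needs.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_elem: int = 2,  # fp16
) -> int:
    """Approximate KV cache size for one sequence, in bytes."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Assumed model shape and context length (illustrative only).
layers, kv_heads, head_dim, tokens = 32, 32, 128, 4096
keep_ratio = 0.2  # fraction of cached tokens retained after eviction

dense = kv_cache_bytes(layers, kv_heads, head_dim, tokens)
kept_tokens = int(tokens * keep_ratio)
sparse = kv_cache_bytes(layers, kv_heads, head_dim, kept_tokens)

gib = 1024 ** 3
print(f"dense:  {dense / gib:.2f} GiB")
print(f"sparse: {sparse / gib:.2f} GiB")
print(f"saved:  {100 * (1 - sparse / dense):.1f}%")
```

Under these assumptions the dense cache is 2 GiB per sequence, so the saving is roughly 1.6 GiB per 4096-token sequence. Actual savings would have to be confirmed by benchmarks.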

Zefan-Cai commented 3 weeks ago

Would you mind adding newly proposed KV cache compression methods other than SnapKV and H2O (e.g., PyramidKV)?

chizhang118 commented 3 weeks ago

Sure, it should not be difficult to add based on the current framework. Will be on my radar. Thanks!

Zefan-Cai commented 3 weeks ago

Super cool! Thank you so much for your efforts!

simon-mo commented 3 weeks ago

This is exciting indeed. A few things:

Zefan-Cai commented 3 weeks ago

Would you mind @-mentioning me when the new method is added? I can't wait to try it with vLLM!

dongxiaolong commented 2 weeks ago

https://github.com/microsoft/MInference Is there a way to combine dynamic sparse attention with a sparse KV cache? A vLLM implementation is provided in this repo.

Zefan-Cai commented 2 weeks ago

This repo does not provide a sparse KV cache implementation for vLLM; it only provides HF ones.

dongxiaolong commented 2 weeks ago

for vLLM,

from vllm import LLM, SamplingParams
+from minference import MInference

llm = LLM(model_name, max_num_seqs=1, enforce_eager=True, max_model_len=128000)

# Patch MInference Module
+minference_patch = MInference("vllm", model_name)
+llm = minference_patch(llm)

outputs = llm.generate(prompts, sampling_params)

using only the kernel,

from minference import vertical_slash_sparse_attention, block_sparse_attention, streaming_forward

attn_output = vertical_slash_sparse_attention(q, k, v, vertical_topk, slash)
attn_output = block_sparse_attention(q, k, v, topk)
attn_output = streaming_forward(q, k, v, init_num, local_window_num)

For more details, please refer to our Examples and Experiments. You can find more information about the dynamic compiler PIT in this paper and on GitHub.

Zefan-Cai commented 2 weeks ago

Are you an author of this repo? The code you attached does not seem to contain a sparse KV cache implementation, and neither does the Examples folder. Am I missing something?

dongxiaolong commented 2 weeks ago

I am not an author of this repo. It's not a sparse KV cache; it's sparse attention. Don't the two have something in common?
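A toy sketch may help separate the two ideas discussed here. Both restrict a query to a subset of cached tokens, but sparse attention chooses the subset at compute time and leaves the cache intact, while a sparse KV cache drops entries permanently, so memory shrinks and later queries can no longer see the evicted tokens. Everything below (the helper names, the top-k selection rule, the tiny vectors) is an assumption for illustration and is not the API of MInference, vLLM, or the PR.

```python
import math
from typing import List, Sequence

Vector = List[float]


def attend(q: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> Vector:
    """Plain softmax attention for a single query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def top_k_indices(q: Vector, keys: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k keys with the largest dot product with q (hypothetical rule)."""
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
values = [[1.0], [2.0], [3.0], [4.0]]
q = [1.0, 0.0]
idx = top_k_indices(q, keys, k=2)

# Sparse attention: compute over a subset, but the cache stays full.
sparse_attn_out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
cache_len_sparse_attention = len(keys)

# Sparse KV cache: evict everything outside the subset, then attend.
kept_keys = [keys[i] for i in idx]
kept_values = [values[i] for i in idx]
sparse_cache_out = attend(q, kept_keys, kept_values)
cache_len_sparse_kv = len(kept_keys)

print(idx, cache_len_sparse_attention, cache_len_sparse_kv)
```

For this single query the two outputs are identical. The difference shows up in memory (4 cached entries versus 2) and in later decoding steps, where only the sparse-attention variant can still select tokens 1 and 3.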

PatchouliTIS commented 1 week ago

Great work! However, I noticed that your implementation only supports the memory-efficient attention backend from xformers. Do you think it would be a lot of work to adapt it to FlashAttention 2 with the current architecture? Or do you plan to support FlashAttention 2 in the future? https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flash_attn.py

PatchouliTIS commented 1 week ago

By the way, I tried a long prompt in your framework and found that in the long-prompt scenario (approximately 3k tokens) the outputs make no sense: the model just repeats some tokens until it reaches the output limit. Could this be related to the sparse KV cache implementation?