vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

StreamingLLM support? #1253

Open nivibilla opened 9 months ago

nivibilla commented 9 months ago

Hey,

This looks like a really interesting solution to the KV cache problem for long context. https://github.com/mit-han-lab/streaming-llm

I was wondering if it could be implemented here. From the looks of things it doesn't change anything about the model itself; it's more about how the KV cache is implemented.

They show that they can maintain coherent inference over millions of tokens.

Thanks!

Guangxuan-Xiao commented 8 months ago

Hello vLLM Team,

I'd like to start by expressing my appreciation for your dedication in developing the vLLM framework; tracking the project's evolution has been exciting. I believe that integrating StreamingLLM would be of great value to many users.

To streamline a possible integration, here is a suggested approach based on StreamingLLM's structure:

  1. Sliding Window KV Cache: Integrate the sliding window KV cache tailored for extensive generation tasks.

  2. Initial Token Persistence: Ensure that the KV of the starting tokens (e.g., the first page of tokens) is always kept within the current context window.

  3. Rotary Embedding Caching: Cache the key states before applying the rotary embedding, and re-apply the rotary positional embedding based on positions within the cache during the generation stage (a rough sketch of steps 1-3 follows below).
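
To make steps 1-3 concrete, here is a minimal, framework-agnostic sketch of the idea: keys are stored before RoPE, a few "attention sink" tokens are never evicted, and rotary embeddings are re-applied using each key's position inside the cache rather than its original absolute position. The names (`StreamingKVCache`, `apply_rotary`) are illustrative, not vLLM APIs, and the single-head layout is a simplification.

```python
import torch

def apply_rotary(x, positions, rope_theta=10000.0):
    # Minimal interleaved RoPE over x of shape [seq_len, head_dim].
    dim = x.shape[-1]
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # [seq_len, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class StreamingKVCache:
    """Keeps `num_sink` initial tokens plus a sliding window of recent tokens.
    Keys are stored *before* RoPE; RoPE is re-applied using positions inside
    the cache, so relative distances stay consistent after eviction."""

    def __init__(self, num_sink=4, window=1020):
        self.num_sink, self.window = num_sink, window
        self.keys, self.values = [], []   # per-token [head_dim] tensors, keys pre-RoPE

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        # Steps 1 + 2: evict the oldest non-sink token once the budget is
        # exceeded, so the first `num_sink` tokens are always retained.
        if len(self.keys) > self.num_sink + self.window:
            del self.keys[self.num_sink]
            del self.values[self.num_sink]

    def rotated_keys(self):
        # Step 3: positions are assigned by slot in the cache, not by the
        # token's original absolute index in the stream.
        k = torch.stack(self.keys)        # [cache_len, head_dim]
        return apply_rotary(k, torch.arange(k.shape[0]))
```

Making this work with vLLM's paged attention kernels is of course the hard part, since those kernels expect keys that already have positional information baked in.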

Following this approach, I'm confident that StreamingLLM can be integrated into the vLLM framework. The community, myself included, is enthusiastic about the potential of this feature!

Best, Guangxuan

MichaelZhouwang commented 6 months ago

Is there any update on this feature?

Kaiyang-Chen commented 3 months ago

Hi @Guangxuan-Xiao, I believe this feature is quite meaningful and I'm interested in helping to implement it. However, after some initial research, I don't see a straightforward and efficient way to do it. The keys stored in the KV cache already have the rotary positional embedding applied at their absolute positions (they are produced by the QKV linear projection and rotated before being written to the cache), whereas StreamingLLM requires the relative positions of the cached tokens to change as the window slides.

The most naive solution would be to store only the key/value outputs of the QKV linear projection, before any positional information is applied, and then add positional information to the whole context window right before the attention computation. This approach would require new CUDA kernels to handle such inputs, introduce recomputation of the cached keys at every step, and need extra memory for the intermediate key states. Moreover, if we try to leverage the paged KV cache, then whenever token eviction or replacement happens we would need extra data copying (or extra space to store an index) to keep the tokens in fixed blocks while maintaining their relative order. I think throughput would decrease significantly. Such a change also departs completely from the original memory management logic and would be a significant modification, and I'm not sure the team would accept making the codebase this complex for one feature. A rough sketch of the per-step cost under this scheme is below.
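
For illustration only (these helpers are not vLLM internals, and `rope_fn` is an assumed rotary-embedding function): under the "store pre-RoPE keys" scheme, every decode step has to re-rotate the entire cached key tensor before attention, whereas today the rotated keys are written to the paged cache once and reused.

```python
import torch

def naive_streaming_decode_step(q, k_cache_raw, v_cache, rope_fn):
    """One decode step with pre-RoPE keys (illustrative, not vLLM code).

    q:           [num_heads, head_dim]              query of the new token, pre-RoPE
    k_cache_raw: [cache_len, num_heads, head_dim]   keys straight from the QKV projection
    v_cache:     [cache_len, num_heads, head_dim]
    rope_fn:     assumed helper that applies rotary embedding to a tensor
                 shaped [seq, num_heads, head_dim] given positions [seq]
    """
    cache_len = k_cache_raw.shape[0]
    positions = torch.arange(cache_len)

    # Extra work on *every* step: re-apply RoPE to all cached keys at their
    # current in-cache positions. With rotated keys cached (the existing
    # layout), this rotation is done once per token, not once per step.
    k_rot = rope_fn(k_cache_raw, positions)
    q_rot = rope_fn(q.unsqueeze(0), positions[-1:]).squeeze(0)  # new token at the end

    scores = torch.einsum("hd,lhd->hl", q_rot, k_rot) / q.shape[-1] ** 0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("hl,lhd->hd", probs, v_cache)           # [num_heads, head_dim]
```

On top of that recomputation, eviction in a paged layout means either copying blocks or maintaining an indirection table so the kernel still sees tokens in their relative order.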

Hi Team @WoosukKwon @zhuohan123, do you guys have any thoughts/hints about how to gracefully integrate this feature?

halixness commented 1 month ago

+1