vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Automatic Prefix Caching and Truncating. Possibilty for Context Shifting. #7560

Open derpyhue opened 1 month ago

derpyhue commented 1 month ago

🚀 The feature, motivation and pitch

Currently, when using Automatic Prefix Caching, truncating the input (for chat-style generation) because of the context limit invalidates the cache: the first block of the truncated input no longer matches the first cached block. If it were possible to context-shift like llama.cpp does, it would eliminate the long TTFT I see when using a 24k context length.

Would this be possible to implement? Thank you for your time!

Alternatives

No response

Additional context

The hash is calculated from the token IDs in a block and the hash of the previous block. When the input gets truncated due to the context length, the token IDs in each block shift, which in turn changes every content hash. This causes the cache to be invalidated.
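For illustration, here is a minimal sketch of that chained hashing (block size and hashing details are simplified and are not vLLM's actual implementation). It shows why dropping tokens from the front of the prompt changes every block hash and forces a full cache miss:

```python
import hashlib

BLOCK_SIZE = 4  # illustrative only; vLLM's real block size is larger

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with the hash of the previous block."""
    hashes: list[str] = []
    prev = ""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks are cached
    for start in range(0, usable, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256((prev + str(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes

history = list(range(32))   # stand-in for a long chat history
truncated = history[8:]     # oldest tokens dropped to fit the context limit

# None of the truncated prompt's block hashes match the cached ones,
# even though most of the underlying tokens are identical.
print(set(block_hashes(history)) & set(block_hashes(truncated)))  # set()
```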

derpyhue commented 1 month ago
                    Block 1                  Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|

https://docs.vllm.ai/en/latest/automatic_prefix_caching/details.html

The current system is very handy, but I don't see a way to make it work with truncated inputs: it needs the root block to stay unchanged, otherwise every following block's hash changes too and the cache becomes invalid. Maybe it could truncate on a block basis and shift block 2 to block 1?
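As a thought experiment, a block-aligned truncation could in principle be handled by re-keying the surviving cache entries under the hashes the truncated prompt will produce. The sketch below is purely hypothetical (the `cache` dict and the re-keying step are illustrative, not vLLM's internals), and it ignores that the cached keys/values also encode token positions, which a real context shift would have to correct as well:

```python
import hashlib

BLOCK_SIZE = 4  # illustrative only

def chain_hash(prev: str, block: list[int]) -> str:
    return hashlib.sha256((prev + str(block)).encode()).hexdigest()

def block_hashes(token_ids: list[int]) -> list[str]:
    hashes, prev = [], ""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, usable, BLOCK_SIZE):
        prev = chain_hash(prev, token_ids[start:start + BLOCK_SIZE])
        hashes.append(prev)
    return hashes

tokens = list(range(16))
cache = {h: f"kv-block-{i}" for i, h in enumerate(block_hashes(tokens))}

# Truncate on a block boundary: drop the oldest block, "shifting" block 2 to block 1.
shifted = tokens[BLOCK_SIZE:]

# Without re-keying, every lookup misses even though the blocks are still cached.
assert not any(h in cache for h in block_hashes(shifted))

# Hypothetical re-keying: map the surviving blocks onto the hashes the
# truncated prompt will actually produce, so the next request can hit them.
surviving = block_hashes(tokens)[1:]
rekeyed = {new: cache[old] for old, new in zip(surviving, block_hashes(shifted))}
print(all(h in rekeyed for h in block_hashes(shifted)))  # True
```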

I'll keep this open in case someone has an idea. I will try to experiment, but I'm afraid it currently wouldn't be an option.