Open derpyhue opened 1 month ago
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
https://docs.vllm.ai/en/latest/automatic_prefix_caching/details.html
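The diagram can be read as: each block's cache key effectively covers the entire prefix plus that block's own tokens, so two requests share a cached block only when everything before it is identical. A minimal sketch of that idea (illustrative only, not vLLM's actual implementation; `cache_key` is a hypothetical helper):

```python
import hashlib

def cache_key(prefix_tokens, block_tokens):
    # Per the linked docs, a block's key is derived from the full
    # prefix plus the block's own tokens, so identical prefixes map
    # to the same cache entry.
    data = repr((tuple(prefix_tokens), tuple(block_tokens))).encode()
    return hashlib.sha256(data).hexdigest()

BLOCK = 4  # illustrative block size
tokens = list(range(12))  # 3 full blocks
keys = [
    cache_key(tokens[:i], tokens[i:i + BLOCK])
    for i in range(0, len(tokens), BLOCK)
]
```

With this keying, changing any token in the prefix changes the keys of every block after it, which is the behavior the rest of this issue is about.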
The current system is very handy, but I don't see a way to make it work with truncated tokens, since it requires the root block to stay unchanged; if it changes, every later block changes too, invalidating the cache. Maybe truncation could happen on a block basis, shifting block 2 into block 1's position?
I will keep this open in case someone has an idea. I will try to experiment, but I'm afraid it is currently not an option.
🚀 The feature, motivation and pitch
Currently, when the input is truncated because of the context limit (for chat-style generation), Automatic Prefix Caching invalidates the cache, because the first block no longer matches the first block of the truncated input. If it were possible to context-shift the way llama.cpp does, it would eliminate the long TTFT I see when using a 24k context length.
Would this be possible to implement? Thank you for your time!
Alternatives
No response
Additional context
The hash is calculated based on the token IDs in the block and the hash of the previous block. When the input gets truncated due to the context length, the token IDs in the block change, which in turn changes the content hash. This causes the cache to be invalidated.
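This chained invalidation can be demonstrated with a toy version of the scheme (illustrative only, not vLLM's actual code): truncating even one leading token changes the first block's contents, and the chained hash propagates that change to every later block. Notably, even a block-aligned truncation does not help, because the new first block's previous-block hash differs from what it was before the shift.

```python
import hashlib

def chained_block_hashes(token_ids, block_size=4):
    # Each block's hash covers its token IDs plus the previous
    # block's hash, mirroring the scheme described above.
    hashes, prev = [], ""
    n_full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, n_full, block_size):
        block = tuple(token_ids[i:i + block_size])
        h = hashlib.sha256((prev + repr(block)).encode()).hexdigest()
        hashes.append(h)
        prev = h
    return hashes

full = chained_block_hashes(list(range(16)))
truncated = chained_block_hashes(list(range(1, 16)))   # drop 1 leading token
aligned = chained_block_hashes(list(range(4, 16)))     # drop a whole block
# `truncated` shares no hashes with `full`; and even in the
# block-aligned case, aligned[0] != full[1], because the shifted
# block now has an empty previous-block hash.
```

This is why simply "shifting block 2 into block 1" is not enough under the current hashing: the cache key of the shifted block would still change, so a context-shift feature would need some form of position-independent (or re-keyable) block identity.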