[Open] dahwin opened this issue 2 months ago
Yep, this is currently being considered and you can track its development in this (as of now) draft PR https://github.com/vllm-project/vllm/pull/8334.
"Thank you for the update, NickLucche. Do you have any estimated timeframe for when DynamicCache might be available for use in vLLM? Even a rough estimate would be helpful for planning purposes."
Sorry, I don't know, as I'm not involved in the PR and I'm not a maintainer. But I can say the proposal and draft look very promising.
Is this not the same as automatic prefix caching?
https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html
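For context, here is a minimal sketch of what automatic prefix caching looks like on the vLLM side; the model name and prompts are placeholders, not taken from this thread:

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching reuses the KV cache of any shared prompt prefix
# across requests, so repeated conversation history is not recomputed.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

history = "System: You are a helpful assistant.\nUser: Summarize the report.\n"
params = SamplingParams(temperature=0.0, max_tokens=64)

# The second call shares the `history` prefix with the first, so its KV
# blocks can be served from the prefix cache instead of being recomputed.
print(llm.generate([history + "Assistant:"], params)[0].outputs[0].text)
print(llm.generate([history + "User: Now list the key risks.\nAssistant:"], params)[0].outputs[0].text)
```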
Maybe, yes.
I will check later.
By using DynamicCache, the LLM doesn't need to recompute the previous prompt; it can reuse the previous prompt's KV cache.
In Gemini this is called context caching, and in Anthropic it's called prompt caching.
DynamicCache is a mechanism used to store and reuse the intermediate computations (key-value pairs) from previous iterations of the model's attention layers. This is particularly useful in scenarios where you're generating multiple responses in a conversation or processing a stream of related inputs.
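As a rough illustration of that reuse at the forward-pass level (the model and prompt below are arbitrary; in recent Transformers versions `past_key_values` is a DynamicCache instance):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")

# First pass: attention keys/values for the whole prompt are computed once
# and returned in the cache.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
cache = out.past_key_values

# Next step: only the newly sampled token is fed; the prompt's keys/values
# come from the cache instead of being recomputed.
next_token = out.logits[:, -1:].argmax(dim=-1)
with torch.no_grad():
    out = model(input_ids=next_token, past_key_values=cache, use_cache=True)
```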
Can I use the DynamicCache mechanism in vLLM?
I'm currently working with large language models and have been using the DynamicCache feature from the Hugging Face Transformers library for efficient multi-turn conversations. I'm interested in potentially using vLLM for its performance benefits, but I have a question about feature parity:
Does vLLM currently support an equivalent to the DynamicCache functionality?
If not, is this a feature that's on the roadmap or being considered for future implementation?
Context: The primary benefit of DynamicCache is that it allows the model to avoid recomputing attention for previous prompts in a conversation, which significantly improves efficiency in multi-turn interactions. For reference, DynamicCache is used in the Transformers library as in the example below.
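The original snippet isn't reproduced here; the following is a minimal sketch of the multi-turn pattern described in the Transformers docs, with the model name and messages as placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# One cache object is kept alive across turns, so key/value tensors from
# earlier turns are reused instead of being recomputed.
past_key_values = DynamicCache()
messages = []

for user_text in ["Hello, who are you?", "Summarize what you just said."]:
    messages.append({"role": "user", "content": user_text})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(model.device)
    out = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=64)
    reply = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})
```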
Outputs: