vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Is DynamicCache supported in vllm? #8507

Open dahwin opened 6 days ago

dahwin commented 6 days ago

By using DynamicCache, the LLM doesn't need to recompute the previous prompt; it can reuse the previous prompt's KV cache.

In Gemini this is called context caching, and in Anthropic it's called prompt caching.

DynamicCache is a mechanism used to store and reuse the intermediate computations (key-value pairs) from previous iterations of the model's attention layers. This is particularly useful when you're generating multiple responses in a conversation or processing a stream of related inputs.

Can I use the DynamicCache mechanism in vLLM?

I'm currently working with large language models and have been using the DynamicCache feature from the Hugging Face Transformers library for efficient multi-turn conversations. I'm interested in potentially using vLLM for its performance benefits, but I have a question about feature parity:

Does vLLM currently support an equivalent to the DynamicCache functionality?

If not, is this a feature that's on the roadmap or being considered for future implementation? Context: the primary benefit of DynamicCache is that it allows the model to avoid recomputing attention for previous prompts in a conversation, which significantly improves efficiency in multi-turn interactions. For reference, here is how DynamicCache is used in the Transformers library:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.cache_utils import DynamicCache
import time

model_id = "Qwen/Qwen2-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id)

user_prompts = ["Give me a short introduction to large language models. Under 20 words.", 
                "Can you elaborate on that? Under 20 words! Answer always in English.",
               'this branch of ml is my favorite']

# A single DynamicCache instance is reused across turns so earlier tokens are not recomputed
past_key_values = DynamicCache()

messages = []
for prompt in user_prompts:
    # Time each turn; later turns reuse the KV cache built on earlier turns
    start_time = time.time()
    messages.append({"role": "user", "content": prompt})
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

    # Debug print
    print("Inputs shape:", inputs.shape)

    input_length = inputs.shape[1]  # apply_chat_template returns a tensor of input IDs

    outputs = model.generate(
        inputs,
        do_sample=False,
        max_new_tokens=256,
        past_key_values=past_key_values,  # reuse the KV cache from previous turns
    )

    # Debug print
    print("Outputs shape:", outputs.shape)

    completion = tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": completion})
    print(f"User: {prompt}")
    print(f"Assistant: {completion}\n")

    end_time = time.time()
    print(f"Total time: {end_time - start_time:.2f} seconds")

Output:

User: Give me a short introduction to large language models. Under 20 words.
Assistant: Large language models (LLMs) are AI algorithms that process and generate human-like text based on vast training datasets.

Total time: 6.93 seconds

User: Can you elaborate on that? Under 20 words! Answer always in English.
Assistant: LLMs use complex algorithms to analyze patterns, learn context, and generate coherent text based on massive training datasets.

Total time: 1.93 seconds

User: This branch of ML is my favorite
Assistant: That's great to hear. Large language models have the potential to revolutionize many areas, from natural language processing to content creation and more.

Total time: 2.30 seconds

NickLucche commented 6 days ago

Yep, this is currently being considered and you can track its development in this (as of now) draft PR https://github.com/vllm-project/vllm/pull/8334.

dahwin commented 6 days ago

"Thank you for the update, NickLucche. Do you have any estimated timeframe for when DynamicCache might be available for use in vLLM? Even a rough estimate would be helpful for planning purposes."

NickLucche commented 6 days ago

Sorry, I don't know, as I'm not involved in the PR and I'm not a maintainer. But I can say the proposal and draft look very promising.

hmellor commented 2 days ago

Is this not the same as automatic prefix caching?

https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html
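
As a rough illustration of the automatic prefix caching route linked above (not code from this thread): with enable_prefix_caching=True, vLLM reuses KV-cache blocks for any prompt prefix it has already computed, so a multi-turn loop like the one in the original post can avoid re-prefilling earlier turns. A minimal sketch, assuming the same Qwen model and that the chat template yields an identical token prefix across turns:

import time
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# enable_prefix_caching asks vLLM to reuse KV blocks for previously seen prompt prefixes
llm = LLM(model=model_id, enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

messages = []
for prompt in ["Give me a short introduction to large language models. Under 20 words.",
               "Can you elaborate on that? Under 20 words! Answer always in English."]:
    messages.append({"role": "user", "content": prompt})
    # Re-render the whole conversation; earlier turns form a cached prefix of this prompt
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    start = time.time()
    completion = llm.generate([text], sampling_params)[0].outputs[0].text
    messages.append({"role": "assistant", "content": completion})
    print(f"User: {prompt}\nAssistant: {completion}\nTotal time: {time.time() - start:.2f} seconds\n")

Whether this matches DynamicCache semantics exactly is what the discussion above is about, but for plain multi-turn reuse of a shared prefix it serves the same purpose.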