vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: When generating multiple answers of the same prompt? #10099

Open · liudan193 opened this issue 2 weeks ago

liudan193 commented 2 weeks ago

🚀 The feature, motivation and pitch

When generating multiple answers for the same prompt, will vLLM cache the prompt's KV cache to speed things up? Could you share more technical details? Thanks!

Alternatives

No response

Additional context

No response

Before submitting a new issue...

liudan193 commented 2 weeks ago

I found some related content at https://docs.vllm.ai/en/latest/automatic_prefix_caching/details.html, which states: "For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation."

But that passage only mentions the "system prompt"; what about the user prompt?
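
For reference, this is roughly how I enable automatic prefix caching with the offline LLM API (the model name below is just a placeholder for my actual model); what I want to confirm is whether a repeated user prompt, and not only the system prompt, can hit this cache:

from vllm import LLM, SamplingParams

# Enable automatic prefix caching so shared leading tokens can reuse
# their KV-cache blocks across requests (placeholder model name).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a helpful assistant.\n\nQuestion: ..."
sampling_params = SamplingParams(temperature=1.2, top_p=0.90, max_tokens=2048)

# If prefix caching also covers the user prompt, the second call should
# reuse the cached KV blocks for shared_prefix.
outputs_a = llm.generate([shared_prefix + " Variant A"], sampling_params)
outputs_b = llm.generate([shared_prefix + " Variant B"], sampling_params)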

liudan193 commented 2 weeks ago

Specifically, I want to know the details of the n parameter (n=args.generate_num). Will vLLM re-compute the prompt for each of the n samples?

from vllm import SamplingParams

# n=args.generate_num asks for multiple completions of the same prompt.
sampling_params = SamplingParams(
    temperature=1.2, top_p=0.90, top_k=20, max_tokens=2048,
    repetition_penalty=1.2, n=args.generate_num, stop=stop_token,
)
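
For completeness, the surrounding call looks roughly like the sketch below (the model name and prompt are placeholders); my question is whether the prompt is prefilled once per request or re-computed for each of the n samples.

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# One request that asks for n completions of the same prompt.
outputs = llm.generate(["<my prompt>"], sampling_params)
for request_output in outputs:
    # request_output.outputs holds the n sampled completions for this prompt.
    for completion in request_output.outputs:
        print(completion.text)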