Open · liudan193 opened this issue 2 weeks ago
I see some related content in https://docs.vllm.ai/en/latest/automatic_prefix_caching/details.html, which states: "For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation."
But this only mentions the system prompt; what about the user prompt?
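To make the question concrete, here is roughly the setup I have in mind (a minimal sketch only; the model name, prompts, and sampling settings are placeholders, not my actual workload):

```python
from vllm import LLM, SamplingParams

# Placeholder model; enable_prefix_caching turns on automatic prefix caching.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# Two requests that share a long *user-prompt* prefix, not just a system prompt.
shared_user_prefix = "Here is a long document ...\n\n"  # placeholder text
prompts = [
    shared_user_prefix + "Question 1: ...",
    shared_user_prefix + "Question 2: ...",
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```

My reading of the docs is that the cached prefix is matched on tokens, so it should not matter whether the shared part is a system prompt or a user prompt, but I would like confirmation.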
Specifically, I want to know some details about the parameter `n=args.generate_num` in the snippet below. Will vLLM re-compute the prompt for each of the n samples?
```python
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=1.2, top_p=0.90, top_k=20, max_tokens=2048,
    repetition_penalty=1.2, n=args.generate_num, stop=stop_token,
)
```
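For reference, the call then looks roughly like this (a sketch with placeholder values; `facebook/opt-125m` and `n=4` stand in for my real model and `args.generate_num`):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

sampling_params = SamplingParams(
    temperature=1.2, top_p=0.90, top_k=20, max_tokens=2048,
    repetition_penalty=1.2, n=4,  # n=4 stands in for args.generate_num
)

# A single request asking for n completions of the same prompt.
outputs = llm.generate(["My user prompt ..."], sampling_params)
assert len(outputs) == 1        # one prompt -> one RequestOutput
print(len(outputs[0].outputs))  # n completions for that prompt
```

What I would like to confirm is whether the prompt is prefilled once and its KV cache shared by all n sequences of that request, or prefilled separately for each sample.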
🚀 The feature, motivation and pitch
When generating multiple answers for the same prompt, will vLLM store the prompt's KV cache to speed this up? Could you share more technical details? Thanks!
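If it helps, this is the kind of rough check I could run myself, though I would prefer to understand the intended behaviour (a sketch only; the model name and prompt are placeholders, and wall-clock time is a crude proxy for whether the prefill is skipped):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
prompt = "A fairly long user prompt ... " * 100  # placeholder long prompt
params = SamplingParams(temperature=1.2, max_tokens=16, n=4)

t0 = time.perf_counter()
llm.generate([prompt], params)  # first call: prompt must be prefilled
t1 = time.perf_counter()
llm.generate([prompt], params)  # identical prompt: prefix may come from cache
t2 = time.perf_counter()

print(f"first call : {t1 - t0:.2f}s")
print(f"second call: {t2 - t1:.2f}s")
```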
Alternatives
No response
Additional context
No response
Before submitting a new issue...