vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Prefix cache with prompts dedupe #7414

Open zhengy001 opened 3 months ago

zhengy001 commented 3 months ago

Your current environment

The output of `python collect_env.py`:

```text
Your output of `python collect_env.py` here
```

🐛 Describe the bug

Hi vLLM experts,

This might not be a bug, unless I'm missing something.

If two identical new prompts are submitted at the same time, no identical prompt has been seen before, so there are zero cache hits. BlockSpaceManagerV1 will then allocate the same blocks for both, because their prefixes hash to the same values. Where is the logic that dedupes the computation? If there is none, will reshape_and_cache update the same KV cache slots simultaneously and lead to a concurrency issue?
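To make the scenario concrete, here is a minimal sketch (my own illustration, not code from the issue) that submits two identical, previously unseen prompts in one batch with prefix caching enabled; the model name and prompt text are placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any model works for illustrating the scenario.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=16)

# Two identical prompts that have never been seen before: both miss the
# prefix cache, and their prefix blocks hash to the same values, so the
# block manager maps them to the same physical KV cache blocks.
prompts = ["The quick brown fox jumps over the lazy dog"] * 2
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```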

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!