vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Prefix cache with prompts dedupe #7414

Open zhengy001 opened 3 months ago

zhengy001 commented 3 months ago

Your current environment

The output of `python collect_env.py`:

```text
Your output of `python collect_env.py` here
```

🐛 Describe the bug

Hi vLLM experts,

This might not be a bug, unless I'm missing something.

If two identical new prompts are submitted at the same time, no identical prompt has been seen before, so there are zero cache hits. BlockSpaceManagerV1 will then allocate the same blocks for both, because their prefixes hash to the same values. Where is the logic that dedupes the computation? If there is none, will reshape_and_cache update the same KV cache slots simultaneously and lead to a concurrency issue?
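To make the scenario concrete, here is a minimal sketch (my own illustration, not code from the issue) that submits two identical, previously unseen prompts in one batch with prefix caching enabled; the model name and prompt text are placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any model works for illustrating the scenario.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=16)

# Two identical prompts that have never been seen before: both miss the
# prefix cache, and their prefix blocks hash to the same values, so the
# block manager maps them to the same physical KV cache blocks.
prompts = ["The quick brown fox jumps over the lazy dog"] * 2
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```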

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!