vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Prefix Caching in BlockSpaceManagerV1 and BlockSpaceManagerV2 Increases Time to First Token (TTFT) and Slows Down System #6923

Open llsj14 opened 1 month ago

llsj14 commented 1 month ago

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.6
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.114.2.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.129.03

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.6.0
[pip3] numpy==1.22.0
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.14.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.3.1
[pip3] torch-tensorrt==0.0.0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: v0.5.3
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

When prefix caching is enabled with BlockSpaceManagerV2, the system slows down, most noticeably in time to first token (TTFT). The tables below show the results of my experiments with the Llama3 8B and 70B models. I initially suspected the tensor parallel logic, but since the 8B model, which runs on a single GPU, also shows the slowdown, I believe the cause lies elsewhere. (For comparison, prefix caching with BlockSpaceManagerV1 showed a 1.3-1.5x speedup in TTFT under the same experiment settings.)
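For reference, the two features being compared are toggled through vLLM engine arguments. The following is a minimal offline sketch, not the exact benchmark harness used for the tables below; the model path, prompt, and sampling parameters are placeholders:

```python
from vllm import LLM, SamplingParams

# Hypothetical reproduction sketch: the engine arguments mirror the table
# columns (block manager v2, prefix caching on/off). Model path and prompt
# are placeholders, not the actual benchmark configuration.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    tensor_parallel_size=1,        # 4 for the 70B runs
    use_v2_block_manager=True,     # BlockSpaceManagerV2
    enable_prefix_caching=True,    # set False for the baseline rows
)

sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["<shared prefix> + user request"], sampling)
print(outputs[0].outputs[0].text)
```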

Experiment Settings

Llama 70B result

| Model | Num GPUs | Num Clients | Block Manager | Prefix Caching | TTFT (mean) |
|------------|----------|-------------|---------------|----------------|-------------|
| Llama3-70B | 4 | 16 | v2 | off | 2415 ms |
| Llama3-70B | 4 | 32 | v2 | off | 4179 ms |
| Llama3-70B | 4 | 64 | v2 | off | 7670 ms |
| Llama3-70B | 4 | 128 | v2 | off | 12883 ms |
| Llama3-70B | 4 | 16 | v2 | on | 2755 ms |
| Llama3-70B | 4 | 32 | v2 | on | 4652 ms |
| Llama3-70B | 4 | 64 | v2 | on | 14344 ms |
| Llama3-70B | 4 | 128 | v2 | on | 25500 ms |

Llama 8B result

| Model | Num GPUs | Num Clients | Block Manager | Prefix Caching | TTFT (mean) |
|-----------|----------|-------------|---------------|----------------|-------------|
| Llama3-8B | 1 | 16 | v2 | off | 841 ms |
| Llama3-8B | 1 | 32 | v2 | off | 1441 ms |
| Llama3-8B | 1 | 64 | v2 | off | 2619 ms |
| Llama3-8B | 1 | 128 | v2 | off | 4729 ms |
| Llama3-8B | 1 | 16 | v2 | on | 1962 ms |
| Llama3-8B | 1 | 32 | v2 | on | 8382 ms |
| Llama3-8B | 1 | 64 | v2 | on | 12665 ms |
| Llama3-8B | 1 | 128 | v2 | on | 22439 ms |
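For context on the TTFT (mean) column: in a serving setup, TTFT is typically measured as the delay between sending a streaming request and receiving the first streamed chunk. Below is a minimal probe against vLLM's OpenAI-compatible server; the host, port, model name, and prompt are assumptions for illustration, not the actual benchmark client used above:

```python
import time

import requests

# Hypothetical TTFT probe: stream a completion and record the delay until the
# first SSE line arrives. Endpoint and model name are placeholders.
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "<shared prefix> + user request",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # first non-empty streamed line marks the first token
            ttft_ms = (time.perf_counter() - start) * 1000
            print(f"TTFT: {ttft_ms:.1f} ms")
            break
```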
Ximingwang-09 commented 1 month ago

I met the same problem when enabling prefix caching in BlockSpaceManagerV2 ...