vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Prefix Caching in BlockSpaceManagerV1 and BlockSpaceManagerV2 Increases Time to First Token (TTFT) and Slows Down System #6923

Open llsj14 opened 1 month ago

llsj14 commented 1 month ago

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.6
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.114.2.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.129.03

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.6.0
[pip3] numpy==1.22.0
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.14.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.3.1
[pip3] torch-tensorrt==0.0.0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: v0.5.3
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

When prefix caching is enabled with BlockSpaceManagerV2, the system slows down, most noticeably in time to first token (TTFT). The tables below show the results of my experiments with the Llama3 8B and 70B models. I initially suspected the tensor parallel logic, but since the 8B model, which runs on a single GPU, also shows the slowdown, I believe the cause lies elsewhere. (For comparison, prefix caching with BlockSpaceManagerV1 showed a 1.3-1.5x speedup in TTFT under the same experiment settings.)
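For reference, the two features being compared are toggled through vLLM engine arguments. The following is a minimal offline sketch, not the exact benchmark harness used for the tables below; the model path, prompt, and sampling parameters are placeholders:

```python
from vllm import LLM, SamplingParams

# Hypothetical reproduction sketch: the engine arguments mirror the table
# columns (block manager v2, prefix caching on/off). Model path and prompt
# are placeholders, not the actual benchmark configuration.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    tensor_parallel_size=1,        # 4 for the 70B runs
    use_v2_block_manager=True,     # BlockSpaceManagerV2
    enable_prefix_caching=True,    # set False for the baseline rows
)

sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["<shared prefix> + user request"], sampling)
print(outputs[0].outputs[0].text)
```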

Experiment Settings

Llama 70B result

| Model | Num GPUs | Num Clients | Block Manager | Prefix Caching | TTFT (mean) |
|------------|----------|-------------|---------------|----------------|-------------|
| Llama3-70B | 4 | 16 | v2 | off | 2415 ms |
| Llama3-70B | 4 | 32 | v2 | off | 4179 ms |
| Llama3-70B | 4 | 64 | v2 | off | 7670 ms |
| Llama3-70B | 4 | 128 | v2 | off | 12883 ms |
| Llama3-70B | 4 | 16 | v2 | on | 2755 ms |
| Llama3-70B | 4 | 32 | v2 | on | 4652 ms |
| Llama3-70B | 4 | 64 | v2 | on | 14344 ms |
| Llama3-70B | 4 | 128 | v2 | on | 25500 ms |

Llama 8B result

| Model | Num GPUs | Num Clients | Block Manager | Prefix Caching | TTFT (mean) |
|-----------|----------|-------------|---------------|----------------|-------------|
| Llama3-8B | 1 | 16 | v2 | off | 841 ms |
| Llama3-8B | 1 | 32 | v2 | off | 1441 ms |
| Llama3-8B | 1 | 64 | v2 | off | 2619 ms |
| Llama3-8B | 1 | 128 | v2 | off | 4729 ms |
| Llama3-8B | 1 | 16 | v2 | on | 1962 ms |
| Llama3-8B | 1 | 32 | v2 | on | 8382 ms |
| Llama3-8B | 1 | 64 | v2 | on | 12665 ms |
| Llama3-8B | 1 | 128 | v2 | on | 22439 ms |
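For context on the TTFT (mean) column: in a serving setup, TTFT is typically measured as the delay between sending a streaming request and receiving the first streamed chunk. Below is a minimal probe against vLLM's OpenAI-compatible server; the host, port, model name, and prompt are assumptions for illustration, not the actual benchmark client used above:

```python
import time

import requests

# Hypothetical TTFT probe: stream a completion and record the delay until the
# first SSE line arrives. Endpoint and model name are placeholders.
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "<shared prefix> + user request",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # first non-empty streamed line marks the first token
            ttft_ms = (time.perf_counter() - start) * 1000
            print(f"TTFT: {ttft_ms:.1f} ms")
            break
```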
Ximingwang-09 commented 1 month ago

I met the same problem when enabling prefix caching in BlockSpaceManagerV2 ...