vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Seeing perf regression using chunked_prefill on VLLM 0.5.4 #7604

Open jiahanc opened 2 months ago

jiahanc commented 2 months ago

Your current environment

Docker container built following the build-from-source instructions

How would you like to use vllm

I want to experiment with how chunked_prefill can increase throughput, but I see a perf regression when I enable it. I am testing Llama-3-70B-Instruct-Gradient-1048k on 8 H100 SXM GPUs, benchmarking with benchmarks/benchmark_throughput.py and tensor parallel size 8. Is there any guidance on how to choose max_num_batched_tokens properly? My command is:

python3 ./benchmarks/benchmark_throughput.py \
    --model hf_model_path \
    --tokenizer hf_model_path \
    --tensor-parallel-size 8 \
    --num-prompts 50 \
    --input-len 40000 \
    --output-len 256 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 4096 \
    --trust-remote-code

My benchmark result is in the attached screenshot, which shows the throughput regression with chunked prefill enabled.
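
For reference, roughly the same configuration can be reproduced through the offline LLM Python API. The sketch below is only a stand-in for the benchmark script, under stated assumptions: hf_model_path is the same placeholder as in the command above, and the synthetic prompts only approximate --input-len 40000.

    # Minimal sketch, not the benchmark script: same engine settings via the offline API.
    # "hf_model_path" is the placeholder model path from the command above.
    import time

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="hf_model_path",
        tokenizer="hf_model_path",
        tensor_parallel_size=8,
        enable_chunked_prefill=True,
        max_num_batched_tokens=4096,
        trust_remote_code=True,
    )

    # 50 synthetic long prompts standing in for --num-prompts 50 / --input-len 40000
    # (word count only roughly approximates the token count).
    prompts = ["hello " * 40_000 for _ in range(50)]
    sampling = SamplingParams(max_tokens=256, ignore_eos=True)

    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    generated = sum(len(out.outputs[0].token_ids) for out in outputs)
    print(f"{generated / elapsed:.1f} generated tokens/s in {elapsed:.1f} s")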

KuntaiDu commented 2 months ago

The rationale for picking max_num_batched_tokens: make it as large as possible, as long as your inter-token latency target permits.
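
One straightforward way to apply this advice is to rerun the exact benchmark command from the issue over a range of values and keep the largest setting whose latency is still acceptable. A rough sweep sketch, reusing the same placeholder paths and flags:

    # Rough sweep sketch: rerun the throughput benchmark with increasing
    # --max-num-batched-tokens values and compare the reported numbers.
    # Paths and flags mirror the command in this issue; adjust as needed.
    import subprocess

    for max_num_batched_tokens in (4096, 8192, 16384, 32768):
        subprocess.run(
            [
                "python3", "./benchmarks/benchmark_throughput.py",
                "--model", "hf_model_path",
                "--tokenizer", "hf_model_path",
                "--tensor-parallel-size", "8",
                "--num-prompts", "50",
                "--input-len", "40000",
                "--output-len", "256",
                "--enable-chunked-prefill",
                "--max-num-batched-tokens", str(max_num_batched_tokens),
                "--trust-remote-code",
            ],
            check=True,
        )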

KuntaiDu commented 2 months ago

Also, chunked prefill does not improve offline throughput. The main goals of chunked prefill are:

  1. improve throughput with respect to a latency target
  2. reduce peak memory usage when handling long contexts (see the chunking sketch below)
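
To make this concrete with the numbers from the issue: a 40,000-token prompt with max_num_batched_tokens=4096 is prefilled over roughly ten scheduler steps instead of one huge batch, which is what bounds the peak memory per step. A back-of-envelope sketch; it only counts chunks, while the actual scheduler also interleaves decode tokens into the same token budget.

    # Back-of-envelope: number of prefill chunks for one 40k-token prompt,
    # assuming each step consumes at most max_num_batched_tokens of the budget.
    import math

    prompt_len = 40_000
    max_num_batched_tokens = 4096
    print(math.ceil(prompt_len / max_num_batched_tokens))  # -> 10 prefill steps
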
jiahanc commented 2 months ago

Thank you @KuntaiDu, I will experiment with larger max_num_batched_tokens values.