jiahanc opened this issue 2 months ago
The rationale for picking max_batched_tokens: make it as large as possible, as long as your inter-token latency permits.
Also, chunked prefill does not improve offline throughput. Its main goal is to keep inter-token latency low and stable by letting decode requests be batched together with chunks of long prefills, rather than to raise batch throughput.
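As a concrete reference, here is a minimal sketch of where these knobs live in vLLM's offline Python API. The Hugging Face model id and the 8192-token budget are illustrative assumptions, not recommendations:

```python
from vllm import LLM, SamplingParams

# A larger max_num_batched_tokens lets the scheduler pack more prefill work
# into each step (better throughput), at the cost of longer inter-token
# latency for in-flight decodes. Values below are illustrative.
llm = LLM(
    model="gradientai/Llama-3-70B-Instruct-Gradient-1048k",  # assumed HF id
    tensor_parallel_size=8,
    enable_chunked_prefill=True,       # split long prefills and mix them with decodes
    max_num_batched_tokens=8192,       # token budget per scheduler step
)

outputs = llm.generate(
    ["Summarize the trade-off between throughput and latency in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```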
Thank you @KuntaiDu, I will experiment with a larger max_batched_tokens.
Your current environment
How would you like to use vllm
I want to experiment with how chunked_prefill can increase throughput, but I see a performance regression when enabling it. I am testing Llama-3-70B-Instruct-Gradient-1048k on 8 H100 SXM GPUs with tensor parallel size 8, benchmarking with
/benchmark/benchmark_throughput.py

My command is:

My benchmark result is:

Is there any guidance on how to choose max_batched_tokens properly?
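For reference, the kind of measurement the throughput benchmark performs can be sketched as a small self-contained script using vLLM's offline API. The prompt set, request count, model id, and max_num_batched_tokens value below are illustrative assumptions, not the command or results from this issue:

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical workload: 256 identical prompts, 128 generated tokens each.
prompts = ["Explain how KV-cache paging works in a serving engine."] * 256
sampling = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

llm = LLM(
    model="gradientai/Llama-3-70B-Instruct-Gradient-1048k",  # assumed HF id
    tensor_parallel_size=8,
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,  # the knob being tuned; illustrative value
)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Tally prompt and generated tokens to report throughput in tokens/second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
prompt_tokens = sum(len(o.prompt_token_ids) for o in outputs)
print(f"{elapsed:.1f} s, "
      f"{(prompt_tokens + generated) / elapsed:.1f} total tok/s, "
      f"{generated / elapsed:.1f} output tok/s")
```

Rerunning a sketch like this with different max_num_batched_tokens values (with chunked prefill on and off) is one way to see how the token budget trades throughput against inter-token latency on a given setup.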