vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: the performance with chunked-prefill-enabled is lower than default #6150

Open BestKuan opened 4 months ago

BestKuan commented 4 months ago

I tested vLLM's benchmark_throughput.py and found that throughput with chunked prefill enabled is lower than with the default settings. How can I deal with this problem?


Your current environment (if you think it is necessary)

export CUDA_VISIBLE_DEVICES=0
python3 ./benchmarks/benchmark_throughput.py \
    --model /home/workspace/chatglm3-6b/ \
    --tokenizer /home/workspace/chatglm3-6b/ \
    --num-prompts 16 \
    --input-len 1024 \
    --output-len 256 \
    --enable-chunked-prefill \
    --trust-remote-code 

Are these parameters set correctly?
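Not an authoritative answer, but chunked prefill throughput is usually sensitive to the per-step token budget, and with only 16 prompts there is little concurrency for chunking to exploit. A variant of the command above that sets the budget explicitly (I am assuming benchmark_throughput.py exposes --max-num-batched-tokens in your version; please verify before relying on it):

export CUDA_VISIBLE_DEVICES=0
python3 ./benchmarks/benchmark_throughput.py \
    --model /home/workspace/chatglm3-6b/ \
    --tokenizer /home/workspace/chatglm3-6b/ \
    --num-prompts 16 \
    --input-len 1024 \
    --output-len 256 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 2048 \
    --trust-remote-code

Larger budgets mean bigger prefill chunks (fewer scheduler steps per prompt) at the cost of longer individual steps; sweeping a few values is the usual way to find the crossover for a given model and GPU.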

pipul commented 2 months ago

chunked_prefill_enable = False

INFO 09-01 12:46:11 async_llm_engine.py:268] 7cbe74f5c90c4a95954ae8b87d36a3c6 finished E2E: 0.29664182662963867, TTFT: 0.29621362686157227, TBT: 0.00042819976806640625, TIQ: 0.001392364501953125
INFO 09-01 12:46:15 async_llm_engine.py:268] 9bbc02b5dc904963a915612fc8951d0a finished E2E: 0.29630255699157715, TTFT: 0.2959132194519043, TBT: 0.00038933753967285156, TIQ: 0.0011632442474365234

chunked_prefill_enable = True

INFO 09-01 12:52:55 async_llm_engine.py:268] f4ce2ce1237146b79df1e698d6d70582 finished E2E: 0.3303070068359375, TTFT: 0.32995128631591797, TBT: 0.00035572052001953125, TIQ: 0.0012929439544677734
INFO 09-01 12:53:00 async_llm_engine.py:268] b03a99b525da4bfd8ef6ef1928030a6b finished E2E: 0.3486812114715576, TTFT: 0.3483591079711914, TBT: 0.00032210350036621094, TIQ: 0.0012426376342773438

When chunked prefill is enabled, TTFT goes from ~296 ms to ~330 ms.
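A plausible reading of this, though not a confirmed diagnosis: with chunked prefill the scheduler caps each step at max_num_batched_tokens, so a long prompt's prefill is spread over several steps instead of one, which shows up as higher TTFT when there is little other work in flight. A back-of-the-envelope check for a 1024-token prompt, using 512 and 2048 as assumed example budgets (not values measured from this setup):

# prefill steps needed = ceil(prompt_len / max_num_batched_tokens)
# budget 512  -> ceil(1024 / 512)  = 2 steps
# budget 2048 -> ceil(1024 / 2048) = 1 step
python3 -c 'import math; print(math.ceil(1024/512), math.ceil(1024/2048))'

At higher request concurrency the same chunking is what keeps decode steps interleaved between prefills, which is where chunked prefill is expected to pay off.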

zhaotyer commented 1 month ago

me too!