Open zhurou603 opened 2 months ago
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size 4 \
    --served-model-name qwen_test \
    --quantization awq_marlin \
    --speculative-model-quantization gptq_marlin \
    --speculative_model qwen/Qwen2-7B-Instruct-GPTQ-Int4 \
    --num_speculative_tokens 4 \
    --enable-prefix-caching \
    --speculative-draft-tensor-parallel-size 4 \
    --use-v2-block-manager \
    --port 8000 \
    --max-num-seqs 2 \
    --model qwen/Qwen2-72B-Instruct-AWQ \
    --dtype auto \
    --api-key token-abc123
```
Scoring is expected to be slow here. You are running a 72B target model with a 7B draft model; the scoring step is effectively a batched forward pass of that 72B model.

Can you share the average per-token latency without spec decode? I would guess the value is only slightly smaller than scoring_time_ms.
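For a rough break-even check, the stage times reported below can be compared against the plain per-token latency of the target model. This is a minimal sketch, not vLLM code; `baseline_ms_per_token` is an assumed measurement of the 72B model decoding without speculative decoding (here taken as ~50 ms, slightly below scoring_time_ms):

```python
def break_even_accepted_tokens(proposal_ms_per_tok: float,
                               scoring_ms: float,
                               verification_ms: float,
                               k: int,
                               baseline_ms_per_token: float) -> float:
    """Mean number of tokens each speculative step must emit for spec decode
    to beat plain decoding of the target model."""
    # One speculative step = propose k draft tokens + one scoring pass + verification.
    step_ms = k * proposal_ms_per_tok + scoring_ms + verification_ms
    return step_ms / baseline_ms_per_token


# Assumed baseline of ~50 ms/token for the 72B target; stage times from the log below.
print(break_even_accepted_tokens(4.96, 54.92, 1.20, k=4, baseline_ms_per_token=50.0))
# -> ~1.5, i.e. each step must emit roughly 1.5+ tokens on average to break even.
```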
Proposal to improve performance
No response
Report of performance regression
```text
INFO 09-11 12:41:50 spec_decode_worker.py:790] SpecDecodeWorker stage times: average_time_per_proposal_tok_ms=4.96 scoring_time_ms=54.92 verification_time_ms=1.20
```
The proportion of scoring time in each decode step is too large. The draft model needs only about 5 ms per decode step, but each scoring pass takes about 50 ms.
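Plugging the reported stage times into a per-step breakdown (a back-of-envelope sketch, assuming num_speculative_tokens=4 as in the launch command above):

```python
# Stage times from the SpecDecodeWorker log above.
proposal_ms_per_tok, scoring_ms, verification_ms, k = 4.96, 54.92, 1.20, 4

step_ms = k * proposal_ms_per_tok + scoring_ms + verification_ms  # ~76 ms per speculative step
print(f"proposal:     {k * proposal_ms_per_tok:5.1f} ms ({k * proposal_ms_per_tok / step_ms:4.0%})")
print(f"scoring:      {scoring_ms:5.1f} ms ({scoring_ms / step_ms:4.0%})")
print(f"verification: {verification_ms:5.1f} ms ({verification_ms / step_ms:4.0%})")
# proposal:      19.8 ms ( 26%)
# scoring:       54.9 ms ( 72%)
# verification:   1.2 ms (  2%)
```

Since the scoring pass runs the 72B target over all draft positions in one batched forward, its cost is close to a single decode step of the target model, so it is expected to dominate whenever the draft model is much smaller than the target.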