vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: INFO 09-11 12:41:50 spec_decode_worker.py:790] SpecDecodeWorker scoring_time_ms is slow #8370

Open · zhurou603 opened this issue 2 months ago

zhurou603 commented 2 months ago

Proposal to improve performance

No response

Report of performance regression

INFO 09-11 12:41:50 spec_decode_worker.py:790] SpecDecodeWorker stage times: average_time_per_proposal_tok_ms=4.96 scoring_time_ms=54.92 verification_time_ms=1.20

The proportion of scoring time per decode step is too large. The draft model only needs about 5 ms per decode step, but each scoring pass takes about 50 ms.
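For context, a quick back-of-the-envelope decomposition of the reported stage times (plain arithmetic on the log line above; k = 4 is taken from `--num_speculative_tokens` in the launch command further down):

```python
# Rough per-step decomposition of the reported spec-decode stage times.
k = 4                        # --num_speculative_tokens from the launch command
proposal_per_tok_ms = 4.96   # average_time_per_proposal_tok_ms
scoring_ms = 54.92           # scoring_time_ms
verification_ms = 1.20       # verification_time_ms

proposal_total_ms = k * proposal_per_tok_ms                      # ~19.8 ms for the draft model
step_total_ms = proposal_total_ms + scoring_ms + verification_ms # ~76 ms per speculative step
print(f"proposal: {proposal_total_ms:.1f} ms, scoring: {scoring_ms:.1f} ms, "
      f"total per step: {step_total_ms:.1f} ms")
print(f"scoring share of the step: {scoring_ms / step_total_ms:.0%}")  # ~72%
```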


zhurou603 commented 2 months ago

https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/batch_expansion.py#L46
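For readers following the link: that file holds the batch-expansion scorer (`BatchExpansionTop1Scorer`). A minimal, non-vLLM sketch of the idea behind batch expansion, just to illustrate why scoring costs a full target-model forward pass (`target_forward` is a hypothetical stand-in, not a vLLM API):

```python
# Conceptual sketch of batch-expansion scoring (not the vLLM implementation).
from typing import Callable, List

def score_proposals(
    context: List[int],       # token ids generated so far for one sequence
    proposal: List[int],      # k draft tokens from the speculative model
    target_forward: Callable[[List[List[int]]], List[List[float]]],
) -> List[List[float]]:
    # Expand the sequence into k+1 rows: context, context + 1 draft token, ...
    # Each row asks the target model for the distribution of the next token.
    expanded_batch = [context + proposal[:i] for i in range(len(proposal) + 1)]
    # One batched forward pass of the (large) target model over all rows.
    # With a 72B target model, this pass dominates the step time, which is
    # why scoring_time_ms is expected to be large.
    return target_forward(expanded_batch)
```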

zhurou603 commented 2 months ago

python3 -m vllm.entrypoints.openai.api_server --tensor-parallel-size 4 --served-model-name qwen_test --quantization awq_marlin --speculative-model-quantization gptq_marlin --speculative_model qwen/Qwen2-7B-Instruct-GPTQ-Int4 --num_speculative_tokens 4 --enable-prefix-caching --speculative-draft-tensor-parallel-size 4 --use-v2-block-manager --port 8000 --max-num-seqs 2 --model qwen/Qwen2-72B-Instruct-AWQ --dtype auto --api-key token-abc123
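To compare against a non-speculative run, a minimal client-side latency probe against the server started above (assumes `pip install openai`; the model name, port, and API key match the flags in the command):

```python
# Minimal per-token latency probe against the OpenAI-compatible server above.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

start = time.time()
resp = client.chat.completions.create(
    model="qwen_test",
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    max_tokens=128,
    temperature=0.0,
)
elapsed = time.time() - start
n_tokens = resp.usage.completion_tokens
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {1000 * elapsed / n_tokens:.1f} ms/token")
```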

ShangmingCai commented 2 months ago

Scoring is supposed to be slow. You are running a 72B target model with a 7B draft model; the scoring step is essentially a batched forward pass of that 72B target model.

Can you share the average per-token latency without speculative decoding? I would guess that value is only slightly smaller than scoring_time_ms.
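To put the numbers in perspective, a simplified model of the trade-off (uniform per-token acceptance probability, one bonus/corrected token emitted by the target each step; the baseline per-token latency below is a placeholder until the value asked for above is known):

```python
# Back-of-the-envelope check of when the ~76 ms speculative step pays off.
k = 4
step_ms = k * 4.96 + 54.92 + 1.20   # per-step time from the reported stage times
baseline_ms_per_tok = 55.0          # placeholder: non-speculative per-token latency

for a in (0.5, 0.7, 0.9):           # assumed per-token acceptance probability
    tokens_per_step = sum(a ** i for i in range(k + 1))   # 1 + a + ... + a^k
    effective = step_ms / tokens_per_step
    print(f"acceptance={a:.1f}: ~{effective:.1f} ms/token "
          f"(~{baseline_ms_per_tok / effective:.2f}x vs the assumed baseline)")
```

Under these assumptions the large scoring_time_ms is only a net win when the draft model's acceptance rate is high; otherwise the per-step cost is close to, or worse than, plain decoding of the 72B model.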