Open zhurou603 opened 2 months ago
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size 4 \
    --served-model-name qwen_test \
    --quantization awq_marlin \
    --speculative-model-quantization gptq_marlin \
    --speculative_model qwen/Qwen2-7B-Instruct-GPTQ-Int4 \
    --num_speculative_tokens 4 \
    --enable-prefix-caching \
    --speculative-draft-tensor-parallel-size 4 \
    --use-v2-block-manager \
    --port 8000 \
    --max-num-seqs 2 \
    --model qwen/Qwen2-72B-Instruct-AWQ \
    --dtype auto \
    --api-key token-abc123
```
Scoring is expected to be slow here. You are running a 72B target model with a 7B draft model; the scoring step is effectively a batched forward pass of that 72B model.

Can you share the average per-token latency without spec decode? I would guess the value is only slightly smaller than scoring_time_ms.
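For a rough break-even check, the stage times reported below can be compared against the plain per-token latency of the target model. This is a minimal sketch, not vLLM code; `baseline_ms_per_token` is an assumed measurement of the 72B model decoding without speculative decoding (here taken as ~50 ms, slightly below scoring_time_ms):

```python
def break_even_accepted_tokens(proposal_ms_per_tok: float,
                               scoring_ms: float,
                               verification_ms: float,
                               k: int,
                               baseline_ms_per_token: float) -> float:
    """Mean number of tokens each speculative step must emit for spec decode
    to beat plain decoding of the target model."""
    # One speculative step = propose k draft tokens + one scoring pass + verification.
    step_ms = k * proposal_ms_per_tok + scoring_ms + verification_ms
    return step_ms / baseline_ms_per_token


# Assumed baseline of ~50 ms/token for the 72B target; stage times from the log below.
print(break_even_accepted_tokens(4.96, 54.92, 1.20, k=4, baseline_ms_per_token=50.0))
# -> ~1.5, i.e. each step must emit roughly 1.5+ tokens on average to break even.
```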
Proposal to improve performance
No response
Report of performance regression
```text
INFO 09-11 12:41:50 spec_decode_worker.py:790] SpecDecodeWorker stage times: average_time_per_proposal_tok_ms=4.96 scoring_time_ms=54.92 verification_time_ms=1.20
```
The proportion of scoring time in each decode step is too large. The draft model needs only about 5 ms per decode step, but each scoring pass takes about 50 ms.
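Plugging the reported stage times into a per-step breakdown (a back-of-envelope sketch, assuming num_speculative_tokens=4 as in the launch command above):

```python
# Stage times from the SpecDecodeWorker log above.
proposal_ms_per_tok, scoring_ms, verification_ms, k = 4.96, 54.92, 1.20, 4

step_ms = k * proposal_ms_per_tok + scoring_ms + verification_ms  # ~76 ms per speculative step
print(f"proposal:     {k * proposal_ms_per_tok:5.1f} ms ({k * proposal_ms_per_tok / step_ms:4.0%})")
print(f"scoring:      {scoring_ms:5.1f} ms ({scoring_ms / step_ms:4.0%})")
print(f"verification: {verification_ms:5.1f} ms ({verification_ms / step_ms:4.0%})")
# proposal:      19.8 ms ( 26%)
# scoring:       54.9 ms ( 72%)
# verification:   1.2 ms (  2%)
```

Since the scoring pass runs the 72B target over all draft positions in one batched forward, its cost is close to a single decode step of the target model, so it is expected to dominate whenever the draft model is much smaller than the target.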