vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The metrics have not improved. #6494

Open zjjznw123 opened 1 month ago

zjjznw123 commented 1 month ago

Your current environment

vLLM 0.5.0, A100, CUDA 12.1

🐛 Describe the bug

Command 1:

```shell
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model /home/Qwen1.5-1.8B-Chat \
    --gpu-memory-utilization 0.5 \
    --enable-prefix-caching
```

Metrics: num requests: 2000, ttft: 62.9838 ms, tpot: 14.4647 ms, avg_latency: 1567.14 ms, avg_throughput: 485.52 tokens/s

Command 2:

```shell
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model /home/Qwen1.5-1.8B-Chat \
    --gpu-memory-utilization 0.5 \
    --enable-prefix-caching \
    --speculative-model "[ngram]" \
    --num-speculative-tokens 8 \
    --use-v2-block-manager \
    --ngram-prompt-lookup-max 8 \
    --ngram-prompt-lookup-min 2
```

Metrics: num requests: 2000, ttft: 66.4831 ms, tpot: 21.7066 ms, avg_latency: 1360.00 ms, avg_throughput: 301.09 tokens/s

The vLLM version is 0.5.0, on an A100 machine with CUDA 12.1. Comparing the metrics of commands 1 and 2, I expected command 2 to improve all of the metrics, but they have degraded instead. Why is that, and how should the parameters be adjusted?
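
For anyone trying to reproduce the comparison, here is a minimal sketch of how ttft/tpot/latency can be collected against the OpenAI-compatible server started by either command. It is not the script used for the numbers above; it assumes the server listens on localhost:8000, that the model name is the same path passed to `--model`, and it approximates one output token per streamed chunk.

```python
# Rough measurement sketch against a vLLM OpenAI-compatible server (assumed
# defaults: localhost:8000, model name equal to the --model path).
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed default host/port


def measure_one(prompt: str, max_tokens: int = 512):
    payload = {
        "model": "/home/Qwen1.5-1.8B-Chat",  # adjust if --served-model-name is set
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": True,
    }
    start = time.perf_counter()
    ttft = None
    n_chunks = 0
    with requests.post(URL, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line == b"data: [DONE]":
                break
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first streamed chunk
            n_chunks += 1
    total = time.perf_counter() - start
    if ttft is None:
        ttft = total
    # Approximation: treat each streamed chunk as one output token.
    tpot = (total - ttft) / max(n_chunks - 1, 1)
    return ttft, tpot, total


print(measure_one("What is the weather like today?"))
```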

w013nad commented 1 month ago

Speculative decoding is heavily dependent on batch size and your input. If you have no input, there is nothing for the model to expand on, yet you still pay the additional overhead of predicting future tokens.

Likewise, at higher batch sizes the model is predicting future tokens for all requests and rejecting some of them. Since you are at full GPU utilization anyway, the model would have been better off predicting one token at a time for each request, where every predicted token is kept.
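
A back-of-the-envelope way to see the trade-off (my own simplification, not vLLM's actual scheduler accounting): with k speculative tokens and a per-token acceptance probability p, the target model verifies k + 1 positions per step, but only some of them turn into output, so at full GPU utilization the rejected positions are wasted compute.

```python
# Simplified expected-speedup arithmetic, not vLLM's internal bookkeeping.
def expected_accepted(p: float, k: int) -> float:
    # Expected tokens emitted per verification step with geometric acceptance:
    # (1 - p**(k + 1)) / (1 - p), as in the usual speculative-decoding analysis.
    return (1 - p ** (k + 1)) / (1 - p) if p < 1 else k + 1


k = 8  # --num-speculative-tokens 8, as in command 2
for p in (0.3, 0.6, 0.9):
    accepted = expected_accepted(p, k)
    efficiency = accepted / (k + 1)  # fraction of verified positions that are useful
    print(f"p={p:.1f}  accepted/step={accepted:.2f}  useful work={efficiency:.0%}")
```

With low acceptance rates most of the verified positions are thrown away, which matches the observation that a compute-bound server can lose throughput with speculation enabled.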

On my setup, I get 44 tok/s with method 1 for a single user; with method 2 I get 40 tok/s in the worst case and ~150 tok/s in the best case (asking it to count to 100 with 1-100 already in context).

However, under heavy load, method 1 gets ~1800 tok/s whereas method 2 gets ~1600 tok/s.
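
For reference, the best-case prompt described above can be approximated like this (a hypothetical reconstruction, not the exact prompt used): because the numbers 1-100 are already in the context, the "[ngram]" proposer can copy long matching spans.

```python
# Hypothetical reconstruction of the "count to 100" best case for ngram speculation.
numbers = " ".join(str(i) for i in range(1, 101))
best_case_prompt = (
    f"Here is a list of numbers: {numbers}\n"
    "Please count from 1 to 100, separated by spaces."
)
# Send best_case_prompt through the same /v1/chat/completions request as above;
# the expected continuation already appears verbatim in the prompt.
```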

zjjznw123 commented 1 month ago

My input data format is as follows:

json 复制代码 { "model": "Qwen1.5-1.8B-Chat", "messages": [ { "role": "user", "system": "You are a helpful assistant, please help me answer the following question." }, { "role": "user", "content": "What is the weather like today?" } ], "temperature": 0, "top_p": 1, "n": 1, "max_tokens": 512, "stream": true } I am using an online test with 10 concurrent requests, but it indeed didn't improve. Can you share your parameter data for reference?