Open liulfy opened 4 months ago
Your current environment
When I run inference with Qwen1.5-110B-Chat-GPTQ-Int4 (https://modelscope.cn/models/qwen/Qwen1.5-110B-Chat-GPTQ-Int4) using vLLM 0.4.2 and `AsyncLLMEngine` on an A100 80G, the real batch size is only 2. I set the parameters as follows:

- `gpu_memory_utilization = 0.95`
- `max_parallel_loading_workers = 4`
- `swap_space = 4`
- `max_model_len = 1024`
- `max_num_seqs = 8`

I do not use beam search. Each prompt is about 330 tokens and the output is about 5 tokens. Observed latencies:

- QPS = 1: each request takes about 0.45 s
- QPS = 2: about 0.70 s
- QPS = 4: about 1.4 s
- QPS = 8: about 2.9 s

According to the output metrics (`vllm.sequence.RequestMetrics`), the real batch size is only 2. How can I improve it? Thanks!
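For reference, a minimal sketch of how the engine is set up (only the engine arguments above are the actual values; the model path, prompt, and sampling settings here are placeholders):

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Engine arguments matching the settings listed above.
engine_args = AsyncEngineArgs(
    model="/path/to/Qwen1.5-110B-Chat-GPTQ-Int4",  # placeholder local path
    quantization="gptq",  # optional; vLLM can usually detect GPTQ from the checkpoint config
    gpu_memory_utilization=0.95,
    max_parallel_loading_workers=4,
    swap_space=4,
    max_model_len=1024,
    max_num_seqs=8,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)


async def run_one(prompt: str) -> str:
    # The real requests use ~330-token prompts and produce ~5 output tokens.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16)
    request_id = str(uuid.uuid4())
    final_output = None
    # AsyncLLMEngine.generate is an async generator; the last item holds the full output.
    async for output in engine.generate(prompt, sampling_params, request_id):
        final_output = output
    return final_output.outputs[0].text


if __name__ == "__main__":
    print(asyncio.run(run_one("Hello")))
```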