vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: low GPU usage in qwen1.5 110b int4 inference #6239

Open · liulfy opened this issue 4 months ago

liulfy commented 4 months ago

Your current environment

When I try to run inference with Qwen1.5 110B Int4 (https://modelscope.cn/models/qwen/Qwen1.5-110B-Chat-GPTQ-Int4) using vLLM (0.4.2) AsyncLLMEngine on an A100 80G, I find that the real batch size is only 2. I set the parameters as follows:

gpu_memory_utilization = 0.95
max_parallel_loading_workers = 4
swap_space = 4
max_model_len = 1024
max_num_seqs = 8

I don't use beam search. The prompt is 330 tokens, and the output is about 5 tokens.

When QPS = 1, each request takes about 0.45 seconds.
When QPS = 2, each request takes about 0.70 seconds.
When QPS = 4, each request takes about 1.4 seconds.
When QPS = 8, each request takes about 2.9 seconds.

According to the output metrics (vllm.sequence.RequestMetrics), the real batch size is only 2. How can I improve it? Thanks!
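For context, a minimal sketch of how these engine arguments would be passed to AsyncLLMEngine, assuming the vLLM 0.4.2 API; the model id and prompt are placeholders, not taken from the issue:

```python
# Minimal sketch, assuming vLLM 0.4.2's AsyncEngineArgs / AsyncLLMEngine API.
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main() -> None:
    engine_args = AsyncEngineArgs(
        model="Qwen/Qwen1.5-110B-Chat-GPTQ-Int4",  # placeholder model id/path
        quantization="gptq",
        gpu_memory_utilization=0.95,
        max_parallel_loading_workers=4,
        swap_space=4,
        max_model_len=1024,
        max_num_seqs=8,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    sampling_params = SamplingParams(max_tokens=16)
    request_id = str(uuid.uuid4())

    # generate() yields RequestOutput objects as tokens stream in.
    async for output in engine.generate("Hello, world", sampling_params, request_id):
        if output.finished:
            # output.metrics is the vllm.sequence.RequestMetrics object
            # mentioned above (arrival / scheduling / first-token timestamps).
            print(output.outputs[0].text)
            print(output.metrics)


asyncio.run(main())
```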

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!