vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Maximize GPU utilization for increased throughput #3257

Open rrichajalota opened 8 months ago

rrichajalota commented 8 months ago

I am using the vLLM OpenAI-compatible API endpoint to send concurrent requests to a Llama2-7B model deployed on a single A100 GPU. Regardless of the values I set for --block-size, --swap-space, --max-num-seqs, or --max-num-batched-tokens, GPU utilization fluctuates between 65% and 75% (it momentarily dips below or spikes above this range).
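
For reference, each request goes through the OpenAI-compatible endpoint roughly like this (a minimal sketch; the base URL, API key, model name, and prompt below are placeholders for my deployment):

```python
from openai import OpenAI

# Assumes the vLLM OpenAI-compatible server is reachable on the default
# port 8000; the model name must match whatever was passed to --model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```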

Is there a way to optimize GPU utilization and, consequently, enhance throughput?

I am testing with 400 prompts (i.e., 400 concurrent requests) and have also scaled up to 1200 requests, but to no effect. The load is generated roughly as shown below.
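
This is a simplified sketch of the load generator (the prompt set, model name, and worker count are placeholders; the real test uses my own prompt file):

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# 400 placeholder prompts standing in for the real test set.
prompts = [f"Question {i}: explain paged attention briefly." for i in range(400)]

def send(prompt: str) -> str:
    # One blocking chat-completion call per prompt; the vLLM server
    # batches whatever requests are in flight at the same time.
    resp = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Keep all 400 requests in flight at once so the scheduler can form
# large batches on the GPU.
with ThreadPoolExecutor(max_workers=400) as pool:
    results = list(pool.map(send, prompts))

print(f"Completed {len(results)} requests")
```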

Configuration:

vLLM == 0.3.3 (pulled from the latest Docker image)
Model: Llama2-7B
CUDA 12.0
1 x A100 80GB GPU

Any help would be appreciated!

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!