vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Maximize GPU utilization for increased throughput #3257

Open rrichajalota opened 8 months ago

rrichajalota commented 8 months ago

I am using the vLLM OpenAI-compatible API endpoint to send concurrent requests to a Llama2-7B model deployed on a single A100 GPU. Regardless of the values I set for --block-size, --swap-space, --max-num-seqs, or --max-num-batched-tokens, GPU utilization fluctuates between 65% and 75% (it momentarily dips below or spikes above this range).
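
For reference, each request goes through the OpenAI-compatible endpoint roughly like this (a minimal sketch; the base URL, API key, model name, and prompt below are placeholders for my deployment):

```python
from openai import OpenAI

# Assumes the vLLM OpenAI-compatible server is reachable on the default
# port 8000; the model name must match whatever was passed to --model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```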

Is there a way to optimize GPU utilization and, consequently, enhance throughput?

I am testing with 400 prompts (i.e., 400 concurrent requests) and have also scaled up to 1200 requests, but to no effect. The load is generated roughly as shown below.
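
This is a simplified sketch of the load generator (the prompt set, model name, and worker count are placeholders; the real test uses my own prompt file):

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# 400 placeholder prompts standing in for the real test set.
prompts = [f"Question {i}: explain paged attention briefly." for i in range(400)]

def send(prompt: str) -> str:
    # One blocking chat-completion call per prompt; the vLLM server
    # batches whatever requests are in flight at the same time.
    resp = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Keep all 400 requests in flight at once so the scheduler can form
# large batches on the GPU.
with ThreadPoolExecutor(max_workers=400) as pool:
    results = list(pool.map(send, prompts))

print(f"Completed {len(results)} requests")
```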

Configuration:

vLLM == 0.3.3 (pulled from the latest Docker image)
Model: Llama2-7B
CUDA 12.0
1 x A100 80GB GPU

Any help would be appreciated!

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!