vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Why does the GPU utilization fluctuate so much? #7858

Open wanghia opened 2 months ago

wanghia commented 2 months ago

Your current environment

I'm using 8 A100 GPUs (40 GB each) to deploy Llama 3 70B. Under high concurrency, the average GPU utilization is only about 50%, and it swings up and down constantly. Why does the GPU utilization fluctuate so much? Here are my launch parameters:

```bash
CUDA_VISIBLE_DEVICES={gpu} RAY_memory_monitor_refresh_ms=0 \
python3 -u -m vllm.entrypoints.api_server \
    --trust-remote-code \
    --gpu-memory-utilization 0.98 \
    --dtype float16 \
    --enforce-eager \
    --swap-space 16 \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port {port} \
    --max-num-seqs 512 \
    -tp {tp} \
    --max-model-len 8192 \
    --model {model_path}
```

vllm==0.5.4
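
For reference, the utilization numbers above were measured by periodically sampling `nvidia-smi`. Below is a minimal sketch (not from the original report) of how the same measurement can be scripted with pynvml to quantify the fluctuation per GPU; the sample count and 1-second interval are arbitrary assumptions.

```python
# Sketch: sample per-GPU utilization over time and report min/max/avg.
# Assumes the NVML Python bindings are installed (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

samples = []  # one list of per-GPU utilization percentages per tick
try:
    for _ in range(60):  # sample once per second for one minute (assumed window)
        samples.append([pynvml.nvmlDeviceGetUtilizationRates(h).gpu
                        for h in handles])
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()

for gpu_idx in range(len(handles)):
    series = [s[gpu_idx] for s in samples]
    print(f"GPU {gpu_idx}: min={min(series)}% max={max(series)}% "
          f"avg={sum(series) / len(series):.1f}%")
```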


wxsms commented 2 months ago

Is this the same as #7768?