### Your current environment
I'm deploying LLaMA 3 70B across 8 A100 40GB GPUs. Under high concurrency, average GPU utilization is only about 50%, and it swings widely over time. Why does GPU utilization fluctuate so much? My launch parameters, and a sketch of how load is driven, are below.
```bash
CUDA_VISIBLE_DEVICES={gpu} RAY_memory_monitor_refresh_ms=0 \
python3 -u -m vllm.entrypoints.api_server \
    --trust-remote-code \
    --gpu-memory-utilization 0.98 \
    --dtype float16 \
    --enforce-eager \
    --swap-space 16 \
    --disable-log-requests \
    --host 0.0.0.0 --port {port} \
    --max-num-seqs 512 \
    -tp {tp} \
    --max-model-len 8192 \
    --model {model_path}
```
vllm==0.5.4
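Load is driven by many concurrent HTTP requests against the server; a minimal sketch of such a client is below (illustrative only, not the exact production client — it assumes the default `/generate` JSON schema of `vllm.entrypoints.api_server`, and the host, port, prompt, and concurrency values are placeholders). Utilization can be watched in parallel with `nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1`.

```python
# Illustrative concurrent-load client (assumed setup): posts many
# requests at once to the demo api_server so GPU utilization under
# high concurrency can be observed.
import asyncio

import aiohttp

URL = "http://localhost:8000/generate"  # placeholder; match the launch command's host/port

async def one_request(session: aiohttp.ClientSession, prompt: str) -> None:
    # The demo api_server forwards extra JSON fields to SamplingParams,
    # so max_tokens/temperature are accepted alongside the prompt.
    payload = {"prompt": prompt, "max_tokens": 256, "temperature": 0.0}
    async with session.post(URL, json=payload) as resp:
        await resp.json()

async def main(concurrency: int = 256) -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(
            one_request(session, f"Request {i}: summarize the KV cache.")
            for i in range(concurrency)
        ))

if __name__ == "__main__":
    asyncio.run(main())
```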
### How would you like to use vllm
I want to run inference on a [specific model](put link here), but I don't know how to integrate it with vllm.
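So far I have only tried the standard offline entry point, along these lines (a minimal sketch; `my-org/my-model` is a placeholder for the actual checkpoint):

```python
# Minimal offline-inference sketch with vLLM's LLM API.
# "my-org/my-model" is a placeholder for the model in question.
from vllm import LLM, SamplingParams

llm = LLM(model="my-org/my-model", trust_remote_code=True)
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    print(out.outputs[0].text)
```

If the architecture isn't already supported, is the "Adding a New Model" guide in the docs the right path?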
### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.