Description
When I run 2 Triton containers, each loading 1 model with the vLLM backend and the same configuration, it works on a T4 GPU and only takes ~10 GB of the 15 GB, with a total gpu_memory_utilization of 0.47 * 2 = 0.94 of the GPU.
PID USER DEV TYPE GPU GPU MEM CPU HOST MEM Command
1880122 root 0 Compute 0% 5102MiB 33% 1% 5959MiB /opt/tritonserver/backends/python/triton_python_backend_stub /models_vllm_gemma2B/gemma_2B_1/1/model.py t
1894187 root 0 Compute 0% 5102MiB 33% 1% 5950MiB /opt/tritonserver/backends/python/triton_python_backend_stub /models_vllm_gemma2B_2/gemma_2B_1/1/model.py
1879648 root 0 Compute 0% 164MiB 1% 1% 420MiB tritonserver --model-repository=/models_vllm_gemma2B --http-port 12000 --grpc-port 12001 --metrics-port 1
1893772 root 0 Compute 0% 164MiB 1% 1% 412MiB tritonserver --model-repository=/models_vllm_gemma2B_2 --http-port 11000 --grpc-port 11001 --metrics-port
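model.json config. The actual file contents were not included in this report, so the block below is only a minimal sketch of a typical model.json that passes vLLM engine arguments; the model path and the two extra flags are assumptions, while gpu_memory_utilization = 0.47 comes from the description above.

{
    "model": "google/gemma-2b",
    "gpu_memory_utilization": 0.47,
    "disable_log_requests": true,
    "enforce_eager": true
}

These fields are forwarded to vLLM's AsyncEngineArgs, so gpu_memory_utilization is the fraction of the GPU's memory each engine is allowed to use for weights, activations, and KV cache.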
Log from 1 container running the Triton vLLM model:
INFO 08-22 15:08:23 model_runner.py:732] Loading model weights took 4.7384 GB
INFO 08-22 15:08:24 gpu_executor.py:102] # GPU blocks: 260, # CPU blocks: 14563
I0822 15:08:26.990356 4152 model_lifecycle.cc:838] "successfully loaded 'gemma_2B_1'"
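As a rough sanity check, assuming vLLM's default block_size of 16: 260 GPU blocks * 16 tokens/block ≈ 4160 tokens of KV cache capacity per engine; that cache is carved out of the 0.47 * 15 GB ≈ 7 GB budget that remains after the 4.74 GB of weights and the profiling overhead.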
However, when I run 1 Triton container and serve 2 models with the vLLM backend using the same configuration, I get the following error.
Triton Information
Triton version 22.06
vLLM version 0.5.4
Expected behavior
Serving 2 models with 2 Triton containers works, but serving 2 models with 1 Triton container fails. Why is that?
Please advise me on how to resolve it. Thank you.
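For reference, the failing single-container run used one model repository holding both models. Based on the paths in the process listing above, it likely looked roughly like the sketch below; the combined repository name, the second model's directory name, and the per-model config files are assumptions.

models_vllm_gemma2B_combined/
├── gemma_2B_1/
│   ├── config.pbtxt
│   └── 1/
│       ├── model.py
│       └── model.json
└── gemma_2B_2/
    ├── config.pbtxt
    └── 1/
        ├── model.py
        └── model.json

It was then started with a single server along the lines of: tritonserver --model-repository=/models_vllm_gemma2B_combined --http-port 12000 --grpc-port 12001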