triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

[ERROR] No available memory for the cache blocks. #7562

Open TheNha opened 3 months ago

TheNha commented 3 months ago

Description: When I run 2 Triton containers, each loading one model with the vLLM backend and the same configuration, it works on a T4 GPU and only takes ~10 GB of the 15 GB; the total gpu_memory_utilization is 0.47 * 2 = 0.94 of the GPU.

PID USER DEV     TYPE  GPU        GPU MEM    CPU  HOST MEM Command
1880122 root   0  Compute   0%   5102MiB  33%     1%   5959MiB /opt/tritonserver/backends/python/triton_python_backend_stub /models_vllm_gemma2B/gemma_2B_1/1/model.py t
1894187 root   0  Compute   0%   5102MiB  33%     1%   5950MiB /opt/tritonserver/backends/python/triton_python_backend_stub /models_vllm_gemma2B_2/gemma_2B_1/1/model.py
1879648 root   0  Compute   0%    164MiB   1%     1%    420MiB tritonserver --model-repository=/models_vllm_gemma2B --http-port 12000 --grpc-port 12001 --metrics-port 1
1893772 root   0  Compute   0%    164MiB   1%     1%    412MiB tritonserver --model-repository=/models_vllm_gemma2B_2 --http-port 11000 --grpc-port 11001 --metrics-port
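For reference, here is the rough arithmetic behind those numbers (a back-of-the-envelope sketch only; the 15 GB figure is the nominal T4 capacity and the per-process figures are taken from the table above):

# Rough per-engine memory budget on a single T4 (~15 GB), using the figures
# from the process table above. This only sanity-checks the 0.47 * 2 = 0.94
# reasoning; it is not how vLLM itself computes anything.
TOTAL_GPU_GB = 15.0      # nominal T4 capacity
GPU_MEM_UTIL = 0.47      # per model.json below

budget_per_engine = TOTAL_GPU_GB * GPU_MEM_UTIL        # ~7.05 GB each
resident_per_stub = 5.102                              # GPU MEM of each python stub
tritonserver_overhead = 0.164                          # per tritonserver process

total_resident = 2 * (resident_per_stub + tritonserver_overhead)
print(f"budget per engine : {budget_per_engine:.2f} GB")
print(f"combined budget   : {2 * budget_per_engine:.2f} GB (0.94 of the GPU)")
print(f"actually resident : {total_resident:.2f} GB (~10 GB of 15 GB)")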

model.json config.

{
    "model":"/models/gemma-2b-it/",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.47,
    "enforce_eager": true,
    "dtype": "float16",
    "max_model_len": 64
}

Log of one container running the Triton vLLM model:

INFO 08-22 15:08:23 model_runner.py:732] Loading model weights took 4.7384 GB
INFO 08-22 15:08:24 gpu_executor.py:102] # GPU blocks: 260, # CPU blocks: 14563
I0822 15:08:26.990356 4152 model_lifecycle.cc:838] "successfully loaded 'gemma_2B_1'"
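If I understand vLLM's startup correctly, the "# GPU blocks" figure is whatever is left of the gpu_memory_utilization budget after the weights are loaded and the activation profiling run completes, divided by the size of one KV-cache block. The sketch below is my paraphrase of that calculation, not the actual vLLM code, and the Gemma-2B shape numbers are assumptions for illustration:

# Sketch of how the "# GPU blocks: 260" figure is (roughly) derived.
# NOTE: paraphrase of vLLM 0.5.x behaviour, not the real implementation;
# the Gemma-2B shape defaults below are assumptions.
def estimate_num_gpu_blocks(total_gpu_bytes, gpu_memory_utilization, peak_usage_bytes,
                            block_size=16, num_layers=18, num_kv_heads=1,
                            head_size=256, dtype_bytes=2):
    # Bytes for one KV-cache block: K and V for every layer, block_size tokens.
    block_bytes = block_size * num_layers * num_kv_heads * head_size * 2 * dtype_bytes
    # Whatever the utilization budget leaves after weights + activation profiling.
    cache_budget = total_gpu_bytes * gpu_memory_utilization - peak_usage_bytes
    return int(cache_budget // block_bytes)

# If other engines are already holding GPU memory when this runs, peak_usage_bytes
# is larger, cache_budget can go to zero or negative, and the engine has no room
# for cache blocks -- which is presumably the error shown below.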

However, when I run 1 Triton container serving both models with vLLM and the same configuration, I get the following error.

I0822 15:15:43.418942 4440 server.cc:674] 
+------------+---------+---------------------------------------------------------------------------------------------+
| Model      | Version | Status                                                                                      |
+------------+---------+---------------------------------------------------------------------------------------------+
| gemma_2B_1 | 1       | UNAVAILABLE: Internal: ValueError: No available memory for the cache blocks. Try increasing |
|            |         |  `gpu_memory_utilization` when initializing the engine.                                     |
|            |         |                                                                                             |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(369): raise_if_cache_size_i |
|            |         | nvalid                                                                                      |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py(105): initialize_ca |
|            |         | che                                                                                         |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(214): initialize_cache      |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(552): _init_engin |
|            |         | e                                                                                           |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(471): from_engine |
|            |         | _args                                                                                       |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py(263): __init__          |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(552): _init_engine |
| gemma_2B_2 | 1       | UNAVAILABLE: Internal: ValueError: No available memory for the cache blocks. Try increasing |
|            |         |  `gpu_memory_utilization` when initializing the engine.                                     |
|            |         |                                                                                             |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(369): raise_if_cache_size_i |
|            |         | nvalid                                                                                      |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py(105): initialize_ca |
|            |         | che                                                                                         |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(214): initialize_cache      |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(552): _init_engin |
|            |         | e                                                                                           |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(471): from_engine |
|            |         | _args                                                                                       |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py(263): __init__          |
|            |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(552): _init_engine |

Triton Information
Triton version: 22.06
vLLM version: 0.5.4

Expected behavior
When I serve the 2 models with 2 Triton containers it works, but when I serve the same 2 models with 1 Triton container it fails. Please advise me on how to resolve this. Thank you.
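For reference, this is the kind of minimal standalone script I would expect to show the same behaviour: two vLLM engines brought up back to back on one T4 with the same model.json settings. Using vllm.LLM directly, and assuming this approximates how the one-container setup loads the two models, are both guesses on my part:

# Hypothetical reproduction: two vLLM engines on one GPU, each asking for
# gpu_memory_utilization=0.47, mirroring the two model.json files above.
# Whether this matches what the Triton vLLM backend does internally is an
# assumption, not something I have verified.
import torch
from vllm import LLM

def report(tag):
    free, total = torch.cuda.mem_get_info()
    print(f"{tag}: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")

report("before engine 1")
llm1 = LLM(model="/models/gemma-2b-it/", gpu_memory_utilization=0.47,
           enforce_eager=True, dtype="float16", max_model_len=64)
report("after engine 1")

# The second engine profiles a GPU that is already roughly half full, so its
# 0.47 budget may leave nothing for KV-cache blocks.
llm2 = LLM(model="/models/gemma-2b-it/", gpu_memory_utilization=0.47,
           enforce_eager=True, dtype="float16", max_model_len=64)
report("after engine 2")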