Dagoli opened this issue 2 months ago
```
tritonserver --model-repository ./model_repository
I0830 05:55:27.376363 824007 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7db202cd5000' with size 268435456"
I0830 05:55:27.376542 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0830 05:55:27.376550 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0830 05:55:27.376556 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0830 05:55:27.376562 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0830 05:55:27.376572 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 4 with size 67108864"
I0830 05:55:27.376581 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 5 with size 67108864"
I0830 05:55:27.376587 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 6 with size 67108864"
I0830 05:55:27.376592 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 7 with size 67108864"
I0830 05:55:27.376599 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 8 with size 67108864"
I0830 05:55:27.376605 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 9 with size 67108864"
I0830 05:55:27.376612 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 10 with size 67108864"
I0830 05:55:27.376618 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 11 with size 67108864"
I0830 05:55:27.376626 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 12 with size 67108864"
I0830 05:55:27.376632 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 13 with size 67108864"
I0830 05:55:27.376638 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 14 with size 67108864"
I0830 05:55:27.376646 824007 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 15 with size 67108864"
W0830 05:55:27.507430 824007 model_lifecycle.cc:111] "ignore version directory '__pycache__' which fails to convert to integral number"
I0830 05:55:27.507467 824007 model_lifecycle.cc:472] "loading: vllm1:1"
I0830 05:55:32.920230 824007 python_be.cc:1912] "TRITONBACKEND_ModelInstanceInitialize: vllm1_0_0 (GPU device 13)"
I0830 05:55:32.920261 824007 python_be.cc:1912] "TRITONBACKEND_ModelInstanceInitialize: vllm1_1_0 (GPU device 14)"
{'model_config': '{"name":"vllm1","platform":"","backend":"vllm","runtime":"model.py","version_policy":{"latest":{"num_versions":1}},"max_batch_size":0,"input":[{"name":"text_input","data_type":"TYPE_STRING","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":false},{"name":"stream","data_type":"TYPE_BOOL","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":true},{"name":"sampling_parameters","data_type":"TYPE_STRING","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":true},{"name":"exclude_input_in_output","data_type":"TYPE_BOOL","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":true}],"output":[{"name":"text_output","data_type":"TYPE_STRING","dims":[-1],"label_filename":"","is_shape_tensor":false}],"batch_input":[],"batch_output":[],"optimization":{"priority":"PRIORITY_DEFAULT","input_pinned_memory":{"enable":true},"output_pinned_memory":{"enable":true},"gather_kernel_buffer_threshold":0,"eager_batching":false},"instance_group":[{"name":"vllm1_0","kind":"KIND_GPU","count":1,"gpus":[13],"secondary_devices":[],"profile":[],"passive":false,"host_policy":""},{"name":"vllm1_1","kind":"KIND_GPU","count":1,"gpus":[14],"secondary_devices":[],"profile":[],"passive":false,"host_policy":""}],"default_model_filename":"","cc_model_filenames":{},"metric_tags":{},"parameters":{},"model_warmup":[],"model_transaction_policy":{"decoupled":true}}', 'model_instance_kind': 'GPU', 'model_instance_name': 'vllm1_1_0', 'model_instance_device_id': '14', 'model_repository': './model_repository/vllm1', 'model_version': '1', 'model_name': 'vllm1'}
I0830 05:55:36.655476 824007 model.py:198] "Detected KIND_GPU model instance, explicitly setting GPU device=14 for vllm1_14"
INFO 08-30 13:55:36 llm_engine.py:88] Initializing an LLM engine with config: model='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
{'model_config': '{"name":"vllm1","platform":"","backend":"vllm","runtime":"model.py","version_policy":{"latest":{"num_versions":1}},"max_batch_size":0,"input":[{"name":"text_input","data_type":"TYPE_STRING","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":false},{"name":"stream","data_type":"TYPE_BOOL","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":true},{"name":"sampling_parameters","data_type":"TYPE_STRING","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":true},{"name":"exclude_input_in_output","data_type":"TYPE_BOOL","format":"FORMAT_NONE","dims":[1],"is_shape_tensor":false,"allow_ragged_batch":false,"optional":true}],"output":[{"name":"text_output","data_type":"TYPE_STRING","dims":[-1],"label_filename":"","is_shape_tensor":false}],"batch_input":[],"batch_output":[],"optimization":{"priority":"PRIORITY_DEFAULT","input_pinned_memory":{"enable":true},"output_pinned_memory":{"enable":true},"gather_kernel_buffer_threshold":0,"eager_batching":false},"instance_group":[{"name":"vllm1_0","kind":"KIND_GPU","count":1,"gpus":[13],"secondary_devices":[],"profile":[],"passive":false,"host_policy":""},{"name":"vllm1_1","kind":"KIND_GPU","count":1,"gpus":[14],"secondary_devices":[],"profile":[],"passive":false,"host_policy":""}],"default_model_filename":"","cc_model_filenames":{},"metric_tags":{},"parameters":{},"model_warmup":[],"model_transaction_policy":{"decoupled":true}}', 'model_instance_kind': 'GPU', 'model_instance_name': 'vllm1_0_0', 'model_instance_device_id': '13', 'model_repository': './model_repository/vllm1', 'model_version': '1', 'model_name': 'vllm1'}
I0830 05:55:36.704326 824007 model.py:198] "Detected KIND_GPU model instance, explicitly setting GPU device=13 for vllm1_13"
INFO 08-30 13:55:36 llm_engine.py:88] Initializing an LLM engine with config: model='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
WARNING 08-30 13:55:36 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 08-30 13:55:36 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-30 13:55:47 llm_engine.py:422] # GPU blocks: 6816, # CPU blocks: 9362
INFO 08-30 13:55:47 llm_engine.py:422] # GPU blocks: 6569, # CPU blocks: 9362
I0830 05:55:51.093848 824007 model_lifecycle.cc:838] "successfully loaded 'vllm1'"
I0830 05:55:51.093944 824007 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0830 05:55:51.094002 824007 server.cc:631]
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                                        |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| vllm    | /opt/tritonserver/backends/vllm/model.py              | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0830 05:55:51.094057 824007 server.cc:674]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
| vllm1 | 1       | READY  |
+-------+---------+--------+
I0830 05:55:51.219742 824007 metrics.cc:877] "Collecting metrics for GPU 0: N/A"
I0830 05:55:51.219770 824007 metrics.cc:877] "Collecting metrics for GPU 1: N/A"
I0830 05:55:51.219780 824007 metrics.cc:877] "Collecting metrics for GPU 2: N/A"
I0830 05:55:51.219790 824007 metrics.cc:877] "Collecting metrics for GPU 3: N/A"
I0830 05:55:51.219798 824007 metrics.cc:877] "Collecting metrics for GPU 4: N/A"
I0830 05:55:51.219809 824007 metrics.cc:877] "Collecting metrics for GPU 5: N/A"
I0830 05:55:51.219817 824007 metrics.cc:877] "Collecting metrics for GPU 6: N/A"
I0830 05:55:51.219827 824007 metrics.cc:877] "Collecting metrics for GPU 7: N/A"
I0830 05:55:51.219839 824007 metrics.cc:877] "Collecting metrics for GPU 8: N/A"
I0830 05:55:51.219848 824007 metrics.cc:877] "Collecting metrics for GPU 9: N/A"
I0830 05:55:51.219857 824007 metrics.cc:877] "Collecting metrics for GPU 10: N/A"
I0830 05:55:51.219866 824007 metrics.cc:877] "Collecting metrics for GPU 11: N/A"
I0830 05:55:51.219875 824007 metrics.cc:877] "Collecting metrics for GPU 12: N/A"
I0830 05:55:51.219883 824007 metrics.cc:877] "Collecting metrics for GPU 13: N/A"
I0830 05:55:51.219892 824007 metrics.cc:877] "Collecting metrics for GPU 14: N/A"
I0830 05:55:51.219900 824007 metrics.cc:877] "Collecting metrics for GPU 15: N/A"
I0830 05:55:51.558238 824007 tritonserver.cc:2598]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                            |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton |
| server_version                   | 2.48.0 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | ./model_repository |
| model_control_mode               | MODE_NONE |
| strict_model_config              | 0 |
| model_config_name                | |
| rate_limit                       | OFF |
| pinned_memory_pool_byte_size     | 268435456 |
| cuda_memory_pool_byte_size{0}    | 67108864 |
| cuda_memory_pool_byte_size{1}    | 67108864 |
| cuda_memory_pool_byte_size{2}    | 67108864 |
| cuda_memory_pool_byte_size{3}    | 67108864 |
| cuda_memory_pool_byte_size{4}    | 67108864 |
| cuda_memory_pool_byte_size{5}    | 67108864 |
| cuda_memory_pool_byte_size{6}    | 67108864 |
| cuda_memory_pool_byte_size{7}    | 67108864 |
| cuda_memory_pool_byte_size{8}    | 67108864 |
| cuda_memory_pool_byte_size{9}    | 67108864 |
| cuda_memory_pool_byte_size{10}   | 67108864 |
| cuda_memory_pool_byte_size{11}   | 67108864 |
| cuda_memory_pool_byte_size{12}   | 67108864 |
| cuda_memory_pool_byte_size{13}   | 67108864 |
| cuda_memory_pool_byte_size{14}   | 67108864 |
| cuda_memory_pool_byte_size{15}   | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness                 | 1 |
| exit_timeout                     | 30 |
| cache_enabled                    | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0830 05:55:51.558614 824007 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
I0830 05:55:51.599387 824007 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
```
These two instances both end up on GPU 0 (a quick way to check the placement is sketched below).
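As a sanity check (a sketch, assuming the nvidia-ml-py / pynvml package is installed; this is not part of the original report), one can list which GPU each compute process is actually resident on:

```python
# List the compute processes per GPU to confirm where the two vLLM
# instances allocated their memory (expected: GPUs 13 and 14, observed: GPU 0).
import pynvml

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            used_mib = (proc.usedGpuMemory or 0) / (1024 * 1024)
            print(f"GPU {idx}: pid={proc.pid} used={used_mib:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```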
Description

The model fails with CUDA OOM because it always executes on GPU 0, even though the instance groups specify GPUs 13 and 14. The actual device usage appears to be deferred to the vLLM engine. The model configuration (config.pbtxt) is:

```
name: "vllm1"
backend: "vllm"

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 13 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 14 ]
  }
]
```
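For context, the log line "Detected KIND_GPU model instance, explicitly setting GPU device=..." suggests the backend's model.py is supposed to pin each instance to the GPU index Triton passes in. A minimal sketch of that idea (not the actual backend code; the `args` keys are the ones visible in the args dump above, everything else is illustrative):

```python
import json

import torch


class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the GPU index from the matching instance_group entry,
        # e.g. '13' or '14' in the logs above.
        if args["model_instance_kind"] == "GPU":
            device_id = int(args["model_instance_device_id"])
            # Without this, tensors created on a bare "cuda" device default to
            # GPU 0, the process's current device.
            torch.cuda.set_device(device_id)

        self.model_config = json.loads(args["model_config"])
        # ... the vLLM AsyncLLMEngine would be constructed after this point ...
```

Even so, the OOM traces further down still report GPU 0, which is consistent with the device choice ultimately being made inside the vLLM engine, as noted above.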
Triton Information

Triton server version 2.48.0 (see server_version in the table above).
```
tritonserver --model-repository ./model_repository
I0830 04:19:45.248723 783669 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7d4e1cc10000' with size 268435456"
I0830 04:19:45.248916 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0830 04:19:45.248925 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0830 04:19:45.248932 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0830 04:19:45.248938 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0830 04:19:45.248945 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 4 with size 67108864"
I0830 04:19:45.248951 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 5 with size 67108864"
I0830 04:19:45.248960 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 6 with size 67108864"
I0830 04:19:45.248967 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 7 with size 67108864"
I0830 04:19:45.248974 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 8 with size 67108864"
I0830 04:19:45.248981 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 9 with size 67108864"
I0830 04:19:45.248987 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 10 with size 67108864"
I0830 04:19:45.248993 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 11 with size 67108864"
I0830 04:19:45.248999 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 12 with size 67108864"
I0830 04:19:45.249005 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 13 with size 67108864"
I0830 04:19:45.249011 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 14 with size 67108864"
I0830 04:19:45.249019 783669 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 15 with size 67108864"
W0830 04:19:45.346599 783669 model_lifecycle.cc:111] "ignore version directory '__pycache__' which fails to convert to integral number"
I0830 04:19:45.346640 783669 model_lifecycle.cc:472] "loading: vllm1:1"
I0830 04:19:50.820964 783669 python_be.cc:1912] "TRITONBACKEND_ModelInstanceInitialize: vllm1_0_0 (GPU device 13)"
I0830 04:19:50.821002 783669 python_be.cc:1912] "TRITONBACKEND_ModelInstanceInitialize: vllm1_1_0 (GPU device 14)"
I0830 04:19:54.624251 783669 model.py:197] "Detected KIND_GPU model instance, explicitly setting GPU device=14 for vllm1_14"
INFO 08-30 12:19:54 llm_engine.py:88] Initializing an LLM engine with config: model='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
I0830 04:19:54.686938 783669 model.py:197] "Detected KIND_GPU model instance, explicitly setting GPU device=13 for vllm1_13"
INFO 08-30 12:19:54 llm_engine.py:88] Initializing an LLM engine with config: model='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer='/home/zhaoanpu/lpf/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
WARNING 08-30 12:19:54 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 08-30 12:19:54 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-30 12:20:05 llm_engine.py:422] # GPU blocks: 9134, # CPU blocks: 9362
I0830 04:20:07.023076 783669 pb_stub.cc:366] "Failed to initialize Python stub: OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacty of 31.95 GiB of which 86.00 MiB is free. Of the allocated memory 14.26 GiB is allocated by PyTorch, and 302.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF\n\nAt:\n /opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py(124): allocate_gpu_cache\n /opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py(53): __init__\n /opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py(160): init_cache_engine\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(1099): _run_workers_in_batch\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(1125): _run_workers\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(442): _init_cache\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(132): __init__\n /opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py(416): _init_engine\n /opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py(368): __init__\n /opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py(675): from_engine_args\n /opt/tritonserver/backends/vllm/model.py(150): init_engine\n /opt/tritonserver/backends/vllm/model.py(111): initialize\n"
INFO 08-30 12:20:07 llm_engine.py:422] # GPU blocks: 3191, # CPU blocks: 9362
I0830 04:20:07.288119 783669 pb_stub.cc:366] "Failed to initialize Python stub: OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacty of 31.95 GiB of which 110.00 MiB is free. Of the allocated memory 12.70 GiB is allocated by PyTorch, and 46.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF\n\nAt:\n /opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py(124): allocate_gpu_cache\n /opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py(53): __init__\n /opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py(160): init_cache_engine\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(1099): _run_workers_in_batch\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(1125): _run_workers\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(442): _init_cache\n /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py(132): __init__\n /opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py(416): _init_engine\n /opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py(368): __init__\n /opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py(675): from_engine_args\n /opt/tritonserver/backends/vllm/model.py(150): init_engine\n /opt/tritonserver/backends/vllm/model.py(111): initialize\n"
```
Are you using the Triton container or did you build it yourself? I built it myself.