triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

triton 24.08: "Poll failed for model directory 'ensemble': unexpected platform type 'ensemble' for ensemble" #7632

Closed · xiejibing closed this issue 1 month ago

xiejibing commented 1 month ago

Description "Poll failed for model directory 'ensemble': unexpected platform type 'ensemble' for ensemble"

Triton Information: tritonserver:24.08 Docker image from https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-08.html#rel-24-08

Are you using the Triton container or did you build it yourself?

We use the officially provided container; we did not build it ourselves.

To Reproduce: start Triton (tritonserver:24.08) against the model repository described below; the ensemble model fails to load with the error above.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Models: pre (python backend), main_app (vllm backend), and ensemble.

Ensemble config file (config.pbtxt):

name: "ensemble"
platform: "ensemble"
input {
  name: "chat_input"
  data_type: TYPE_STRING
  dims: 1
}
output {
  name: "text_output"
  data_type: TYPE_STRING
  dims: 1
}
ensemble_scheduling {
  step {
    model_name: "pre"
    model_version: -1
    input_map {
      key: "chat_input"
      value: "chat_input"
    }
    output_map {
      key: "text_input"
      value: "text_input"
    }
  }
  step {
    model_name: "main_app"
    model_version: -1
    input_map {
      key: "text_input"
      value: "text_input"
    }
    output_map {
      key: "text_output"
      value: "text_output"
    }
  }
}
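
For context, here is a rough sketch of the model repository layout implied by this config and by the paths in the server logs below. The pre/0/model.py and main_app/0 paths are taken from the logs; the empty version directory under ensemble and the model.json file name are assumptions, not confirmed by the issue.

    triton_repo/
      ensemble/
        config.pbtxt          # the ensemble config shown above
        1/                    # empty version directory (assumed; ensembles have no model file)
      pre/
        config.pbtxt
        0/
          model.py            # python backend preprocessing model
      main_app/
        config.pbtxt
        0/                    # vllm model artifacts, e.g. model.json with engine args (assumed)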

Triton server logs:

I0919 14:14:03.169728 108294 cache_manager.cc:480] "Create CacheManager with cache_dir: '/opt/tritonserver/caches'"
I0919 14:14:03.499634 108294 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fda58000000' with size 268435456"
I0919 14:14:03.501032 108294 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0919 14:14:03.602352 108294 model_config_utils.cc:716] "Server side auto-completed config: "
name: "ensemble"
platform: "ensemble"
input {
  name: "chat_input"
  data_type: TYPE_STRING
  dims: 1
}
output {
  name: "text_output"
  data_type: TYPE_STRING
  dims: 1
}
ensemble_scheduling {
  step {
    model_name: "pre"
    model_version: -1
    input_map {
      key: "chat_input"
      value: "chat_input"
    }
    output_map {
      key: "text_input"
      value: "text_input"
    }
  }
  step {
    model_name: "main_app"
    model_version: -1
    input_map {
      key: "text_input"
      value: "text_input"
    }
    output_map {
      key: "text_output"
      value: "text_output"
    }
  }
}

E0919 14:14:03.603528 108294 model_repository_manager.cc:1460] "Poll failed for model directory 'ensemble': unexpected platform type 'ensemble' for ensemble"
I0919 14:14:03.705022 108294 model_config_utils.cc:716] "Server side auto-completed config: "
name: "main_app"
input {
  name: "text_input"
  data_type: TYPE_STRING
  dims: 1
}
input {
  name: "stream"
  data_type: TYPE_BOOL
  dims: 1
}
input {
  name: "sampling_parameters"
  data_type: TYPE_STRING
  dims: 1
}
input {
  name: "exclude_input_in_output"
  data_type: TYPE_BOOL
  dims: 1
}
output {
  name: "text_output"
  data_type: TYPE_STRING
  dims: 1
}
instance_group {
  count: 1
  kind: KIND_MODEL
}
backend: "vllm"

I0919 14:14:03.859925 108294 model_config_utils.cc:716] "Server side auto-completed config: "
name: "pre"
input {
  name: "chat_input"
  data_type: TYPE_STRING
  dims: 1
}
output {
  name: "text_input"
  data_type: TYPE_STRING
  dims: 1
}
instance_group {
  count: 1
}
default_model_filename: "model.py"
backend: "python"

I0919 14:14:03.878010 108294 model_lifecycle.cc:472] "loading: main_app:0"
I0919 14:14:03.883757 108294 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0919 14:14:03.883862 108294 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0919 14:14:03.885133 108294 python_be.cc:1618] "'vllm' TRITONBACKEND API version: 1.19"
I0919 14:14:03.885193 108294 python_be.cc:1640] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0919 14:14:03.885248 108294 python_be.cc:1778] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0919 14:14:03.885402 108294 python_be.cc:2075] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0919 14:14:03.886174 108294 python_be.cc:1879] "TRITONBACKEND_ModelInitialize: main_app (version 0)"
I0919 14:14:03.886688 108294 model_config_utils.cc:1941] "ModelConfig 64-bit fields:"
I0919 14:14:03.886730 108294 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::default_priority_level"
I0919 14:14:03.886762 108294 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds"
I0919 14:14:03.886792 108294 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::max_queue_delay_microseconds"
I0919 14:14:03.886819 108294 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::priority_levels"
I0919 14:14:03.886847 108294 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::priority_queue_policy::key"
I0919 14:14:03.886874 108294 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds"
I0919 14:14:03.886901 108294 model_config_utils.cc:1943] "\tModelConfig::ensemble_scheduling::step::model_version"
I0919 14:14:03.886928 108294 model_config_utils.cc:1943] "\tModelConfig::input::dims"
I0919 14:14:03.886969 108294 model_config_utils.cc:1943] "\tModelConfig::input::reshape::shape"
I0919 14:14:03.886997 108294 model_config_utils.cc:1943] "\tModelConfig::instance_group::secondary_devices::device_id"
I0919 14:14:03.887023 108294 model_config_utils.cc:1943] "\tModelConfig::model_warmup::inputs::value::dims"
I0919 14:14:03.887049 108294 model_config_utils.cc:1943] "\tModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim"
I0919 14:14:03.887075 108294 model_config_utils.cc:1943] "\tModelConfig::optimization::cuda::graph_spec::input::value::dim"
I0919 14:14:03.887103 108294 model_config_utils.cc:1943] "\tModelConfig::output::dims"
I0919 14:14:03.887131 108294 model_config_utils.cc:1943] "\tModelConfig::output::reshape::shape"
I0919 14:14:03.887158 108294 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::direct::max_queue_delay_microseconds"
I0919 14:14:03.887185 108294 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::max_sequence_idle_microseconds"
I0919 14:14:03.887211 108294 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::oldest::max_queue_delay_microseconds"
I0919 14:14:03.887238 108294 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::state::dims"
I0919 14:14:03.887265 108294 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::state::initial_state::dims"
I0919 14:14:03.887292 108294 model_config_utils.cc:1943] "\tModelConfig::version_policy::specific::versions"
I0919 14:14:03.888500 108294 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/main_app/0/model.py triton_vllm_backend_shm_region_d3235824-abbd-469b-9460-40fc9b6c9ff3 1048576 1048576 108294 /opt/tritonserver/backends/python 336 main_app /opt/tritonserver/backends/vllm"
I0919 14:14:03.893620 108294 model_lifecycle.cc:472] "loading: pre:0"
I0919 14:14:03.896897 108294 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0919 14:14:03.897011 108294 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0919 14:14:03.897322 108294 python_be.cc:1618] "'python' TRITONBACKEND API version: 1.19"
I0919 14:14:03.897368 108294 python_be.cc:1640] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0919 14:14:03.897410 108294 python_be.cc:1778] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0919 14:14:03.897527 108294 python_be.cc:2075] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0919 14:14:03.898255 108294 python_be.cc:1879] "TRITONBACKEND_ModelInitialize: pre (version 0)"
I0919 14:14:03.899716 108294 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/pre/0/model.py triton_python_backend_shm_region_01681891-9ea9-4ce2-801d-1cf1320ce6d1 1048576 1048576 108294 /opt/tritonserver/backends/python 336 pre DEFAULT"
I0919 14:14:06.189815 108294 python_be.cc:1574] "model configuration:\n{\n    \"name\": \"pre\",\n    \"platform\": \"\",\n    \"backend\": \"python\",\n    \"runtime\": \"\",\n    \"version_policy\": {\n        \"latest\": {\n            \"num_versions\": 1\n        }\n    },\n    \"max_batch_size\": 0,\n    \"input\": [\n        {\n            \"name\": \"chat_input\",\n            \"data_type\": \"TYPE_STRING\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"output\": [\n        {\n            \"name\": \"text_input\",\n            \"data_type\": \"TYPE_STRING\",\n            \"dims\": [\n                1\n            ],\n            \"label_filename\": \"\",\n            \"is_shape_tensor\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"batch_input\": [],\n    \"batch_output\": [],\n    \"optimization\": {\n        \"priority\": \"PRIORITY_DEFAULT\",\n        \"input_pinned_memory\": {\n            \"enable\": true\n        },\n        \"output_pinned_memory\": {\n            \"enable\": true\n        },\n        \"gather_kernel_buffer_threshold\": 0,\n        \"eager_batching\": false\n    },\n    \"instance_group\": [\n        {\n            \"name\": \"pre_0\",\n            \"kind\": \"KIND_GPU\",\n            \"count\": 1,\n            \"gpus\": [\n                0\n            ],\n            \"secondary_devices\": [],\n            \"profile\": [],\n            \"passive\": false,\n            \"host_policy\": \"\"\n        }\n    ],\n    \"default_model_filename\": \"model.py\",\n    \"cc_model_filenames\": {},\n    \"metric_tags\": {},\n    \"parameters\": {},\n    \"model_warmup\": []\n}"
I0919 14:14:06.213012 108294 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: pre_0_0 (GPU device 0)"
I0919 14:14:06.213568 108294 backend_model_instance.cc:106] "Creating instance pre_0_0 on GPU 0 (7.0) using artifact 'model.py'"
I0919 14:14:06.215171 108294 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/pre/0/model.py triton_python_backend_shm_region_38a18194-0ab7-458e-a829-bf856c58e855 1048576 1048576 108294 /opt/tritonserver/backends/python 336 pre_0_0 DEFAULT"
I0919 14:14:15.361208 108294 python_be.cc:1944] "TRITONBACKEND_ModelInstanceInitialize: instance initialization successful pre_0_0 (device 0)"
I0919 14:14:15.361668 108294 backend_model_instance.cc:783] "Starting backend thread for pre_0_0 at nice 0 on device 0..."
I0919 14:14:15.362163 108294 model_lifecycle.cc:839] "successfully loaded 'pre'"
I0919 14:14:18.675985 108294 python_be.cc:1574] "model configuration:\n{\n    \"name\": \"main_app\",\n    \"platform\": \"\",\n    \"backend\": \"vllm\",\n    \"runtime\": \"model.py\",\n    \"version_policy\": {\n        \"latest\": {\n            \"num_versions\": 1\n        }\n    },\n    \"max_batch_size\": 0,\n    \"input\": [\n        {\n            \"name\": \"text_input\",\n            \"data_type\": \"TYPE_STRING\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        },\n        {\n            \"name\": \"stream\",\n            \"data_type\": \"TYPE_BOOL\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        },\n        {\n            \"name\": \"sampling_parameters\",\n            \"data_type\": \"TYPE_STRING\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        },\n        {\n            \"name\": \"exclude_input_in_output\",\n            \"data_type\": \"TYPE_BOOL\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"output\": [\n        {\n            \"name\": \"text_output\",\n            \"data_type\": \"TYPE_STRING\",\n            \"dims\": [\n                1\n            ],\n            \"label_filename\": \"\",\n            \"is_shape_tensor\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"batch_input\": [],\n    \"batch_output\": [],\n    \"optimization\": {\n        \"priority\": \"PRIORITY_DEFAULT\",\n        \"input_pinned_memory\": {\n            \"enable\": true\n        },\n        \"output_pinned_memory\": {\n            \"enable\": true\n        },\n        \"gather_kernel_buffer_threshold\": 0,\n        \"eager_batching\": false\n    },\n    \"instance_group\": [\n        {\n            \"name\": \"main_app_0\",\n            \"kind\": \"KIND_MODEL\",\n            \"count\": 1,\n            \"gpus\": [],\n            \"secondary_devices\": [],\n            \"profile\": [],\n            \"passive\": false,\n            \"host_policy\": \"\"\n        }\n    ],\n    \"default_model_filename\": \"\",\n    \"cc_model_filenames\": {},\n    \"metric_tags\": {},\n    \"parameters\": {},\n    \"model_warmup\": [],\n    \"model_transaction_policy\": {\n        \"decoupled\": true\n    }\n}"
I0919 14:14:18.692622 108294 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: main_app_0_0 (MODEL device 0)"
I0919 14:14:18.692799 108294 backend_model_instance.cc:77] "Creating instance main_app_0_0 on model-specified devices using artifact ''"
I0919 14:14:18.693963 108294 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/main_app/0/model.py triton_vllm_backend_shm_region_ef233a4c-f0ae-44cc-bad6-33d93b834030 1048576 1048576 108294 /opt/tritonserver/backends/python 336 main_app_0_0 /opt/tritonserver/backends/vllm"
I0919 14:14:29.265212 108294 pb_stub.cc:366] "Failed to initialize Python stub: ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(460): _check_if_gpu_supports_dtype\n  /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(168): init_device\n  /usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py(39): _init_executor\n  /usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py(47): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py(305): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(262): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(835): _init_engine\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(615): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(735): from_engine_args\n  /opt/tritonserver/backends/vllm/model.py(161): init_engine\n  /opt/tritonserver/backends/vllm/model.py(115): initialize\n"
I0919 14:14:30.487874 108294 python_be.cc:2061] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E0919 14:14:30.488397 108294 backend_model.cc:692] "ERROR: Failed to create instance: ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(460): _check_if_gpu_supports_dtype\n  /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(168): init_device\n  /usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py(39): _init_executor\n  /usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py(47): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py(305): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(262): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(835): _init_engine\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(615): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(735): from_engine_args\n  /opt/tritonserver/backends/vllm/model.py(161): init_engine\n  /opt/tritonserver/backends/vllm/model.py(115): initialize\n"
I0919 14:14:30.488583 108294 python_be.cc:1902] "TRITONBACKEND_ModelFinalize: delete model state"
E0919 14:14:30.488700 108294 model_lifecycle.cc:642] "failed to load 'main_app' version 0: Internal: ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(460): _check_if_gpu_supports_dtype\n  /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(168): init_device\n  /usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py(39): _init_executor\n  /usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py(47): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py(305): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(262): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(835): _init_engine\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(615): __init__\n  /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(735): from_engine_args\n  /opt/tritonserver/backends/vllm/model.py(161): init_engine\n  /opt/tritonserver/backends/vllm/model.py(115): initialize\n"
I0919 14:14:30.488825 108294 model_lifecycle.cc:777] "failed to load 'main_app'"
I0919 14:14:30.488961 108294 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0919 14:14:30.489056 108294 server.cc:631] 
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                                        |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| vllm    | /opt/tritonserver/backends/vllm/model.py              | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0919 14:14:30.489177 108294 server.cc:674] 
+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model    | Version | Status                                                                                                                                                                                                                                                                        |
+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| main_app | 0       | UNAVAILABLE: Internal: ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half. |
|          |         |                                                                                                                                                                                                                                                                               |
|          |         | At:                                                                                                                                                                                                                                                                           |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(460): _check_if_gpu_supports_dtype                                                                                                                                                                            |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py(168): init_device                                                                                                                                                                                             |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py(39): _init_executor                                                                                                                                                                                   |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py(47): __init__                                                                                                                                                                                        |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py(305): __init__                                                                                                                                                                                            |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(262): __init__                                                                                                                                                                                      |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(835): _init_engine                                                                                                                                                                                  |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(615): __init__                                                                                                                                                                                      |
|          |         |   /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py(735): from_engine_args                                                                                                                                                                              |
|          |         |   /opt/tritonserver/backends/vllm/model.py(161): init_engine                                                                                                                                                                                                                  |
|          |         |   /opt/tritonserver/backends/vllm/model.py(115): initialize                                                                                                                                                                                                                   |
| pre      | 0       | READY                                                                                                                                                                                                                                                                         |
+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0919 14:14:30.528079 108294 metrics.cc:877] "Collecting metrics for GPU 0: Tesla V100-SXM2-32GB"
I0919 14:14:30.531197 108294 metrics.cc:770] "Collecting CPU metrics"
I0919 14:14:30.531426 108294 tritonserver.cc:2598] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.49.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /data/ebay/notebooks/jibxie/server3/triton_repo                                                                                                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| model_config_name                |                                                                                                                                                                                                                 |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0919 14:14:30.532441 108294 server.cc:305] "Waiting for in-flight requests to complete."
I0919 14:14:30.532532 108294 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0919 14:14:30.532752 108294 server.cc:336] "All models are stopped, unloading models"
I0919 14:14:30.532809 108294 server.cc:345] "Timeout 30: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:14:30.532860 108294 server.cc:351] "pre v0: UNLOADING"
I0919 14:14:30.532863 108294 backend_model_instance.cc:806] "Stopping backend thread for pre_0_0..."
I0919 14:14:30.533209 108294 python_be.cc:2061] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I0919 14:14:31.533058 108294 server.cc:345] "Timeout 29: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:14:31.534177 108294 server.cc:351] "pre v0: UNLOADING"
I0919 14:14:32.323361 108294 python_be.cc:1902] "TRITONBACKEND_ModelFinalize: delete model state"
I0919 14:14:32.323536 108294 model_lifecycle.cc:624] "successfully unloaded 'pre' version 0"
I0919 14:14:32.544229 108294 server.cc:345] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
I0919 14:14:32.612651 108294 backend_manager.cc:138] "unloading backend 'python'"
I0919 14:14:32.612775 108294 python_be.cc:1859] "TRITONBACKEND_Finalize: Start"
I0919 14:14:32.613085 108294 python_be.cc:1864] "TRITONBACKEND_Finalize: End"
I0919 14:14:32.613129 108294 backend_manager.cc:138] "unloading backend 'vllm'"
I0919 14:14:32.613163 108294 python_be.cc:1859] "TRITONBACKEND_Finalize: Start"
I0919 14:14:32.613274 108294 python_be.cc:1864] "TRITONBACKEND_Finalize: End"
I0919 14:23:31.841443 119719 cache_manager.cc:480] "Create CacheManager with cache_dir: '/opt/tritonserver/caches'"
I0919 14:23:32.154140 119719 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7f447c000000' with size 268435456"
I0919 14:23:32.155322 119719 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0919 14:23:32.250938 119719 model_config_utils.cc:716] "Server side auto-completed config: "
name: "ensemble"
platform: "ensemble"
input {
  name: "chat_input"
  data_type: TYPE_STRING
  dims: 1
}
output {
  name: "text_output"
  data_type: TYPE_STRING
  dims: 1
}
ensemble_scheduling {
  step {
    model_name: "pre"
    model_version: -1
    input_map {
      key: "chat_input"
      value: "chat_input"
    }
    output_map {
      key: "text_input"
      value: "text_input"
    }
  }
  step {
    model_name: "main_app"
    model_version: -1
    input_map {
      key: "text_input"
      value: "text_input"
    }
    output_map {
      key: "text_output"
      value: "text_output"
    }
  }
}

E0919 14:23:32.252092 119719 model_repository_manager.cc:1460] "Poll failed for model directory 'ensemble': unexpected platform type 'ensemble' for ensemble"
I0919 14:23:32.315417 119719 model_config_utils.cc:716] "Server side auto-completed config: "
name: "main_app"
input {
  name: "text_input"
  data_type: TYPE_STRING
  dims: 1
}
input {
  name: "stream"
  data_type: TYPE_BOOL
  dims: 1
}
input {
  name: "sampling_parameters"
  data_type: TYPE_STRING
  dims: 1
}
input {
  name: "exclude_input_in_output"
  data_type: TYPE_BOOL
  dims: 1
}
output {
  name: "text_output"
  data_type: TYPE_STRING
  dims: 1
}
instance_group {
  count: 1
  kind: KIND_MODEL
}
backend: "vllm"

I0919 14:23:32.407850 119719 model_config_utils.cc:716] "Server side auto-completed config: "
name: "pre"
input {
  name: "chat_input"
  data_type: TYPE_STRING
  dims: 1
}
output {
  name: "text_input"
  data_type: TYPE_STRING
  dims: 1
}
instance_group {
  count: 1
}
default_model_filename: "model.py"
backend: "python"

I0919 14:23:32.421282 119719 model_lifecycle.cc:472] "loading: main_app:0"
I0919 14:23:32.424834 119719 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0919 14:23:32.424906 119719 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0919 14:23:32.426017 119719 python_be.cc:1618] "'vllm' TRITONBACKEND API version: 1.19"
I0919 14:23:32.426094 119719 python_be.cc:1640] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0919 14:23:32.426139 119719 python_be.cc:1778] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0919 14:23:32.426266 119719 python_be.cc:2075] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0919 14:23:32.427237 119719 python_be.cc:1879] "TRITONBACKEND_ModelInitialize: main_app (version 0)"
I0919 14:23:32.427726 119719 model_config_utils.cc:1941] "ModelConfig 64-bit fields:"
I0919 14:23:32.427765 119719 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::default_priority_level"
I0919 14:23:32.427799 119719 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds"
I0919 14:23:32.427830 119719 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::max_queue_delay_microseconds"
I0919 14:23:32.427858 119719 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::priority_levels"
I0919 14:23:32.427888 119719 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::priority_queue_policy::key"
I0919 14:23:32.427921 119719 model_config_utils.cc:1943] "\tModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds"
I0919 14:23:32.427952 119719 model_config_utils.cc:1943] "\tModelConfig::ensemble_scheduling::step::model_version"
I0919 14:23:32.427983 119719 model_config_utils.cc:1943] "\tModelConfig::input::dims"
I0919 14:23:32.428015 119719 model_config_utils.cc:1943] "\tModelConfig::input::reshape::shape"
I0919 14:23:32.428045 119719 model_config_utils.cc:1943] "\tModelConfig::instance_group::secondary_devices::device_id"
I0919 14:23:32.428075 119719 model_config_utils.cc:1943] "\tModelConfig::model_warmup::inputs::value::dims"
I0919 14:23:32.428105 119719 model_config_utils.cc:1943] "\tModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim"
I0919 14:23:32.428134 119719 model_config_utils.cc:1943] "\tModelConfig::optimization::cuda::graph_spec::input::value::dim"
I0919 14:23:32.428173 119719 model_config_utils.cc:1943] "\tModelConfig::output::dims"
I0919 14:23:32.428231 119719 model_config_utils.cc:1943] "\tModelConfig::output::reshape::shape"
I0919 14:23:32.428264 119719 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::direct::max_queue_delay_microseconds"
I0919 14:23:32.428296 119719 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::max_sequence_idle_microseconds"
I0919 14:23:32.428346 119719 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::oldest::max_queue_delay_microseconds"
I0919 14:23:32.428377 119719 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::state::dims"
I0919 14:23:32.428408 119719 model_config_utils.cc:1943] "\tModelConfig::sequence_batching::state::initial_state::dims"
I0919 14:23:32.428444 119719 model_config_utils.cc:1943] "\tModelConfig::version_policy::specific::versions"
I0919 14:23:32.429498 119719 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/main_app/0/model.py triton_vllm_backend_shm_region_26cc0685-4c46-458a-958a-66ab43ecbd73 1048576 1048576 119719 /opt/tritonserver/backends/python 336 main_app /opt/tritonserver/backends/vllm"
I0919 14:23:32.452943 119719 model_lifecycle.cc:472] "loading: pre:0"
I0919 14:23:32.454982 119719 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0919 14:23:32.455139 119719 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0919 14:23:32.455287 119719 python_be.cc:1618] "'python' TRITONBACKEND API version: 1.19"
I0919 14:23:32.455339 119719 python_be.cc:1640] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0919 14:23:32.455412 119719 python_be.cc:1778] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0919 14:23:32.455620 119719 python_be.cc:2075] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0919 14:23:32.456671 119719 python_be.cc:1879] "TRITONBACKEND_ModelInitialize: pre (version 0)"
I0919 14:23:32.458414 119719 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/pre/0/model.py triton_python_backend_shm_region_6007fcb9-6b4a-4198-9cf3-40068f44cf4b 1048576 1048576 119719 /opt/tritonserver/backends/python 336 pre DEFAULT"
I0919 14:23:34.891998 119719 python_be.cc:1574] "model configuration:\n{\n    \"name\": \"pre\",\n    \"platform\": \"\",\n    \"backend\": \"python\",\n    \"runtime\": \"\",\n    \"version_policy\": {\n        \"latest\": {\n            \"num_versions\": 1\n        }\n    },\n    \"max_batch_size\": 0,\n    \"input\": [\n        {\n            \"name\": \"chat_input\",\n            \"data_type\": \"TYPE_STRING\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"output\": [\n        {\n            \"name\": \"text_input\",\n            \"data_type\": \"TYPE_STRING\",\n            \"dims\": [\n                1\n            ],\n            \"label_filename\": \"\",\n            \"is_shape_tensor\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"batch_input\": [],\n    \"batch_output\": [],\n    \"optimization\": {\n        \"priority\": \"PRIORITY_DEFAULT\",\n        \"input_pinned_memory\": {\n            \"enable\": true\n        },\n        \"output_pinned_memory\": {\n            \"enable\": true\n        },\n        \"gather_kernel_buffer_threshold\": 0,\n        \"eager_batching\": false\n    },\n    \"instance_group\": [\n        {\n            \"name\": \"pre_0\",\n            \"kind\": \"KIND_GPU\",\n            \"count\": 1,\n            \"gpus\": [\n                0\n            ],\n            \"secondary_devices\": [],\n            \"profile\": [],\n            \"passive\": false,\n            \"host_policy\": \"\"\n        }\n    ],\n    \"default_model_filename\": \"model.py\",\n    \"cc_model_filenames\": {},\n    \"metric_tags\": {},\n    \"parameters\": {},\n    \"model_warmup\": []\n}"
I0919 14:23:34.915480 119719 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: pre_0_0 (GPU device 0)"
I0919 14:23:34.916030 119719 backend_model_instance.cc:106] "Creating instance pre_0_0 on GPU 0 (7.0) using artifact 'model.py'"
I0919 14:23:34.917120 119719 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/pre/0/model.py triton_python_backend_shm_region_ba745961-f89e-4f1e-92e9-c0a8e3cff428 1048576 1048576 119719 /opt/tritonserver/backends/python 336 pre_0_0 DEFAULT"
I0919 14:23:43.356136 119719 python_be.cc:1944] "TRITONBACKEND_ModelInstanceInitialize: instance initialization successful pre_0_0 (device 0)"
I0919 14:23:43.356512 119719 backend_model_instance.cc:783] "Starting backend thread for pre_0_0 at nice 0 on device 0..."
I0919 14:23:43.356923 119719 model_lifecycle.cc:839] "successfully loaded 'pre'"
I0919 14:23:46.310218 119719 python_be.cc:1574] "model configuration:\n{\n    \"name\": \"main_app\",\n    \"platform\": \"\",\n    \"backend\": \"vllm\",\n    \"runtime\": \"model.py\",\n    \"version_policy\": {\n        \"latest\": {\n            \"num_versions\": 1\n        }\n    },\n    \"max_batch_size\": 0,\n    \"input\": [\n        {\n            \"name\": \"text_input\",\n            \"data_type\": \"TYPE_STRING\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        },\n        {\n            \"name\": \"stream\",\n            \"data_type\": \"TYPE_BOOL\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        },\n        {\n            \"name\": \"sampling_parameters\",\n            \"data_type\": \"TYPE_STRING\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        },\n        {\n            \"name\": \"exclude_input_in_output\",\n            \"data_type\": \"TYPE_BOOL\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                1\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"output\": [\n        {\n            \"name\": \"text_output\",\n            \"data_type\": \"TYPE_STRING\",\n            \"dims\": [\n                1\n            ],\n            \"label_filename\": \"\",\n            \"is_shape_tensor\": false,\n            \"is_non_linear_format_io\": false\n        }\n    ],\n    \"batch_input\": [],\n    \"batch_output\": [],\n    \"optimization\": {\n        \"priority\": \"PRIORITY_DEFAULT\",\n        \"input_pinned_memory\": {\n            \"enable\": true\n        },\n        \"output_pinned_memory\": {\n            \"enable\": true\n        },\n        \"gather_kernel_buffer_threshold\": 0,\n        \"eager_batching\": false\n    },\n    \"instance_group\": [\n        {\n            \"name\": \"main_app_0\",\n            \"kind\": \"KIND_MODEL\",\n            \"count\": 1,\n            \"gpus\": [],\n            \"secondary_devices\": [],\n            \"profile\": [],\n            \"passive\": false,\n            \"host_policy\": \"\"\n        }\n    ],\n    \"default_model_filename\": \"\",\n    \"cc_model_filenames\": {},\n    \"metric_tags\": {},\n    \"parameters\": {},\n    \"model_warmup\": [],\n    \"model_transaction_policy\": {\n        \"decoupled\": true\n    }\n}"
I0919 14:23:46.321724 119719 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: main_app_0_0 (MODEL device 0)"
I0919 14:23:46.321840 119719 backend_model_instance.cc:77] "Creating instance main_app_0_0 on model-specified devices using artifact ''"
I0919 14:23:46.322860 119719 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /data/ebay/notebooks/jibxie/server3/triton_repo/main_app/0/model.py triton_vllm_backend_shm_region_9d8f3207-5a7a-4e78-a1b8-cdb473c24e6f 1048576 1048576 119719 /opt/tritonserver/backends/python 336 main_app_0_0 /opt/tritonserver/backends/vllm"
I0919 14:24:52.260247 119719 python_be.cc:1944] "TRITONBACKEND_ModelInstanceInitialize: instance initialization successful main_app_0_0 (device 0)"
I0919 14:24:52.262008 119719 backend_model_instance.cc:783] "Starting backend thread for main_app_0_0 at nice 0 on device 0..."
I0919 14:24:52.262373 119719 model_lifecycle.cc:839] "successfully loaded 'main_app'"
I0919 14:24:52.266171 119719 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0919 14:24:52.266247 119719 server.cc:631] 
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                                        |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| vllm    | /opt/tritonserver/backends/vllm/model.py              | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0919 14:24:52.266350 119719 server.cc:674] 
+----------+---------+--------+
| Model    | Version | Status |
+----------+---------+--------+
| main_app | 0       | READY  |
| pre      | 0       | READY  |
+----------+---------+--------+

I0919 14:24:52.289587 119719 metrics.cc:877] "Collecting metrics for GPU 0: Tesla V100-SXM2-32GB"
I0919 14:24:52.291630 119719 metrics.cc:770] "Collecting CPU metrics"
I0919 14:24:52.292000 119719 tritonserver.cc:2598] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.49.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /data/ebay/notebooks/jibxie/server3/triton_repo                                                                                                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| model_config_name                |                                                                                                                                                                                                                 |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0919 14:24:52.292137 119719 server.cc:305] "Waiting for in-flight requests to complete."
I0919 14:24:52.292210 119719 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0919 14:24:52.292399 119719 server.cc:336] "All models are stopped, unloading models"
I0919 14:24:52.292444 119719 server.cc:345] "Timeout 30: Found 2 live models and 0 in-flight non-inference requests"
I0919 14:24:52.292479 119719 backend_model_instance.cc:806] "Stopping backend thread for main_app_0_0..."
I0919 14:24:52.292468 119719 backend_model_instance.cc:806] "Stopping backend thread for pre_0_0..."
I0919 14:24:52.292486 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:52.292561 119719 python_be.cc:2061] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I0919 14:24:52.292619 119719 python_be.cc:2061] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I0919 14:24:52.292631 119719 server.cc:351] "pre v0: UNLOADING"
I0919 14:24:53.292956 119719 server.cc:345] "Timeout 29: Found 2 live models and 0 in-flight non-inference requests"
I0919 14:24:53.293286 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:53.293588 119719 server.cc:351] "pre v0: UNLOADING"
I0919 14:24:53.294885 119719 model.py:558] "[vllm] Issuing finalize to vllm backend"
I0919 14:24:53.928328 119719 python_be.cc:1902] "TRITONBACKEND_ModelFinalize: delete model state"
I0919 14:24:53.928467 119719 model_lifecycle.cc:624] "successfully unloaded 'pre' version 0"
I0919 14:24:54.293779 119719 server.cc:345] "Timeout 28: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:24:54.293978 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:55.294307 119719 server.cc:345] "Timeout 27: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:24:55.294425 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:56.294691 119719 server.cc:345] "Timeout 26: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:24:56.294874 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:57.260756 119719 model.py:272] "[vllm] Shutdown complete"
I0919 14:24:57.260969 119719 model.py:574] "[vllm] Running Garbage Collector on finalize..."
I0919 14:24:57.295167 119719 server.cc:345] "Timeout 25: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:24:57.295347 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:57.418582 119719 model.py:576] "[vllm] Garbage Collector on finalize... done"
I0919 14:24:58.295645 119719 server.cc:345] "Timeout 24: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:24:58.295823 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:59.296175 119719 server.cc:345] "Timeout 23: Found 1 live models and 0 in-flight non-inference requests"
I0919 14:24:59.296445 119719 server.cc:351] "main_app v0: UNLOADING"
I0919 14:24:59.809669 119719 python_be.cc:1902] "TRITONBACKEND_ModelFinalize: delete model state"
I0919 14:24:59.809982 119719 model_lifecycle.cc:624] "successfully unloaded 'main_app' version 0"
I0919 14:25:00.296861 119719 server.cc:345] "Timeout 22: Found 0 live models and 0 in-flight non-inference requests"
I0919 14:25:00.370955 119719 backend_manager.cc:138] "unloading backend 'python'"
I0919 14:25:00.371120 119719 python_be.cc:1859] "TRITONBACKEND_Finalize: Start"
I0919 14:25:00.371457 119719 python_be.cc:1864] "TRITONBACKEND_Finalize: End"
I0919 14:25:00.371512 119719 backend_manager.cc:138] "unloading backend 'vllm'"
I0919 14:25:00.371586 119719 python_be.cc:1859] "TRITONBACKEND_Finalize: Start"
I0919 14:25:00.371664 119719 python_be.cc:1864] "TRITONBACKEND_Finalize: End"

Expected behavior: the ensemble model loads successfully instead of failing with "unexpected platform type 'ensemble' for ensemble".

KrishnanPrash commented 1 month ago

Hello @xiejibing,

Thank you for bringing this to our attention. The 24.08-vllm-python-py3 container does not support ensemble models. Support for ensembles is currently being added to the vLLM container and is tentatively targeted for the 24.10 release.

In the meantime, a temporary workaround could be one of the following:

  1. Using the base container and adding the vLLM backend to it: the 24.08-py3 container supports ensemble models, so you could add the vLLM backend to that container by following these instructions and use it for your ensemble model (see the first sketch below).
  2. Building from source: otherwise, you could build Triton from source and append --backend=ensemble to the build arguments to enable ensemble model support in the vLLM container (see the second sketch below).
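
A rough sketch of option 1, assuming the backend files come from the src/ directory of the vllm_backend repository; the package versions, paths, and exact steps here are illustrative, not the official instructions:

    # inside the 24.08-py3 base container (sketch; versions and paths are assumptions)
    pip install vllm
    mkdir -p /opt/tritonserver/backends/vllm
    # copy the Python backend files from https://github.com/triton-inference-server/vllm_backend (src/)
    cp -r vllm_backend/src/* /opt/tritonserver/backends/vllm/
    tritonserver --model-repository=/path/to/triton_repo

And a sketch of option 2; the only argument confirmed above is --backend=ensemble, the rest is a placeholder for whatever build configuration you already use:

    # sketch only: append --backend=ensemble to your existing build.py invocation
    python3 build.py <your existing build arguments> --backend=ensemble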
xiejibing commented 1 month ago

Thank you for your suggestion! We will use the base container and install vLLM first.

oandreeva-nv commented 1 month ago

I'll close this issue, since the functionality has been merged and is targeting the 24.10 release.