Open DZADSL72-00558 opened 2 months ago
I worked around the initial error by setting PMIX_MCA_gds=hash,
following this page: https://github.com/open-mpi/ompi/issues/6981. Please tell me if this workaround makes sense.
However, after I applied it, the server gets stuck after https://github.com/triton-inference-server/python_backend/blob/main/src/stub_launcher.cc#L253-L256. Please let me know what could be wrong.
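For reference, I set the variable in the environment of the launch, along these lines (the paths and arguments below are placeholders, not my exact command; -x is Open MPI's way of exporting an environment variable to the launched processes):
mpirun -x PMIX_MCA_gds=hash --allow-run-as-root -n 1 tritonserver --model-repository=<model_repo>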
Hi @DZADSL72-00558,
How many H100 GPUs are there?
Could you share the config files? And when you say the server gets stuck after .../src/stub_launcher.cc, which Python backend model gets stuck, the tensorrt_llm one or the preprocessing/postprocessing? Is it possible to share the model.py as well?
Hi Slyne,
Nice to hear from you. I like your profile BTW.
How many H100 GPUs are there?
Since we are using p5 instances, the only answer is 8.
Could you share the config files?
Here is the config for tensorrt_llm:
name: "tensorrt_llm"
backend: "python"
max_batch_size: 0
# # Uncomment this for dynamic_batching
# dynamic_batching {
# max_queue_delay_microseconds: 50000
# }
input [
{
name: "INPUT_ID"
data_type: TYPE_INT32
dims: [ 1, -1 ]
},
{
name: "PROMPT_TABLE"
data_type: TYPE_FP16
dims: [ -1, -1 ]
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "END_ID"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "PAD_ID"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "min_length"
data_type: TYPE_UINT32
dims: [ 1 ]
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
optional: true
},
{
name: "output_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ 1, -1 ]
},
{
name: "request_input_len"
data_type: TYPE_INT32
dims: [ 1, 1 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
parameters: {
key: "engine_dir"
value: {
string_value: "/tmp/models/agm/tensorrt_llm/1/engine"
}
}
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "yes"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
which Python backend gets stuck
It is tensorrt_llm.
Is it possible to share the model.py as well
Hmm, I'm not sure I can share the entire file, but I have attached the initialize function.
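To give a rough idea of its shape without posting the full file: it follows the standard python_backend layout, roughly like the sketch below (names and details are illustrative, not the actual code).
import json
import triton_python_backend_utils as pb_utils  # provided by the Triton python backend

class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the config.pbtxt above, serialized as JSON
        self.model_config = json.loads(args["model_config"])
        # pull the custom parameters (e.g. engine_dir) out of the config
        params = {k: v["string_value"]
                  for k, v in self.model_config.get("parameters", {}).items()}
        self.engine_dir = params.get("engine_dir")
        # ... load the TensorRT-LLM engine and tokenizer here ...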
I think I have some findings that might clarify the issue. The hang seems to be related to this old issue: https://github.com/triton-inference-server/server/issues/3777. In that issue, it was (eventually) discovered that import torch in the model.py caused an invalid pointer free and SIGABRT. (Some searching seems to indicate that this happens when pybind tries to load torch; it's not specific to Triton.)
The SIGABRT (maybe surprisingly) does not seem to have any negative impact when tritonserver is invoked directly, but it does correlate with the hang we see in this issue, when tritonserver is invoked via mpirun (to support TP). In particular, when we load just the postprocessing model (which does not import torch) via the following command:
mpirun --allow-run-as-root -n 1 tritonserver --model-repository=/opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/ --http-port=8002 --grpc-port=8003 --model-load-thread-count=1 --model-control-mode=explicit --load-model=postprocessing --log-verbose=3
then the server seems to start correctly. (Note that I used -n 1 to avoid extraneous issues.) However, I get a hang with the following command (identical to the above but with preprocessing, which does import torch):
mpirun --allow-run-as-root -n 1 tritonserver --model-repository=/opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/ --http-port=8002 --grpc-port=8003 --model-load-thread-count=1 --model-control-mode=explicit --load-model=preprocessing --log-verbose=3
Here is the output up until the hang:
I0812 16:23:57.708810 2689 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0812 16:23:58.156896 2689 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f2e80000000' with size 268435456
I0812 16:23:58.159435 2689 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0812 16:23:58.164541 2689 model_config_utils.cc:680] Server side auto-completed config: name: "preprocessing"
input {
name: "TEXT_TOKENS"
data_type: TYPE_INT32
dims: 1
dims: -1
}
input {
name: "SPEECH_EMBEDDINGS"
data_type: TYPE_FP32
dims: 1
dims: -1
dims: -1
}
input {
name: "MODALITY_SEQUENCE"
data_type: TYPE_UINT32
dims: 1
dims: -1
optional: true
}
output {
name: "INPUT_ID"
data_type: TYPE_INT32
dims: 1
dims: -1
}
output {
name: "PROMPT_TABLE"
data_type: TYPE_FP16
dims: -1
dims: -1
}
output {
name: "END_ID"
data_type: TYPE_INT32
dims: 1
}
output {
name: "PAD_ID"
data_type: TYPE_INT32
dims: 1
}
instance_group {
count: 1
kind: KIND_CPU
}
default_model_filename: "model.py"
parameters {
key: "audio_modality_indicator_token"
value {
string_value: "1"
}
}
parameters {
key: "encoder_projections_bias"
value {
string_value: "continuous_speech_embedding_fn.bias"
}
}
parameters {
key: "encoder_projections_dir"
value {
string_value: "/tmp/models/agm/preprocessing/1/encoder_projection/"
}
}
parameters {
key: "encoder_projections_weight"
value {
string_value: "continuous_speech_embedding_fn.weight"
}
}
parameters {
key: "model_config_path"
value {
string_value: "/tmp/models/agm/preprocessing/1/config.json"
}
}
parameters {
key: "text_modality_indicator_token"
value {
string_value: "0"
}
}
backend: "python"
I0812 16:23:58.164635 2689 model_lifecycle.cc:438] AsyncLoad() 'preprocessing'
I0812 16:23:58.164687 2689 model_lifecycle.cc:469] loading: preprocessing:1
I0812 16:23:58.164743 2689 model_lifecycle.cc:547] CreateModel() 'preprocessing' version 1
I0812 16:23:58.164877 2689 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0812 16:23:58.164906 2689 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
I0812 16:23:58.166882 2689 python_be.cc:2067] 'python' TRITONBACKEND API version: 1.18
I0812 16:23:58.166894 2689 python_be.cc:2089] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0812 16:23:58.166919 2689 python_be.cc:2228] Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30
I0812 16:23:58.167080 2689 python_be.cc:2541] TRITONBACKEND_GetBackendAttribute: setting attributes
I0812 16:23:58.169594 2689 python_be.cc:2319] TRITONBACKEND_ModelInitialize: preprocessing (version 1)
I0812 16:23:58.170153 2689 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0812 16:23:58.170164 2689 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_priority_level
I0812 16:23:58.170169 2689 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0812 16:23:58.170172 2689 model_config_utils.cc:1904] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0812 16:23:58.170176 2689 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_levels
I0812 16:23:58.170179 2689 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::key
I0812 16:23:58.170183 2689 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0812 16:23:58.170188 2689 model_config_utils.cc:1904] ModelConfig::ensemble_scheduling::step::model_version
I0812 16:23:58.170191 2689 model_config_utils.cc:1904] ModelConfig::input::dims
I0812 16:23:58.170195 2689 model_config_utils.cc:1904] ModelConfig::input::reshape::shape
I0812 16:23:58.170198 2689 model_config_utils.cc:1904] ModelConfig::instance_group::secondary_devices::device_id
I0812 16:23:58.170202 2689 model_config_utils.cc:1904] ModelConfig::model_warmup::inputs::value::dims
I0812 16:23:58.170205 2689 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0812 16:23:58.170210 2689 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0812 16:23:58.170214 2689 model_config_utils.cc:1904] ModelConfig::output::dims
I0812 16:23:58.170217 2689 model_config_utils.cc:1904] ModelConfig::output::reshape::shape
I0812 16:23:58.170222 2689 model_config_utils.cc:1904] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0812 16:23:58.170226 2689 model_config_utils.cc:1904] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0812 16:23:58.170229 2689 model_config_utils.cc:1904] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0812 16:23:58.170234 2689 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::dims
I0812 16:23:58.170239 2689 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::initial_state::dims
I0812 16:23:58.170244 2689 model_config_utils.cc:1904] ModelConfig::version_policy::specific::versions
I0812 16:23:58.170851 2689 stub_launcher.cc:253] Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/preprocessing/1/model.py triton_python_backend_shm_region_1 1048576 1048576 2689 /opt/tritonserver/backends/python 336 preprocessing DEFAULT
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
free(): invalid pointer
[ip-172-31-47-85:02696] *** Process received signal ***
[ip-172-31-47-85:02696] Signal: Aborted (6)
[ip-172-31-47-85:02696] Signal code: (-6)
[ip-172-31-47-85:02696] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f27b0616520]
[ip-172-31-47-85:02696] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f27b066a9fc]
[ip-172-31-47-85:02696] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f27b0616476]
[ip-172-31-47-85:02696] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f27b05fc7f3]
[ip-172-31-47-85:02696] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7f27b065d676]
[ip-172-31-47-85:02696] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7f27b0674cfc]
[ip-172-31-47-85:02696] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7f27b0676a44]
[ip-172-31-47-85:02696] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7f27b0679453]
[ip-172-31-47-85:02696] [ 8] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6fd54)[0x5555bbe28d54]
[ip-172-31-47-85:02696] [ 9] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x25de3)[0x5555bbddede3]
[ip-172-31-47-85:02696] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f27b05fdd90]
[ip-172-31-47-85:02696] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f27b05fde40]
[ip-172-31-47-85:02696] [12] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x26b45)[0x5555bbddfb45]
[ip-172-31-47-85:02696] *** End of error message ***
I0812 16:24:03.233015 2689 python_be.cc:2023] model configuration:
{
"name": "preprocessing",
"platform": "",
"backend": "python",
"runtime": "",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 0,
"input": [
{
"name": "TEXT_TOKENS",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
1,
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "SPEECH_EMBEDDINGS",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
1,
-1,
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "MODALITY_SEQUENCE",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
1,
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
}
],
"output": [
{
"name": "INPUT_ID",
"data_type": "TYPE_INT32",
"dims": [
1,
-1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "PROMPT_TABLE",
"data_type": "TYPE_FP16",
"dims": [
-1,
-1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "END_ID",
"data_type": "TYPE_INT32",
"dims": [
1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "PAD_ID",
"data_type": "TYPE_INT32",
"dims": [
1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "preprocessing_0",
"kind": "KIND_CPU",
"count": 1,
"gpus": [],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.py",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"encoder_projections_dir": {
"string_value": "/tmp/models/agm/preprocessing/1/encoder_projection/"
},
"encoder_projections_bias": {
"string_value": "continuous_speech_embedding_fn.bias"
},
"audio_modality_indicator_token": {
"string_value": "1"
},
"model_config_path": {
"string_value": "/tmp/models/agm/preprocessing/1/config.json"
},
"encoder_projections_weight": {
"string_value": "continuous_speech_embedding_fn.weight"
},
"text_modality_indicator_token": {
"string_value": "0"
}
},
"model_warmup": []
}
I0812 16:24:03.233332 2689 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0812 16:24:03.233373 2689 backend_model_instance.cc:69] Creating instance preprocessing_0_0 on CPU using artifact 'model.py'
I0812 16:24:03.234424 2689 stub_launcher.cc:253] Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/preprocessing/1/model.py triton_python_backend_shm_region_2 1048576 1048576 2689 /opt/tritonserver/backends/python 336 preprocessing_0_0 DEFAULT
Looking into the Python backend stub code, I did notice that there is some process forking and IPC going on; maybe some kind of race condition gets triggered when running under MPI?
Actually, torch is not the culprit; it is tensorrt_llm.profiler. I am able to reproduce using the add_sub example here: https://github.com/triton-inference-server/python_backend/tree/r23.12. I just add import tensorrt_llm.profiler to the model.py and run:
mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/models --log-verbose=3
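(For reference, the modified model.py is just the stock add_sub example with the extra import at the top; roughly, assuming the standard example layout:)
import tensorrt_llm.profiler  # the only change relative to the stock example

import json
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            in1 = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()
            out0 = pb_utils.Tensor("OUTPUT0", (in0 + in1).astype(np.float32))
            out1 = pb_utils.Tensor("OUTPUT1", (in0 - in1).astype(np.float32))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out0, out1]))
        return responses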
I then get the hang with the following logs:
I0812 16:46:39.824892 4370 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0812 16:46:40.276042 4370 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fcfa0000000' with size 268435456
I0812 16:46:40.278472 4370 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0812 16:46:40.283647 4370 model_config_utils.cc:680] Server side auto-completed config: name: "add_sub"
input {
name: "INPUT0"
data_type: TYPE_FP32
dims: 4
}
input {
name: "INPUT1"
data_type: TYPE_FP32
dims: 4
}
output {
name: "OUTPUT0"
data_type: TYPE_FP32
dims: 4
}
output {
name: "OUTPUT1"
data_type: TYPE_FP32
dims: 4
}
instance_group {
kind: KIND_CPU
}
default_model_filename: "model.py"
backend: "python"
I0812 16:46:40.283706 4370 model_lifecycle.cc:438] AsyncLoad() 'add_sub'
I0812 16:46:40.283747 4370 model_lifecycle.cc:469] loading: add_sub:1
I0812 16:46:40.283827 4370 model_lifecycle.cc:547] CreateModel() 'add_sub' version 1
I0812 16:46:40.283968 4370 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0812 16:46:40.283995 4370 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
I0812 16:46:40.285867 4370 python_be.cc:2067] 'python' TRITONBACKEND API version: 1.18
I0812 16:46:40.285881 4370 python_be.cc:2089] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0812 16:46:40.285909 4370 python_be.cc:2228] Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30
I0812 16:46:40.286086 4370 python_be.cc:2541] TRITONBACKEND_GetBackendAttribute: setting attributes
I0812 16:46:40.288612 4370 python_be.cc:2319] TRITONBACKEND_ModelInitialize: add_sub (version 1)
I0812 16:46:40.289090 4370 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0812 16:46:40.289100 4370 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_priority_level
I0812 16:46:40.289104 4370 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0812 16:46:40.289109 4370 model_config_utils.cc:1904] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0812 16:46:40.289112 4370 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_levels
I0812 16:46:40.289117 4370 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::key
I0812 16:46:40.289121 4370 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0812 16:46:40.289124 4370 model_config_utils.cc:1904] ModelConfig::ensemble_scheduling::step::model_version
I0812 16:46:40.289128 4370 model_config_utils.cc:1904] ModelConfig::input::dims
I0812 16:46:40.289131 4370 model_config_utils.cc:1904] ModelConfig::input::reshape::shape
I0812 16:46:40.289135 4370 model_config_utils.cc:1904] ModelConfig::instance_group::secondary_devices::device_id
I0812 16:46:40.289138 4370 model_config_utils.cc:1904] ModelConfig::model_warmup::inputs::value::dims
I0812 16:46:40.289142 4370 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0812 16:46:40.289145 4370 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0812 16:46:40.289149 4370 model_config_utils.cc:1904] ModelConfig::output::dims
I0812 16:46:40.289152 4370 model_config_utils.cc:1904] ModelConfig::output::reshape::shape
I0812 16:46:40.289155 4370 model_config_utils.cc:1904] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0812 16:46:40.289159 4370 model_config_utils.cc:1904] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0812 16:46:40.289163 4370 model_config_utils.cc:1904] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0812 16:46:40.289167 4370 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::dims
I0812 16:46:40.289170 4370 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::initial_state::dims
I0812 16:46:40.289173 4370 model_config_utils.cc:1904] ModelConfig::version_policy::specific::versions
I0812 16:46:40.289739 4370 stub_launcher.cc:253] Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_1 1048576 1048576 4370 /opt/tritonserver/backends/python 336 add_sub DEFAULT
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
free(): invalid pointer
[ip-172-31-47-85:04380] *** Process received signal ***
[ip-172-31-47-85:04380] Signal: Aborted (6)
[ip-172-31-47-85:04380] Signal code: (-6)
[ip-172-31-47-85:04380] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f50bd016520]
[ip-172-31-47-85:04380] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f50bd06a9fc]
[ip-172-31-47-85:04380] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f50bd016476]
[ip-172-31-47-85:04380] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f50bcffc7f3]
[ip-172-31-47-85:04380] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7f50bd05d676]
[ip-172-31-47-85:04380] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7f50bd074cfc]
[ip-172-31-47-85:04380] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7f50bd076a44]
[ip-172-31-47-85:04380] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7f50bd079453]
[ip-172-31-47-85:04380] [ 8] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6fd54)[0x55d6e7b2ad54]
[ip-172-31-47-85:04380] [ 9] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x25de3)[0x55d6e7ae0de3]
[ip-172-31-47-85:04380] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f50bcffdd90]
[ip-172-31-47-85:04380] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f50bcffde40]
[ip-172-31-47-85:04380] [12] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x26b45)[0x55d6e7ae1b45]
[ip-172-31-47-85:04380] *** End of error message ***
I0812 16:46:45.340856 4370 python_be.cc:2023] model configuration:
{
"name": "add_sub",
"platform": "",
"backend": "python",
"runtime": "",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 0,
"input": [
{
"name": "INPUT0",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
4
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "INPUT1",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
4
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
4
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "OUTPUT1",
"data_type": "TYPE_FP32",
"dims": [
4
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "add_sub_0",
"kind": "KIND_CPU",
"count": 1,
"gpus": [],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.py",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
I0812 16:46:45.341147 4370 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: add_sub_0_0 (CPU device 0)
I0812 16:46:45.341185 4370 backend_model_instance.cc:69] Creating instance add_sub_0_0 on CPU using artifact 'model.py'
I0812 16:46:45.342043 4370 stub_launcher.cc:253] Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_2 1048576 1048576 4370 /opt/tritonserver/backends/python 336 add_sub_0_0 DEFAULT
@Tabrizian @tanmayv25 Any ideas?
Sorry, I realized I was using our own modified container in my runs above, so I tried again with nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 (as @DZADSL72-00558 did). The output looks different (there is no SIGABRT in the logs), but the hang is still there:
# mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/models --log-verbose=3
I0812 19:08:32.367035 2315 cache_manager.cc:480] "Create CacheManager with cache_dir: '/opt/tritonserver/caches'"
I0812 19:08:35.061925 2315 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f0bb2000000' with size 268435456"
I0812 19:08:35.097474 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0812 19:08:35.097487 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0812 19:08:35.097493 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0812 19:08:35.097497 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0812 19:08:35.097502 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 4 with size 67108864"
I0812 19:08:35.097506 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 5 with size 67108864"
I0812 19:08:35.097511 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 6 with size 67108864"
I0812 19:08:35.097515 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 7 with size 67108864"
I0812 19:08:36.536436 2315 model_config_utils.cc:681] "Server side auto-completed config: "
name: "add_sub"
input {
name: "INPUT0"
data_type: TYPE_FP32
dims: 4
}
input {
name: "INPUT1"
data_type: TYPE_FP32
dims: 4
}
output {
name: "OUTPUT0"
data_type: TYPE_FP32
dims: 4
}
output {
name: "OUTPUT1"
data_type: TYPE_FP32
dims: 4
}
instance_group {
kind: KIND_CPU
}
default_model_filename: "model.py"
backend: "python"
I0812 19:08:36.536499 2315 model_lifecycle.cc:441] "AsyncLoad() 'add_sub'"
I0812 19:08:36.536538 2315 model_lifecycle.cc:472] "loading: add_sub:1"
I0812 19:08:36.536596 2315 model_lifecycle.cc:550] "CreateModel() 'add_sub' version 1"
I0812 19:08:36.536715 2315 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0812 19:08:36.536736 2315 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0812 19:08:36.537937 2315 python_be.cc:2099] "'python' TRITONBACKEND API version: 1.19"
I0812 19:08:36.537951 2315 python_be.cc:2121] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0812 19:08:36.537971 2315 python_be.cc:2259] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0812 19:08:36.538131 2315 python_be.cc:2582] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0812 19:08:36.558044 2315 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: add_sub (version 1)"
I0812 19:08:36.558491 2315 model_config_utils.cc:1902] "ModelConfig 64-bit fields:"
I0812 19:08:36.558505 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_priority_level"
I0812 19:08:36.558510 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds"
I0812 19:08:36.558514 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::max_queue_delay_microseconds"
I0812 19:08:36.558519 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_levels"
I0812 19:08:36.558524 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::key"
I0812 19:08:36.558529 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds"
I0812 19:08:36.558534 2315 model_config_utils.cc:1904] "\tModelConfig::ensemble_scheduling::step::model_version"
I0812 19:08:36.558538 2315 model_config_utils.cc:1904] "\tModelConfig::input::dims"
I0812 19:08:36.558542 2315 model_config_utils.cc:1904] "\tModelConfig::input::reshape::shape"
I0812 19:08:36.558547 2315 model_config_utils.cc:1904] "\tModelConfig::instance_group::secondary_devices::device_id"
I0812 19:08:36.558553 2315 model_config_utils.cc:1904] "\tModelConfig::model_warmup::inputs::value::dims"
I0812 19:08:36.558557 2315 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim"
I0812 19:08:36.558562 2315 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::input::value::dim"
I0812 19:08:36.558566 2315 model_config_utils.cc:1904] "\tModelConfig::output::dims"
I0812 19:08:36.558570 2315 model_config_utils.cc:1904] "\tModelConfig::output::reshape::shape"
I0812 19:08:36.558575 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::direct::max_queue_delay_microseconds"
I0812 19:08:36.558579 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::max_sequence_idle_microseconds"
I0812 19:08:36.558583 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::oldest::max_queue_delay_microseconds"
I0812 19:08:36.558588 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::dims"
I0812 19:08:36.558592 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::initial_state::dims"
I0812 19:08:36.558596 2315 model_config_utils.cc:1904] "\tModelConfig::version_policy::specific::versions"
I0812 19:08:36.559159 2315 stub_launcher.cc:385] "Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_fb2152c7-cf8e-4d73-a098-1112d6be7786 1048576 1048576 2315 /opt/tritonserver/backends/python 336 add_sub DEFAULT"
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
I0812 19:08:40.998141 2315 python_be.cc:2055] "model configuration:\n{\n \"name\": \"add_sub\",\n \"platform\": \"\",\n \"backend\": \"python\",\n \"runtime\": \"\",\n \"version_policy\": {\n \"latest\": {\n \"num_versions\": 1\n }\n },\n \"max_batch_size\": 0,\n \"input\": [\n {\n \"name\": \"INPUT0\",\n \"data_type\": \"TYPE_FP32\",\n \"format\": \"FORMAT_NONE\",\n \"dims\": [\n 4\n ],\n \"is_shape_tensor\": false,\n \"allow_ragged_batch\": false,\n \"optional\": false\n },\n {\n \"name\": \"INPUT1\",\n \"data_type\": \"TYPE_FP32\",\n \"format\": \"FORMAT_NONE\",\n \"dims\": [\n 4\n ],\n \"is_shape_tensor\": false,\n \"allow_ragged_batch\": false,\n \"optional\": false\n }\n ],\n \"output\": [\n {\n \"name\": \"OUTPUT0\",\n \"data_type\": \"TYPE_FP32\",\n \"dims\": [\n 4\n ],\n \"label_filename\": \"\",\n \"is_shape_tensor\": false\n },\n {\n \"name\": \"OUTPUT1\",\n \"data_type\": \"TYPE_FP32\",\n \"dims\": [\n 4\n ],\n \"label_filename\": \"\",\n \"is_shape_tensor\": false\n }\n ],\n \"batch_input\": [],\n \"batch_output\": [],\n \"optimization\": {\n \"priority\": \"PRIORITY_DEFAULT\",\n \"input_pinned_memory\": {\n \"enable\": true\n },\n \"output_pinned_memory\": {\n \"enable\": true\n },\n \"gather_kernel_buffer_threshold\": 0,\n \"eager_batching\": false\n },\n \"instance_group\": [\n {\n \"name\": \"add_sub_0\",\n \"kind\": \"KIND_CPU\",\n \"count\": 1,\n \"gpus\": [],\n \"secondary_devices\": [],\n \"profile\": [],\n \"passive\": false,\n \"host_policy\": \"\"\n }\n ],\n \"default_model_filename\": \"model.py\",\n \"cc_model_filenames\": {},\n \"metric_tags\": {},\n \"parameters\": {},\n \"model_warmup\": []\n}"
I0812 19:08:40.998555 2315 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: add_sub_0_0 (CPU device 0)"
I0812 19:08:40.998593 2315 backend_model_instance.cc:69] "Creating instance add_sub_0_0 on CPU using artifact 'model.py'"
I0812 19:08:40.999266 2315 stub_launcher.cc:385] "Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_4ece1248-92b5-467e-a857-bfaa256bbdf2 1048576 1048576 2315 /opt/tritonserver/backends/python 336 add_sub_0_0 DEFAULT"
I have a workaround: adding --disable-auto-complete-config to the tritonserver invocation avoids the hang (and also makes the SIGABRT go away in our custom container). This unblocks us, but I will leave it to the NVIDIA side to decide whether to close this or pursue the root cause.
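For reference, the invocation that avoids the hang for me is the same add_sub command as above with the flag added:
mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/models --disable-auto-complete-config --log-verbose=3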
Hi @snjoseph ,
I tried adding import tensorrt_llm.profiler to the add_sub example and ran the command:
mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/add_sub --log-verbose=3
It doesn't hang there, but it gives me the error below:
orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCES
The docker container is the same one mentioned above.
I've tested on A100 80GB and NVIDIA H100 80GB HBM3. Adding --disable-auto-complete-config does solve the above issue.
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Error free.
Actual behavior
Runs into an error.
And this is the initialize function:
Additional notes
Could anything be wrong in our code? I am using an ensemble model.