triton-inference-server / tensorrtllm_backend

Server stuck after `Starting Python backend stub` #553

DZADSL72-00558 commented 2 months ago

Expected behavior

Error-free server startup.

Actual behavior

The server runs into the errors below:

I0801 06:25:54.172709 2033 cache_manager.cc:480] "Create CacheManager with cache_dir: '/opt/tritonserver/caches'"
I0801 06:25:54.174464 2034 cache_manager.cc:480] "Create CacheManager with cache_dir: '/opt/tritonserver/caches'"
I0801 06:25:57.316015 2033 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f8eb2000000' with size 268435456"
I0801 06:25:57.316177 2034 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f7c70000000' with size 268435456"
I0801 06:25:57.385609 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0801 06:25:57.385626 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0801 06:25:57.385632 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0801 06:25:57.385636 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0801 06:25:57.385640 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 4 with size 67108864"
I0801 06:25:57.385645 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 5 with size 67108864"
I0801 06:25:57.385649 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 6 with size 67108864"
I0801 06:25:57.385654 2033 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 7 with size 67108864"
I0801 06:25:57.386515 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0801 06:25:57.386554 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0801 06:25:57.386560 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0801 06:25:57.386564 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0801 06:25:57.386571 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 4 with size 67108864"
I0801 06:25:57.386575 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 5 with size 67108864"
I0801 06:25:57.386580 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 6 with size 67108864"
I0801 06:25:57.386585 2034 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 7 with size 67108864"

I0801 06:25:59.826954 2034 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I0801 06:25:59.826994 2034 model_lifecycle.cc:472] "loading: preprocessing:1"
I0801 06:25:59.827030 2034 model_lifecycle.cc:472] "loading: postprocessing:1"
I0801 06:25:59.827157 2034 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0801 06:25:59.827159 2034 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0801 06:25:59.827204 2034 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0801 06:25:59.827197 2034 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0801 06:25:59.828675 2034 python_be.cc:2099] "'python' TRITONBACKEND API version: 1.19"
I0801 06:25:59.828690 2034 python_be.cc:2121] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0801 06:25:59.828715 2034 python_be.cc:2259] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0801 06:25:59.828862 2034 python_be.cc:2582] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0801 06:25:59.884107 2034 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: preprocessing (version 1)"
I0801 06:25:59.884705 2034 model_config_utils.cc:1902] "ModelConfig 64-bit fields:"
I0801 06:25:59.884716 2034 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_priority_level"
I0801 06:25:59.884722 2034 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds"
I0801 06:25:59.884728 2034 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::max_queue_delay_microseconds"
I0801 06:25:59.884733 2034 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_levels"
I0801 06:25:59.884737 2034 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::key"
I0801 06:25:59.884742 2034 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds"
I0801 06:25:59.884748 2034 model_config_utils.cc:1904] "\tModelConfig::ensemble_scheduling::step::model_version"
I0801 06:25:59.884753 2034 model_config_utils.cc:1904] "\tModelConfig::input::dims"
I0801 06:25:59.884758 2034 model_config_utils.cc:1904] "\tModelConfig::input::reshape::shape"
I0801 06:25:59.884763 2034 model_config_utils.cc:1904] "\tModelConfig::instance_group::secondary_devices::device_id"
I0801 06:25:59.884768 2034 model_config_utils.cc:1904] "\tModelConfig::model_warmup::inputs::value::dims"
I0801 06:25:59.884772 2034 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim"
I0801 06:25:59.884778 2034 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::input::value::dim"
I0801 06:25:59.884783 2034 model_config_utils.cc:1904] "\tModelConfig::output::dims"
I0801 06:25:59.884788 2034 model_config_utils.cc:1904] "\tModelConfig::output::reshape::shape"
I0801 06:25:59.884792 2034 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::direct::max_queue_delay_microseconds"
I0801 06:25:59.884798 2034 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::max_sequence_idle_microseconds"
I0801 06:25:59.884802 2034 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::oldest::max_queue_delay_microseconds"
I0801 06:25:59.884807 2034 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::dims"
I0801 06:25:59.884813 2034 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::initial_state::dims"
I0801 06:25:59.884819 2034 model_config_utils.cc:1904] "\tModelConfig::version_policy::specific::versions"
I0801 06:25:59.885425 2034 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/preprocessing/1/model.py triton_python_backend_shm_region_d2867432-e05c-45e3-a9ca-5981e970aa74 1048576 1048576 2034 /opt/tritonserver/backends/python 336 preprocessing DEFAULT"
I0801 06:25:59.912834 2034 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
I0801 06:25:59.913639 2034 python_be.cc:2000] "Input tensors can be both in CPU and GPU. FORCE_CPU_ONLY_INPUT_TENSORS is off."
I0801 06:25:59.914317 2034 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/tensorrt_llm/1/model.py triton_python_backend_shm_region_28ea182c-75b6-4bca-92a1-3f822268bab3 1048576 1048576 2034 /opt/tritonserver/backends/python 336 tensorrt_llm DEFAULT"
I0801 06:25:59.944784 2034 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: postprocessing (version 1)"
I0801 06:25:59.945925 2034 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/postprocessing/1/model.py triton_python_backend_shm_region_0c267e93-f463-4612-864a-3da0479ba100 1048576 1048576 2034 /opt/tritonserver/backends/python 336 postprocessing DEFAULT"
I0801 06:26:00.058782 2033 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I0801 06:26:00.058892 2033 model_lifecycle.cc:472] "loading: preprocessing:1"
I0801 06:26:00.058996 2033 model_lifecycle.cc:472] "loading: postprocessing:1"
I0801 06:26:00.061899 2033 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0801 06:26:00.061988 2033 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0801 06:26:00.063413 2033 python_be.cc:2099] "'python' TRITONBACKEND API version: 1.19"
I0801 06:26:00.063487 2033 python_be.cc:2121] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0801 06:26:00.063599 2033 python_be.cc:2259] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0801 06:26:00.063894 2033 python_be.cc:2582] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0801 06:26:00.064020 2033 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0801 06:26:00.063948 2033 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0801 06:26:00.094162 2033 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
I0801 06:26:00.094876 2033 model_config_utils.cc:1902] "ModelConfig 64-bit fields:"
I0801 06:26:00.094896 2033 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_priority_level"
I0801 06:26:00.094902 2033 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds"
I0801 06:26:00.094908 2033 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::max_queue_delay_microseconds"
I0801 06:26:00.094914 2033 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_levels"
I0801 06:26:00.094920 2033 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::key"
I0801 06:26:00.094925 2033 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds"
I0801 06:26:00.094931 2033 model_config_utils.cc:1904] "\tModelConfig::ensemble_scheduling::step::model_version"
I0801 06:26:00.094937 2033 model_config_utils.cc:1904] "\tModelConfig::input::dims"
I0801 06:26:00.094942 2033 model_config_utils.cc:1904] "\tModelConfig::input::reshape::shape"
I0801 06:26:00.094948 2033 model_config_utils.cc:1904] "\tModelConfig::instance_group::secondary_devices::device_id"
I0801 06:26:00.094954 2033 model_config_utils.cc:1904] "\tModelConfig::model_warmup::inputs::value::dims"
I0801 06:26:00.094959 2033 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim"
I0801 06:26:00.094965 2033 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::input::value::dim"
I0801 06:26:00.094971 2033 model_config_utils.cc:1904] "\tModelConfig::output::dims"
I0801 06:26:00.095012 2033 model_config_utils.cc:1904] "\tModelConfig::output::reshape::shape"
I0801 06:26:00.095019 2033 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::direct::max_queue_delay_microseconds"
I0801 06:26:00.095025 2033 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::max_sequence_idle_microseconds"
I0801 06:26:00.095055 2033 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::oldest::max_queue_delay_microseconds"
I0801 06:26:00.095127 2033 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::dims"
I0801 06:26:00.095135 2033 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::initial_state::dims"
I0801 06:26:00.095141 2033 model_config_utils.cc:1904] "\tModelConfig::version_policy::specific::versions"
I0801 06:26:00.095381 2033 python_be.cc:2000] "Input tensors can be both in CPU and GPU. FORCE_CPU_ONLY_INPUT_TENSORS is off."
I0801 06:26:00.095994 2033 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/tensorrt_llm/1/model.py triton_python_backend_shm_region_7c4425ee-c5d4-43ef-b8ec-e7780eec64d6 1048576 1048576 2033 /opt/tritonserver/backends/python 336 tensorrt_llm DEFAULT"
I0801 06:26:00.101748 2033 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: postprocessing (version 1)"
I0801 06:26:00.102797 2033 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/postprocessing/1/model.py triton_python_backend_shm_region_e43894b6-d7e0-481a-a676-a16c87c83550 1048576 1048576 2033 /opt/tritonserver/backends/python 336 postprocessing DEFAULT"
I0801 06:26:00.103224 2033 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: preprocessing (version 1)"
I0801 06:26:00.104251 2033 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/preprocessing/1/model.py triton_python_backend_shm_region_cbfe1f0b-dbe6-436c-827c-dba0154d94e7 1048576 1048576 2033 /opt/tritonserver/backends/python 336 preprocessing DEFAULT"
I0801 06:26:01.455959 2034 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0801 06:26:01.456000 2034 backend_model_instance.cc:69] "Creating instance postprocessing_0_0 on CPU using artifact 'model.py'"
I0801 06:26:01.456074 2034 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_1 (CPU device 0)"
I0801 06:26:01.456119 2034 backend_model_instance.cc:69] "Creating instance postprocessing_0_1 on CPU using artifact 'model.py'"
I0801 06:26:01.456669 2034 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/postprocessing/1/model.py triton_python_backend_shm_region_5f3e4092-78c1-44b7-8d76-b592365cfdae 1048576 1048576 2034 /opt/tritonserver/backends/python 336 postprocessing_0_0 DEFAULT"
I0801 06:26:01.456775 2034 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/postprocessing/1/model.py triton_python_backend_shm_region_d2edec3d-7cb6-48b5-8080-503741f7da61 1048576 1048576 2034 /opt/tritonserver/backends/python 336 postprocessing_0_1 DEFAULT"
I0801 06:26:01.637527 2033 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0801 06:26:01.637563 2033 backend_model_instance.cc:69] "Creating instance postprocessing_0_0 on CPU using artifact 'model.py'"
I0801 06:26:01.637598 2033 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_1 (CPU device 0)"
I0801 06:26:01.637622 2033 backend_model_instance.cc:69] "Creating instance postprocessing_0_1 on CPU using artifact 'model.py'"
I0801 06:26:01.638312 2033 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/postprocessing/1/model.py triton_python_backend_shm_region_cf84be37-0c35-4cf6-bd96-38a62172ab80 1048576 1048576 2033 /opt/tritonserver/backends/python 336 postprocessing_0_1 DEFAULT"
I0801 06:26:01.638343 2033 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/configuration/agm/postprocessing/1/model.py triton_python_backend_shm_region_cc11a1a7-ba81-491e-b65f-2138989f60f4 1048576 1048576 2033 /opt/tritonserver/backends/python 336 postprocessing_0_0 DEFAULT"
I0801 06:26:01.653348 2034 python_be.cc:2425] "TRITONBACKEND_ModelInstanceInitialize: instance initialization successful postprocessing_0_0 (device 0)"
I0801 06:26:01.653540 2034 backend_model_instance.cc:772] "Starting backend thread for postprocessing_0_0 at nice 0 on device 0..."
I0801 06:26:01.677544 2034 python_be.cc:2425] "TRITONBACKEND_ModelInstanceInitialize: instance initialization successful postprocessing_0_1 (device 0)"
I0801 06:26:01.677657 2034 backend_model_instance.cc:772] "Starting backend thread for postprocessing_0_1 at nice 0 on device 0..."
I0801 06:26:01.677837 2034 model_lifecycle.cc:838] "successfully loaded 'postprocessing'"
I0801 06:26:01.827707 2033 python_be.cc:2425] "TRITONBACKEND_ModelInstanceInitialize: instance initialization successful postprocessing_0_0 (device 0)"
I0801 06:26:01.827893 2033 backend_model_instance.cc:772] "Starting backend thread for postprocessing_0_0 at nice 0 on device 0..."
I0801 06:26:01.861089 2033 python_be.cc:2425] "TRITONBACKEND_ModelInstanceInitialize: instance initialization successful postprocessing_0_1 (device 0)"
I0801 06:26:01.861263 2033 backend_model_instance.cc:772] "Starting backend thread for postprocessing_0_1 at nice 0 on device 0..."
I0801 06:26:01.861464 2033 model_lifecycle.cc:838] "successfully loaded 'postprocessing'"
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[ip-172-31-47-85.us-east-2.compute.internal:02328] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[ip-172-31-47-85.us-east-2.compute.internal:02330] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[ip-172-31-47-85.us-east-2.compute.internal:02028] 1 more process has sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[ip-172-31-47-85.us-east-2.compute.internal:02028] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-172-31-47-85.us-east-2.compute.internal:02028] 1 more process has sent help message help-orte-runtime / orte_init:startup:internal-failure
[ip-172-31-47-85.us-east-2.compute.internal:02028] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure

And this is the `initialize` function:

    def initialize(self, args):

        model_config = json.loads(args['model_config'])
        self.engine_dir = model_config['parameters']['engine_dir']['string_value']
        self.comm = mpi_comm()
        self.rank = mpi_rank()
        self.trtllm_version = version.parse(tensorrt_llm.__version__)

        self.exclude_input_from_output = get_parameter(
            model_config, "exclude_input_in_output", bool)
        self.input_len_dtype = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(model_config, "request_input_len")["data_type"]
        )
        # os.environ["CUDA_VISIBLE_DEVICES"] = device_id
        self.runner = ModelRunner.from_dir(engine_dir=self.engine_dir,
                                           rank=self.rank,
                                           debug_mode=False)

        if self.rank != 0:
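            # Assumed intent of this loop: non-zero ranks spin on execute() so they
            # can take part in rank 0's tensor-parallel generation calls.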
            while True:
                self.execute([None])

additional notes

Could anything be wrong in our code?

I am using an ensemble model; here is the model repository layout:

(pytorch) [ec2-user@ip-172-31-47-85 agm]$ tree .
.
├── agm_model
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   ├── model.py
│   │   └── __pycache__
│   │       └── model.cpython-310.pyc
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   ├── model.py
│   │   └── __pycache__
│   │       └── model.cpython-310.pyc
│   └── config.pbtxt
└── tensorrt_llm
    ├── 1
    │   ├── model.py
    │   └── __pycache__
    │       └── model.cpython-310.pyc
    └── config.pbtxt
DZADSL72-00558 commented 2 months ago

I resolved the MPI errors by setting PMIX_MCA_gds=hash, following this page: https://github.com/open-mpi/ompi/issues/6981. Please tell me if my solution makes sense.
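
Roughly, the workaround just sets the variable before launching the server. The command below is only an illustration (the model repository path is taken from the logs above, and the world size and other flags depend on your setup):

export PMIX_MCA_gds=hash
mpirun --allow-run-as-root -n <world_size> tritonserver --model-repository=/tmp/configuration/agm --log-verbose=3

Passing it through mpirun with -x PMIX_MCA_gds=hash should also work.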

After I applied this workaround, the server gets stuck after https://github.com/triton-inference-server/python_backend/blob/main/src/stub_launcher.cc#L253-L256. Please let me know what could be wrong.

Slyne commented 2 months ago

Hi @DZADSL72-00558, how many H100 GPUs are there? Could you share the config files? Also, when the server gets stuck after .../src/stub_launcher.cc, which Python backend model gets stuck: the tensorrt_llm one, or preprocessing/postprocessing? Is it possible to share the model.py as well?

DZADSL72-00558 commented 2 months ago

Hi Slyne,

Nice to hear from you. I like your profile BTW.

How many H100 GPUs are there?

Since we are using a p5 instance, the only answer is 8.

Could you share the config files?

Here is the config for tensorrt_llm:

name: "tensorrt_llm"
backend: "python"
max_batch_size: 0

# # Uncomment this for dynamic_batching
# dynamic_batching {
#    max_queue_delay_microseconds: 50000
# }

input [
  {
    name: "INPUT_ID"
    data_type: TYPE_INT32
    dims: [ 1, -1 ]
  },
  {
    name: "PROMPT_TABLE"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "END_ID"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "PAD_ID"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "output_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ 1, -1 ]
  },
  {
    name: "request_input_len"
    data_type: TYPE_INT32
    dims: [ 1, 1 ]
  }
]
instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
parameters: {
  key: "engine_dir"
  value: {
    string_value: "/tmp/models/agm/tensorrt_llm/1/engine"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "yes"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}

Which Python backend gets stuck?

It is tensorrt_llm.

Is it possible to share the model.py as well?

Hmmm, not sure if I can share the entire file, but I have included the initialize function above.

snjoseph commented 2 months ago

I think I have some findings that might clarify the issue. The hang seems to be related to this old issue: https://github.com/triton-inference-server/server/issues/3777. In that issue, it was (eventually) discovered that import torch in the model.py caused an invalid pointer free and SIGABRT. (Some searching seems to indicate that this happens when pybind tries to load torch; it's not specific to Triton.)

The SIGABRT (maybe surprisingly) does not seem to have any negative impact when tritonserver is directly invoked, but it does correlate with the hang we see in this issue, when tritonserver is invoked via mpirun (to support TP). In particular, when we load just the postprocessing model (which does not import torch) via the following command:

mpirun --allow-run-as-root -n 1 tritonserver --model-repository=/opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/ --http-port=8002 --grpc-port=8003 --model-load-thread-count=1 --model-control-mode=explicit --load-model=postprocessing --log-verbose=3

then the server seems to start correctly. (Note that I used -n 1 to avoid extraneous issues.) However, I get a hang with the following command (identical to the above but with preprocessing, which does import torch):

mpirun --allow-run-as-root -n 1 tritonserver --model-repository=/opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/ --http-port=8002 --grpc-port=8003 --model-load-thread-count=1 --model-control-mode=explicit --load-model=preprocessing --log-verbose=3

Here is the output up until the hang:

I0812 16:23:57.708810 2689 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0812 16:23:58.156896 2689 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f2e80000000' with size 268435456
I0812 16:23:58.159435 2689 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0812 16:23:58.164541 2689 model_config_utils.cc:680] Server side auto-completed config: name: "preprocessing"
input {
  name: "TEXT_TOKENS"
  data_type: TYPE_INT32
  dims: 1
  dims: -1
}
input {
  name: "SPEECH_EMBEDDINGS"
  data_type: TYPE_FP32
  dims: 1
  dims: -1
  dims: -1
}
input {
  name: "MODALITY_SEQUENCE"
  data_type: TYPE_UINT32
  dims: 1
  dims: -1
  optional: true
}
output {
  name: "INPUT_ID"
  data_type: TYPE_INT32
  dims: 1
  dims: -1
}
output {
  name: "PROMPT_TABLE"
  data_type: TYPE_FP16
  dims: -1
  dims: -1
}
output {
  name: "END_ID"
  data_type: TYPE_INT32
  dims: 1
}
output {
  name: "PAD_ID"
  data_type: TYPE_INT32
  dims: 1
}
instance_group {
  count: 1
  kind: KIND_CPU
}
default_model_filename: "model.py"
parameters {
  key: "audio_modality_indicator_token"
  value {
    string_value: "1"
  }
}
parameters {
  key: "encoder_projections_bias"
  value {
    string_value: "continuous_speech_embedding_fn.bias"
  }
}
parameters {
  key: "encoder_projections_dir"
  value {
    string_value: "/tmp/models/agm/preprocessing/1/encoder_projection/"
  }
}
parameters {
  key: "encoder_projections_weight"
  value {
    string_value: "continuous_speech_embedding_fn.weight"
  }
}
parameters {
  key: "model_config_path"
  value {
    string_value: "/tmp/models/agm/preprocessing/1/config.json"
  }
}
parameters {
  key: "text_modality_indicator_token"
  value {
    string_value: "0"
  }
}
backend: "python"

I0812 16:23:58.164635 2689 model_lifecycle.cc:438] AsyncLoad() 'preprocessing'
I0812 16:23:58.164687 2689 model_lifecycle.cc:469] loading: preprocessing:1
I0812 16:23:58.164743 2689 model_lifecycle.cc:547] CreateModel() 'preprocessing' version 1
I0812 16:23:58.164877 2689 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0812 16:23:58.164906 2689 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
I0812 16:23:58.166882 2689 python_be.cc:2067] 'python' TRITONBACKEND API version: 1.18
I0812 16:23:58.166894 2689 python_be.cc:2089] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0812 16:23:58.166919 2689 python_be.cc:2228] Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30
I0812 16:23:58.167080 2689 python_be.cc:2541] TRITONBACKEND_GetBackendAttribute: setting attributes
I0812 16:23:58.169594 2689 python_be.cc:2319] TRITONBACKEND_ModelInitialize: preprocessing (version 1)
I0812 16:23:58.170153 2689 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0812 16:23:58.170164 2689 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::default_priority_level
I0812 16:23:58.170169 2689 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0812 16:23:58.170172 2689 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0812 16:23:58.170176 2689 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::priority_levels
I0812 16:23:58.170179 2689 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::priority_queue_policy::key
I0812 16:23:58.170183 2689 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0812 16:23:58.170188 2689 model_config_utils.cc:1904]  ModelConfig::ensemble_scheduling::step::model_version
I0812 16:23:58.170191 2689 model_config_utils.cc:1904]  ModelConfig::input::dims
I0812 16:23:58.170195 2689 model_config_utils.cc:1904]  ModelConfig::input::reshape::shape
I0812 16:23:58.170198 2689 model_config_utils.cc:1904]  ModelConfig::instance_group::secondary_devices::device_id
I0812 16:23:58.170202 2689 model_config_utils.cc:1904]  ModelConfig::model_warmup::inputs::value::dims
I0812 16:23:58.170205 2689 model_config_utils.cc:1904]  ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0812 16:23:58.170210 2689 model_config_utils.cc:1904]  ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0812 16:23:58.170214 2689 model_config_utils.cc:1904]  ModelConfig::output::dims
I0812 16:23:58.170217 2689 model_config_utils.cc:1904]  ModelConfig::output::reshape::shape
I0812 16:23:58.170222 2689 model_config_utils.cc:1904]  ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0812 16:23:58.170226 2689 model_config_utils.cc:1904]  ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0812 16:23:58.170229 2689 model_config_utils.cc:1904]  ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0812 16:23:58.170234 2689 model_config_utils.cc:1904]  ModelConfig::sequence_batching::state::dims
I0812 16:23:58.170239 2689 model_config_utils.cc:1904]  ModelConfig::sequence_batching::state::initial_state::dims
I0812 16:23:58.170244 2689 model_config_utils.cc:1904]  ModelConfig::version_policy::specific::versions
I0812 16:23:58.170851 2689 stub_launcher.cc:253] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/preprocessing/1/model.py triton_python_backend_shm_region_1 1048576 1048576 2689 /opt/tritonserver/backends/python 336 preprocessing DEFAULT
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
free(): invalid pointer
[ip-172-31-47-85:02696] *** Process received signal ***
[ip-172-31-47-85:02696] Signal: Aborted (6)
[ip-172-31-47-85:02696] Signal code:  (-6)
[ip-172-31-47-85:02696] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f27b0616520]
[ip-172-31-47-85:02696] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f27b066a9fc]
[ip-172-31-47-85:02696] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f27b0616476]
[ip-172-31-47-85:02696] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f27b05fc7f3]
[ip-172-31-47-85:02696] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7f27b065d676]
[ip-172-31-47-85:02696] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7f27b0674cfc]
[ip-172-31-47-85:02696] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7f27b0676a44]
[ip-172-31-47-85:02696] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7f27b0679453]
[ip-172-31-47-85:02696] [ 8] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6fd54)[0x5555bbe28d54]
[ip-172-31-47-85:02696] [ 9] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x25de3)[0x5555bbddede3]
[ip-172-31-47-85:02696] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f27b05fdd90]
[ip-172-31-47-85:02696] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f27b05fde40]
[ip-172-31-47-85:02696] [12] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x26b45)[0x5555bbddfb45]
[ip-172-31-47-85:02696] *** End of error message ***
I0812 16:24:03.233015 2689 python_be.cc:2023] model configuration:
{
    "name": "preprocessing",
    "platform": "",
    "backend": "python",
    "runtime": "",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "TEXT_TOKENS",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                1,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "SPEECH_EMBEDDINGS",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1,
                -1,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "MODALITY_SEQUENCE",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        }
    ],
    "output": [
        {
            "name": "INPUT_ID",
            "data_type": "TYPE_INT32",
            "dims": [
                1,
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "PROMPT_TABLE",
            "data_type": "TYPE_FP16",
            "dims": [
                -1,
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "END_ID",
            "data_type": "TYPE_INT32",
            "dims": [
                1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "PAD_ID",
            "data_type": "TYPE_INT32",
            "dims": [
                1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "preprocessing_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "encoder_projections_dir": {
            "string_value": "/tmp/models/agm/preprocessing/1/encoder_projection/"
        },
        "encoder_projections_bias": {
            "string_value": "continuous_speech_embedding_fn.bias"
        },
        "audio_modality_indicator_token": {
            "string_value": "1"
        },
        "model_config_path": {
            "string_value": "/tmp/models/agm/preprocessing/1/config.json"
        },
        "encoder_projections_weight": {
            "string_value": "continuous_speech_embedding_fn.weight"
        },
        "text_modality_indicator_token": {
            "string_value": "0"
        }
    },
    "model_warmup": []
}
I0812 16:24:03.233332 2689 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0812 16:24:03.233373 2689 backend_model_instance.cc:69] Creating instance preprocessing_0_0 on CPU using artifact 'model.py'
I0812 16:24:03.234424 2689 stub_launcher.cc:253] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/amazon/alexa_triton_inference_engine/configuration/agm-streaming/preprocessing/1/model.py triton_python_backend_shm_region_2 1048576 1048576 2689 /opt/tritonserver/backends/python 336 preprocessing_0_0 DEFAULT

Looking into the Python backend stub code, I did notice that there is some process forking and IPC going on; maybe some kind of race condition gets triggered when running under MPI?

snjoseph commented 2 months ago

Actually torch is not the culprit, it is tensorrt_llm.profiler. I am able to reproduce using the add_sub example here: https://github.com/triton-inference-server/python_backend/tree/r23.12. I just add import tensorrt_llm.profiler to the model.py and run:

mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/models --log-verbose=3
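
For completeness, the repro setup is roughly the following, assuming the standard add_sub example layout from the python_backend README; the sed line is just one way to add the import:

git clone -b r23.12 https://github.com/triton-inference-server/python_backend.git
cd python_backend
mkdir -p models/add_sub/1
cp examples/add_sub/model.py models/add_sub/1/model.py
cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt
sed -i '1i import tensorrt_llm.profiler' models/add_sub/1/model.py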

I then get the hang with the following logs:

I0812 16:46:39.824892 4370 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0812 16:46:40.276042 4370 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fcfa0000000' with size 268435456
I0812 16:46:40.278472 4370 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0812 16:46:40.283647 4370 model_config_utils.cc:680] Server side auto-completed config: name: "add_sub"
input {
  name: "INPUT0"
  data_type: TYPE_FP32
  dims: 4
}
input {
  name: "INPUT1"
  data_type: TYPE_FP32
  dims: 4
}
output {
  name: "OUTPUT0"
  data_type: TYPE_FP32
  dims: 4
}
output {
  name: "OUTPUT1"
  data_type: TYPE_FP32
  dims: 4
}
instance_group {
  kind: KIND_CPU
}
default_model_filename: "model.py"
backend: "python"

I0812 16:46:40.283706 4370 model_lifecycle.cc:438] AsyncLoad() 'add_sub'
I0812 16:46:40.283747 4370 model_lifecycle.cc:469] loading: add_sub:1
I0812 16:46:40.283827 4370 model_lifecycle.cc:547] CreateModel() 'add_sub' version 1
I0812 16:46:40.283968 4370 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0812 16:46:40.283995 4370 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
I0812 16:46:40.285867 4370 python_be.cc:2067] 'python' TRITONBACKEND API version: 1.18
I0812 16:46:40.285881 4370 python_be.cc:2089] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0812 16:46:40.285909 4370 python_be.cc:2228] Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30
I0812 16:46:40.286086 4370 python_be.cc:2541] TRITONBACKEND_GetBackendAttribute: setting attributes
I0812 16:46:40.288612 4370 python_be.cc:2319] TRITONBACKEND_ModelInitialize: add_sub (version 1)
I0812 16:46:40.289090 4370 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0812 16:46:40.289100 4370 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::default_priority_level
I0812 16:46:40.289104 4370 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0812 16:46:40.289109 4370 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0812 16:46:40.289112 4370 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::priority_levels
I0812 16:46:40.289117 4370 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::priority_queue_policy::key
I0812 16:46:40.289121 4370 model_config_utils.cc:1904]  ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0812 16:46:40.289124 4370 model_config_utils.cc:1904]  ModelConfig::ensemble_scheduling::step::model_version
I0812 16:46:40.289128 4370 model_config_utils.cc:1904]  ModelConfig::input::dims
I0812 16:46:40.289131 4370 model_config_utils.cc:1904]  ModelConfig::input::reshape::shape
I0812 16:46:40.289135 4370 model_config_utils.cc:1904]  ModelConfig::instance_group::secondary_devices::device_id
I0812 16:46:40.289138 4370 model_config_utils.cc:1904]  ModelConfig::model_warmup::inputs::value::dims
I0812 16:46:40.289142 4370 model_config_utils.cc:1904]  ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0812 16:46:40.289145 4370 model_config_utils.cc:1904]  ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0812 16:46:40.289149 4370 model_config_utils.cc:1904]  ModelConfig::output::dims
I0812 16:46:40.289152 4370 model_config_utils.cc:1904]  ModelConfig::output::reshape::shape
I0812 16:46:40.289155 4370 model_config_utils.cc:1904]  ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0812 16:46:40.289159 4370 model_config_utils.cc:1904]  ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0812 16:46:40.289163 4370 model_config_utils.cc:1904]  ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0812 16:46:40.289167 4370 model_config_utils.cc:1904]  ModelConfig::sequence_batching::state::dims
I0812 16:46:40.289170 4370 model_config_utils.cc:1904]  ModelConfig::sequence_batching::state::initial_state::dims
I0812 16:46:40.289173 4370 model_config_utils.cc:1904]  ModelConfig::version_policy::specific::versions
I0812 16:46:40.289739 4370 stub_launcher.cc:253] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_1 1048576 1048576 4370 /opt/tritonserver/backends/python 336 add_sub DEFAULT
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
free(): invalid pointer
[ip-172-31-47-85:04380] *** Process received signal ***
[ip-172-31-47-85:04380] Signal: Aborted (6)
[ip-172-31-47-85:04380] Signal code:  (-6)
[ip-172-31-47-85:04380] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f50bd016520]
[ip-172-31-47-85:04380] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f50bd06a9fc]
[ip-172-31-47-85:04380] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f50bd016476]
[ip-172-31-47-85:04380] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f50bcffc7f3]
[ip-172-31-47-85:04380] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7f50bd05d676]
[ip-172-31-47-85:04380] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7f50bd074cfc]
[ip-172-31-47-85:04380] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7f50bd076a44]
[ip-172-31-47-85:04380] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7f50bd079453]
[ip-172-31-47-85:04380] [ 8] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6fd54)[0x55d6e7b2ad54]
[ip-172-31-47-85:04380] [ 9] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x25de3)[0x55d6e7ae0de3]
[ip-172-31-47-85:04380] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f50bcffdd90]
[ip-172-31-47-85:04380] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f50bcffde40]
[ip-172-31-47-85:04380] [12] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x26b45)[0x55d6e7ae1b45]
[ip-172-31-47-85:04380] *** End of error message ***
I0812 16:46:45.340856 4370 python_be.cc:2023] model configuration:
{
    "name": "add_sub",
    "platform": "",
    "backend": "python",
    "runtime": "",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "INPUT0",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                4
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "INPUT1",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                4
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT0",
            "data_type": "TYPE_FP32",
            "dims": [
                4
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "OUTPUT1",
            "data_type": "TYPE_FP32",
            "dims": [
                4
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "add_sub_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": []
}
I0812 16:46:45.341147 4370 python_be.cc:2363] TRITONBACKEND_ModelInstanceInitialize: add_sub_0_0 (CPU device 0)
I0812 16:46:45.341185 4370 backend_model_instance.cc:69] Creating instance add_sub_0_0 on CPU using artifact 'model.py'
I0812 16:46:45.342043 4370 stub_launcher.cc:253] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_2 1048576 1048576 4370 /opt/tritonserver/backends/python 336 add_sub_0_0 DEFAULT
Slyne commented 2 months ago

@Tabrizian @tanmayv25 Any ideas?

snjoseph commented 2 months ago

Sorry, I realized I was using our own modified container in my runs above, so I tried again with nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 (as @DZADSL72-00558 did). The output looks different (there's no SIGABRT in the logs) but the hang is still there:

# mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/models --log-verbose=3
I0812 19:08:32.367035 2315 cache_manager.cc:480] "Create CacheManager with cache_dir: '/opt/tritonserver/caches'"
I0812 19:08:35.061925 2315 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f0bb2000000' with size 268435456"
I0812 19:08:35.097474 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0812 19:08:35.097487 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0812 19:08:35.097493 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0812 19:08:35.097497 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0812 19:08:35.097502 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 4 with size 67108864"
I0812 19:08:35.097506 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 5 with size 67108864"
I0812 19:08:35.097511 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 6 with size 67108864"
I0812 19:08:35.097515 2315 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 7 with size 67108864"
I0812 19:08:36.536436 2315 model_config_utils.cc:681] "Server side auto-completed config: "
name: "add_sub"
input {
  name: "INPUT0"
  data_type: TYPE_FP32
  dims: 4
}
input {
  name: "INPUT1"
  data_type: TYPE_FP32
  dims: 4
}
output {
  name: "OUTPUT0"
  data_type: TYPE_FP32
  dims: 4
}
output {
  name: "OUTPUT1"
  data_type: TYPE_FP32
  dims: 4
}
instance_group {
  kind: KIND_CPU
}
default_model_filename: "model.py"
backend: "python"

I0812 19:08:36.536499 2315 model_lifecycle.cc:441] "AsyncLoad() 'add_sub'"
I0812 19:08:36.536538 2315 model_lifecycle.cc:472] "loading: add_sub:1"
I0812 19:08:36.536596 2315 model_lifecycle.cc:550] "CreateModel() 'add_sub' version 1"
I0812 19:08:36.536715 2315 backend_model.cc:503] "Adding default backend config setting: default-max-batch-size,4"
I0812 19:08:36.536736 2315 shared_library.cc:112] "OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so"
I0812 19:08:36.537937 2315 python_be.cc:2099] "'python' TRITONBACKEND API version: 1.19"
I0812 19:08:36.537951 2315 python_be.cc:2121] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0812 19:08:36.537971 2315 python_be.cc:2259] "Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30"
I0812 19:08:36.538131 2315 python_be.cc:2582] "TRITONBACKEND_GetBackendAttribute: setting attributes"
I0812 19:08:36.558044 2315 python_be.cc:2360] "TRITONBACKEND_ModelInitialize: add_sub (version 1)"
I0812 19:08:36.558491 2315 model_config_utils.cc:1902] "ModelConfig 64-bit fields:"
I0812 19:08:36.558505 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_priority_level"
I0812 19:08:36.558510 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds"
I0812 19:08:36.558514 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::max_queue_delay_microseconds"
I0812 19:08:36.558519 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_levels"
I0812 19:08:36.558524 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::key"
I0812 19:08:36.558529 2315 model_config_utils.cc:1904] "\tModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds"
I0812 19:08:36.558534 2315 model_config_utils.cc:1904] "\tModelConfig::ensemble_scheduling::step::model_version"
I0812 19:08:36.558538 2315 model_config_utils.cc:1904] "\tModelConfig::input::dims"
I0812 19:08:36.558542 2315 model_config_utils.cc:1904] "\tModelConfig::input::reshape::shape"
I0812 19:08:36.558547 2315 model_config_utils.cc:1904] "\tModelConfig::instance_group::secondary_devices::device_id"
I0812 19:08:36.558553 2315 model_config_utils.cc:1904] "\tModelConfig::model_warmup::inputs::value::dims"
I0812 19:08:36.558557 2315 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim"
I0812 19:08:36.558562 2315 model_config_utils.cc:1904] "\tModelConfig::optimization::cuda::graph_spec::input::value::dim"
I0812 19:08:36.558566 2315 model_config_utils.cc:1904] "\tModelConfig::output::dims"
I0812 19:08:36.558570 2315 model_config_utils.cc:1904] "\tModelConfig::output::reshape::shape"
I0812 19:08:36.558575 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::direct::max_queue_delay_microseconds"
I0812 19:08:36.558579 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::max_sequence_idle_microseconds"
I0812 19:08:36.558583 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::oldest::max_queue_delay_microseconds"
I0812 19:08:36.558588 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::dims"
I0812 19:08:36.558592 2315 model_config_utils.cc:1904] "\tModelConfig::sequence_batching::state::initial_state::dims"
I0812 19:08:36.558596 2315 model_config_utils.cc:1904] "\tModelConfig::version_policy::specific::versions"
I0812 19:08:36.559159 2315 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_fb2152c7-cf8e-4d73-a098-1112d6be7786 1048576 1048576 2315 /opt/tritonserver/backends/python 336 add_sub DEFAULT"
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
I0812 19:08:40.998141 2315 python_be.cc:2055] "model configuration:\n{\n    \"name\": \"add_sub\",\n    \"platform\": \"\",\n    \"backend\": \"python\",\n    \"runtime\": \"\",\n    \"version_policy\": {\n        \"latest\": {\n            \"num_versions\": 1\n        }\n    },\n    \"max_batch_size\": 0,\n    \"input\": [\n        {\n            \"name\": \"INPUT0\",\n            \"data_type\": \"TYPE_FP32\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                4\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false\n        },\n        {\n            \"name\": \"INPUT1\",\n            \"data_type\": \"TYPE_FP32\",\n            \"format\": \"FORMAT_NONE\",\n            \"dims\": [\n                4\n            ],\n            \"is_shape_tensor\": false,\n            \"allow_ragged_batch\": false,\n            \"optional\": false\n        }\n    ],\n    \"output\": [\n        {\n            \"name\": \"OUTPUT0\",\n            \"data_type\": \"TYPE_FP32\",\n            \"dims\": [\n                4\n            ],\n            \"label_filename\": \"\",\n            \"is_shape_tensor\": false\n        },\n        {\n            \"name\": \"OUTPUT1\",\n            \"data_type\": \"TYPE_FP32\",\n            \"dims\": [\n                4\n            ],\n            \"label_filename\": \"\",\n            \"is_shape_tensor\": false\n        }\n    ],\n    \"batch_input\": [],\n    \"batch_output\": [],\n    \"optimization\": {\n        \"priority\": \"PRIORITY_DEFAULT\",\n        \"input_pinned_memory\": {\n            \"enable\": true\n        },\n        \"output_pinned_memory\": {\n            \"enable\": true\n        },\n        \"gather_kernel_buffer_threshold\": 0,\n        \"eager_batching\": false\n    },\n    \"instance_group\": [\n        {\n            \"name\": \"add_sub_0\",\n            \"kind\": \"KIND_CPU\",\n            \"count\": 1,\n            \"gpus\": [],\n            \"secondary_devices\": [],\n            \"profile\": [],\n            \"passive\": false,\n            \"host_policy\": \"\"\n        }\n    ],\n    \"default_model_filename\": \"model.py\",\n    \"cc_model_filenames\": {},\n    \"metric_tags\": {},\n    \"parameters\": {},\n    \"model_warmup\": []\n}"
I0812 19:08:40.998555 2315 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: add_sub_0_0 (CPU device 0)"
I0812 19:08:40.998593 2315 backend_model_instance.cc:69] "Creating instance add_sub_0_0 on CPU using artifact 'model.py'"
I0812 19:08:40.999266 2315 stub_launcher.cc:385] "Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /opt/tritonserver/python_backend/models/add_sub/1/model.py triton_python_backend_shm_region_4ece1248-92b5-467e-a857-bfaa256bbdf2 1048576 1048576 2315 /opt/tritonserver/backends/python 336 add_sub_0_0 DEFAULT"
snjoseph commented 2 months ago

I have a workaround: adding --disable-auto-complete-config to the tritonserver invocation avoids the hang (and also makes the SIGABRT go away in our custom container). This unblocks us, but I will leave it to the NVIDIA side to decide whether to close this issue or pursue the root cause.
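
Concretely, for the add_sub repro above, the working invocation is the same command as before with the flag appended:

mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/models --disable-auto-complete-config --log-verbose=3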

Slyne commented 2 months ago

Hi @snjoseph , I tried adding import tensorrt_llm.profiler to the add_sub example, and ran the command:

mpirun --allow-run-as-root -n 1 tritonserver --model-repository `pwd`/add_sub --log-verbose=3

It doesn't hang there, but it gives me the error below.

orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS

The Docker container is the same one mentioned above. I've tested on A100 80GB and NVIDIA H100 80GB HBM3. Adding --disable-auto-complete-config does resolve the above issue.