triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Can't launch triton server following docs, expecting [TensorRT] library version 9.2.0.5 got 9.3.0.1 #424

Open conway-abacus opened 2 months ago

conway-abacus commented 2 months ago

System Info

Who can help?

@kaiyux @byshiue

Information

Tasks

Reproduction

I'm not able to successfully launch the Triton server for a quantized Mixtral model following the README instructions (using tag v0.9.0 for both tensorrtllm_backend and TensorRT-LLM, and nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 as advised here).

I was able to build the engine and run the run.py script from the TensorRT-LLM repo with reasonable results, but I'm including the steps for completeness.

python3 convert_checkpoint.py --model_dir /models/Mixtral-8x7B-Instruct-v0.1 \
                             --output_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
                             --dtype float16  \
                             --int8_kv_cache \
                             --use_weight_only \
                             --weight_only_precision int8 

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
                 --output_dir ./trt_engines/mixtral/int8_kv_wq/1-gpu \
                 --gpt_attention_plugin float16 \
                 --gemm_plugin float16 \
                 --max_input_len 32768 \
                 --max_output_len 32768 \
                 --max_batch_size 1

python3 ../run.py --max_output_len=500 \
                  --tokenizer_dir /models/Mixtral-8x7B-Instruct-v0.1/ \
                  --engine_dir=./trt_engines/mixtral/int8_kv_wq/1-gpu \
                  --input_text "please tell me a story about a man and his dog."
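(For reference, the build environment's versions can be captured with a quick check like the sketch below; it assumes the tensorrt and tensorrt_llm Python packages are importable in that environment, and is not part of the README steps.)

# Sketch: record the TensorRT / TensorRT-LLM versions of the environment the
# engine is built in, so they can later be compared against the serving
# container.
import tensorrt
import tensorrt_llm

print("tensorrt     :", tensorrt.__version__)
print("tensorrt_llm :", tensorrt_llm.__version__)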

Then, when trying to launch the Triton server, I ran

docker run -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus 1 --mount type=bind,source=/abacus,target=/abacus -v /abacus/repos/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash

> cd tensorrtllm_backend
> mkdir triton_model_repo
> cp -r all_models/inflight_batcher_llm/* triton_model_repo/
> cp /path/to/engine/* triton_model_repo/tensorrt_llm/1

TOKENIZER_DIR=/path/to/tokenizer/
TOKENIZER_TYPE=auto
DECOUPLED_MODE=false
ENGINE_DIR=/path/to/engine/
MODEL_FOLDER=/tensorrtllm_backend/triton_model_repo
MAX_BATCH_SIZE=1
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:1,accumulate_tokens:False
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_DIR},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo

and received the error below.

Expected behavior

The launch_triton_server.py script should launch the server successfully

actual behavior

The launch_triton_server.py script fails with the following error:

root@ll04:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
root@ll04:/tensorrtllm_backend# I0420 06:13:01.940726 115 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb29c000000' with size 268435456
I0420 06:13:01.941032 115 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0420 06:13:01.955439 115 model_lifecycle.cc:469] loading: postprocessing:1
I0420 06:13:01.955501 115 model_lifecycle.cc:469] loading: preprocessing:1
I0420 06:13:01.955552 115 model_lifecycle.cc:469] loading: tensorrt_llm:1
I0420 06:13:01.955591 115 model_lifecycle.cc:469] loading: tensorrt_llm_bls:1
I0420 06:13:02.249980 115 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0420 06:13:03.295373 115 model_lifecycle.cc:835] successfully loaded 'postprocessing'
I0420 06:13:03.659949 115 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0420 06:13:03.660404 115 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_lora_rank cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_lora_rank' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'lora_target_modules' not found
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I0420 06:13:03.942077 115 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm_bls'
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0420 06:13:04.396067 115 model_lifecycle.cc:835] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 65536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 44813 MiB
[TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.
[TensorRT-LLM][ERROR] 2: [engine.cpp::deserializeEngine::1148] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
E0420 06:14:22.456950 115 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb1fc2850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fb1fc2850a0]
3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]
7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]
8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]
10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]
11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]
12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]
13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]
14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]
15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]
16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]
17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]
18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]
19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]
20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]
21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]
22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]
23      0x7fb4f9812a04 clone + 68
E0420 06:14:22.457146 115 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb1fc2850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fb1fc2850a0]
3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]
7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]
8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]
10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]
11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]
12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]
13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]
14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]
15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]
16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]
17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]
18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]
19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]
20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]
21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]
22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]
23      0x7fb4f9812a04 clone + 68
I0420 06:14:22.457192 115 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
E0420 06:14:22.457491 115 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb1fc2850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fb1fc2850a0]
3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]
7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]
8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]
10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]
11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]
12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]
13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]
14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]
15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]
16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]
17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]
18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]
19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]
20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]
21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]
22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]
23      0x7fb4f9812a04 clone + 68;
I0420 06:14:22.457646 115 server.cc:607] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0420 06:14:22.457758 115 server.cc:634] 
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                                    |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","defa |
|             |                                                                 | ult-max-batch-size":"4"}}                                                                                                                                                 |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}            |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0420 06:14:22.458015 115 server.cc:677] 
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                                                                                                                       |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                                                                                                                                                                        |
| preprocessing    | 1       | READY                                                                                                                                                                                                                        |
| tensorrt_llm     | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllm |
|                  |         | Runtime.cpp:72)                                                                                                                                                                                                              |
|                  |         | 1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                   |
|                  |         | 3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::W |
|                  |         | orldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) +  |
|                  |         | 1138                                                                                                                                                                                                                         |
|                  |         | 3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138 |
|                  |         | 4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687 |
|                  |         | 5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336 |
|                  |         | 6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]                                                                                                            |
|                  |         | 7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]                                                                                                            |
|                  |         | 8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                           |
|                  |         | 9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]                                                                                                                           |
|                  |         | 10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]                                                                                                                           |
|                  |         | 11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]                                                                                                                           |
|                  |         | 12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]                                                                                                                           |
|                  |         | 13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]                                                                                                                           |
|                  |         | 14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]                                                                                                                                        |
|                  |         | 15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]                                                                                                                           |
|                  |         | 16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]                                                                                                                           |
|                  |         | 17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]                                                                                                                           |
|                  |         | 18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]                                                                                                                           |
|                  |         | 19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]                                                                                                                           |
|                  |         | 20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]                                                                                                                           |
|                  |         | 21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]                                                                                                                                   |
|                  |         | 22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]                                                                                                                                        |
|                  |         | 23      0x7fb4f9812a04 clone + 68                                                                                                                                                                                            |
| tensorrt_llm_bls | 1       | READY                                                                                                                                                                                                                        |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0420 06:14:22.561009 115 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA H100 PCIe
I0420 06:14:22.564576 115 metrics.cc:770] Collecting CPU metrics
I0420 06:14:22.564878 115 tritonserver.cc:2508] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.43.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo                                                                                                                                                                          |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 1                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0420 06:14:22.564894 115 server.cc:307] Waiting for in-flight requests to complete.
I0420 06:14:22.564905 115 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0420 06:14:22.565726 115 server.cc:338] All models are stopped, unloading models
I0420 06:14:22.565743 115 server.cc:347] Timeout 30: Found 3 live models and 0 in-flight non-inference requests
I0420 06:14:23.565944 115 server.cc:347] Timeout 29: Found 3 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
Cleaning up...
I0420 06:14:23.697852 115 model_lifecycle.cc:620] successfully unloaded 'tensorrt_llm_bls' version 1
I0420 06:14:23.738525 115 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
I0420 06:14:23.889608 115 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
I0420 06:14:24.566219 115 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[49909,1],0]
  Exit code:    1

additional notes

Although the error message says expecting library version 9.2.0.5 got 9.3.0.1, here are the contents of /usr/local/tensorrt/include/NvInferVersion.h:

/*
 * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * SPDX-License-Identifier: LicenseRef-NvidiaProprietary
 *
 * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
 * property and proprietary rights in and to this material, related
 * documentation and any modifications thereto. Any use, reproduction,
 * disclosure or distribution of this material and related documentation
 * without an express license agreement from NVIDIA CORPORATION or
 * its affiliates is strictly prohibited.
 */

//!
//! \file NvInferVersion.h
//!
//! Defines the TensorRT version
//!
#ifndef NV_INFER_VERSION_H
#define NV_INFER_VERSION_H

#define NV_TENSORRT_MAJOR 9 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 2 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 0 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 5 //!< TensorRT build number.

#define NV_TENSORRT_LWS_MAJOR 0 //!< TensorRT LWS major version.
#define NV_TENSORRT_LWS_MINOR 0 //!< TensorRT LWS minor version.
#define NV_TENSORRT_LWS_PATCH 0 //!< TensorRT LWS patch version.

// This #define is deprecated in TensorRT 8.6. Use NV_TENSORRT_MAJOR.
#define NV_TENSORRT_SONAME_MAJOR 9 //!< Shared object library major version number.
// This #define is deprecated in TensorRT 8.6. Use NV_TENSORRT_MINOR.
#define NV_TENSORRT_SONAME_MINOR 2 //!< Shared object library minor version number.
// This #define is deprecated in TensorRT 8.6. Use NV_TENSORRT_PATCH.
#define NV_TENSORRT_SONAME_PATCH 0 //!< Shared object library patch version number.

#define NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS 0         //!< An early access release
#define NV_TENSORRT_RELEASE_TYPE_RELEASE_CANDIDATE 1    //!< A release candidate
#define NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY 2 //!< A final release

#define NV_TENSORRT_RELEASE_TYPE NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY //!< TensorRT release type

#endif // NV_INFER_VERSION_H
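
To rule out a header/library mismatch inside the container, the library that is actually loaded can also be queried directly. A minimal sketch (it only assumes a libnvinfer shared object is discoverable by the dynamic loader, as it should be in the Triton TRT-LLM containers):

# Sketch: ask libnvinfer itself which version it is, independent of the
# headers under /usr/local/tensorrt/include.
import ctypes

lib = None
for name in ("libnvinfer.so", "libnvinfer.so.9"):
    try:
        lib = ctypes.CDLL(name)
        break
    except OSError:
        continue

if lib is None:
    raise SystemExit("libnvinfer not found on the loader path")

lib.getInferLibVersion.restype = ctypes.c_int32
v = lib.getInferLibVersion()
# For TensorRT 9.x this should decode as major*1000 + minor*100 + patch,
# e.g. 9200 for a 9.2.0.x library (the build number is not encoded).
print("loaded TensorRT library reports:", v)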

Also, according to this, the dependent TensorRT version was updated to 9.3.

byshiue commented 2 months ago

It is likely because the TRT versions differ between the two docker images. Could you check that?

conway-abacus commented 2 months ago

Thanks @byshiue, do you mean the docker image used to build the engine?

>>> import tensorrt
>>> tensorrt.__version__
'9.3.0.post12.dev1'

I was following the guide; should I downgrade and rebuild, or upgrade TRT in the server docker?

byshiue commented 2 months ago

You should check the TRT version used to run Triton. You can either upgrade the TRT version in the Triton docker image to 9.3, or downgrade the TRT version used to build the engine to 9.2.
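For example, a check like the sketch below could be run in the engine-building environment before trtllm-build so a mismatch is caught early (the 9.2 expectation is taken from the NvInferVersion.h shown above and is otherwise an assumption):

# Sketch: fail fast if the build environment's TensorRT does not match the
# release line of the serving container (9.2.x here). Engines serialized
# with a newer TRT will not deserialize in an older runtime.
import tensorrt

EXPECTED_PREFIX = "9.2."  # assumed serving-container TRT release line
if not tensorrt.__version__.startswith(EXPECTED_PREFIX):
    raise SystemExit(
        f"build-env TensorRT is {tensorrt.__version__}, expected "
        f"{EXPECTED_PREFIX}x; the resulting engine plan will fail to load."
    )
print("TensorRT version check passed:", tensorrt.__version__)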

Graham1025 commented 2 months ago

> You can upgrade the TRT version of triton docker image to 9.3

Can you explain how to upgrade the TRT version of the Triton docker image to 9.3? Does that require building from source?

rmccorm4 commented 2 months ago

Hi @conway-abacus, could you try doing everything (both engine building and starting Triton) in this image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3? This should help align the versions for building and runtime to TRTLLM v0.9.0.