triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Can't launch triton server following docs, expecting [TensorRT] library version 9.2.0.5 got 9.3.0.1 #424

Open conway-abacus opened 2 months ago

conway-abacus commented 2 months ago

System Info

Who can help?

@kaiyux @byshiue

Information

Tasks

Reproduction

I'm not able to successfully launch the Triton server for a quantized Mixtral model following the README instructions (using tag v0.9.0 for both tensorrtllm_backend and TensorRT-LLM, and nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 as advised here).

I was able to build the engine and run the run.py script from the TensorRT-LLM repo with reasonable results, but I'm including the steps for completeness.

python3 convert_checkpoint.py --model_dir /models/Mixtral-8x7B-Instruct-v0.1 \
                             --output_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
                             --dtype float16  \
                             --int8_kv_cache \
                             --use_weight_only \
                             --weight_only_precision int8 

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
                 --output_dir ./trt_engines/mixtral/int8_kv_wq/1-gpu \
                 --gpt_attention_plugin float16 \
                 --gemm_plugin float16 \
                 --max_input_len 32768 \
                 --max_output_len 32768 \
                 --max_batch_size 1

python3 ../run.py --max_output_len=500 \
                  --tokenizer_dir /models/Mixtral-8x7B-Instruct-v0.1/ \
                  --engine_dir=./trt_engines/mixtral/int8_kv_wq/1-gpu \
                  --input_text "please tell me a story about a man and his dog."
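(For reference, the build environment's versions can be captured with a quick check like the sketch below; it assumes the tensorrt and tensorrt_llm Python packages are importable in that environment, and is not part of the README steps.)

# Sketch: record the TensorRT / TensorRT-LLM versions of the environment the
# engine is built in, so they can later be compared against the serving
# container.
import tensorrt
import tensorrt_llm

print("tensorrt     :", tensorrt.__version__)
print("tensorrt_llm :", tensorrt_llm.__version__)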

Then, when trying to launch the Triton server, I ran

docker run -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus 1 --mount type=bind,source=/abacus,target=/abacus -v /abacus/repos/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash

> cd tensorrtllm_backend
> mkdir triton_model_repo
> cp -r all_models/inflight_batcher_llm/* triton_model_repo/
> cp /path/to/engine/* triton_model_repo/tensorrt_llm/1

TOKENIZER_DIR=/path/to/tokenizer/
TOKENIZER_TYPE=auto
DECOUPLED_MODE=false
ENGINE_DIR=/path/to/engine/
MODEL_FOLDER=/tensorrtllm_backend/triton_model_repo
MAX_BATCH_SIZE=1
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:1,accumulate_tokens:False
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_DIR},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo

and received the error below.

Expected behavior

The launch_triton_server.py script should launch the server successfully

actual behavior

The launch_triton_server.py script fails with the following error:

root@ll04:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
root@ll04:/tensorrtllm_backend# I0420 06:13:01.940726 115 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fb29c000000' with size 268435456
I0420 06:13:01.941032 115 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0420 06:13:01.955439 115 model_lifecycle.cc:469] loading: postprocessing:1
I0420 06:13:01.955501 115 model_lifecycle.cc:469] loading: preprocessing:1
I0420 06:13:01.955552 115 model_lifecycle.cc:469] loading: tensorrt_llm:1
I0420 06:13:01.955591 115 model_lifecycle.cc:469] loading: tensorrt_llm_bls:1
I0420 06:13:02.249980 115 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0420 06:13:03.295373 115 model_lifecycle.cc:835] successfully loaded 'postprocessing'
I0420 06:13:03.659949 115 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0420 06:13:03.660404 115 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_lora_rank cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_lora_rank' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'lora_target_modules' not found
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I0420 06:13:03.942077 115 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm_bls'
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0420 06:13:04.396067 115 model_lifecycle.cc:835] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 65536
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 44813 MiB
[TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.
[TensorRT-LLM][ERROR] 2: [engine.cpp::deserializeEngine::1148] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
E0420 06:14:22.456950 115 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb1fc2850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fb1fc2850a0]
3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]
7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]
8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]
10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]
11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]
12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]
13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]
14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]
15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]
16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]
17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]
18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]
19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]
20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]
21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]
22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]
23      0x7fb4f9812a04 clone + 68
E0420 06:14:22.457146 115 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb1fc2850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fb1fc2850a0]
3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]
7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]
8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]
10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]
11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]
12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]
13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]
14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]
15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]
16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]
17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]
18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]
19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]
20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]
21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]
22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]
23      0x7fb4f9812a04 clone + 68
I0420 06:14:22.457192 115 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
E0420 06:14:22.457491 115 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb1fc2850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fb1fc2850a0]
3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]
7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]
8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]
10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]
11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]
12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]
13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]
14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]
15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]
16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]
17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]
18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]
19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]
20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]
21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]
22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]
23      0x7fb4f9812a04 clone + 68;
I0420 06:14:22.457646 115 server.cc:607] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0420 06:14:22.457758 115 server.cc:634] 
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                                    |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","defa |
|             |                                                                 | ult-max-batch-size":"4"}}                                                                                                                                                 |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}            |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0420 06:14:22.458015 115 server.cc:677] 
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                                                                                                                       |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                                                                                                                                                                        |
| preprocessing    | 1       | READY                                                                                                                                                                                                                        |
| tensorrt_llm     | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllm |
|                  |         | Runtime.cpp:72)                                                                                                                                                                                                              |
|                  |         | 1       0x7fb1fc2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                   |
|                  |         | 3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::W |
|                  |         | orldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) +  |
|                  |         | 1138                                                                                                                                                                                                                         |
|                  |         | 3       0x7fb1fe14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138 |
|                  |         | 4       0x7fb1fe125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687 |
|                  |         | 5       0x7fb1fe11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336 |
|                  |         | 6       0x7fb4f0211b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fb4f0211b62]                                                                                                            |
|                  |         | 7       0x7fb4f02123f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fb4f02123f2]                                                                                                            |
|                  |         | 8       0x7fb4f0204fd5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                           |
|                  |         | 9       0x7fb4fa132296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fb4fa132296]                                                                                                                           |
|                  |         | 10      0x7fb4fa1334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fb4fa1334d6]                                                                                                                           |
|                  |         | 11      0x7fb4fa116045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fb4fa116045]                                                                                                                           |
|                  |         | 12      0x7fb4fa116686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fb4fa116686]                                                                                                                           |
|                  |         | 13      0x7fb4fa122efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fb4fa122efd]                                                                                                                           |
|                  |         | 14      0x7fb4f9786ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fb4f9786ee8]                                                                                                                                        |
|                  |         | 15      0x7fb4fa10cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fb4fa10cf0b]                                                                                                                           |
|                  |         | 16      0x7fb4fa11dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fb4fa11dc65]                                                                                                                           |
|                  |         | 17      0x7fb4fa12231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fb4fa12231e]                                                                                                                           |
|                  |         | 18      0x7fb4fa2140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fb4fa2140c8]                                                                                                                           |
|                  |         | 19      0x7fb4fa2179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fb4fa2179ac]                                                                                                                           |
|                  |         | 20      0x7fb4fa36b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fb4fa36b6c2]                                                                                                                           |
|                  |         | 21      0x7fb4f99f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb4f99f2253]                                                                                                                                   |
|                  |         | 22      0x7fb4f9781ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb4f9781ac3]                                                                                                                                        |
|                  |         | 23      0x7fb4f9812a04 clone + 68                                                                                                                                                                                            |
| tensorrt_llm_bls | 1       | READY                                                                                                                                                                                                                        |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0420 06:14:22.561009 115 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA H100 PCIe
I0420 06:14:22.564576 115 metrics.cc:770] Collecting CPU metrics
I0420 06:14:22.564878 115 tritonserver.cc:2508] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.43.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo                                                                                                                                                                          |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 1                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0420 06:14:22.564894 115 server.cc:307] Waiting for in-flight requests to complete.
I0420 06:14:22.564905 115 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0420 06:14:22.565726 115 server.cc:338] All models are stopped, unloading models
I0420 06:14:22.565743 115 server.cc:347] Timeout 30: Found 3 live models and 0 in-flight non-inference requests
I0420 06:14:23.565944 115 server.cc:347] Timeout 29: Found 3 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
Cleaning up...
I0420 06:14:23.697852 115 model_lifecycle.cc:620] successfully unloaded 'tensorrt_llm_bls' version 1
I0420 06:14:23.738525 115 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
I0420 06:14:23.889608 115 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
I0420 06:14:24.566219 115 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[49909,1],0]
  Exit code:    1

additional notes

Although the error message says expecting library version 9.2.0.5 got 9.3.0.1, here are the contents of /usr/local/tensorrt/include/NvInferVersion.h:

/*
 * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * SPDX-License-Identifier: LicenseRef-NvidiaProprietary
 *
 * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
 * property and proprietary rights in and to this material, related
 * documentation and any modifications thereto. Any use, reproduction,
 * disclosure or distribution of this material and related documentation
 * without an express license agreement from NVIDIA CORPORATION or
 * its affiliates is strictly prohibited.
 */

//!
//! \file NvInferVersion.h
//!
//! Defines the TensorRT version
//!
#ifndef NV_INFER_VERSION_H
#define NV_INFER_VERSION_H

#define NV_TENSORRT_MAJOR 9 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 2 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 0 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 5 //!< TensorRT build number.

#define NV_TENSORRT_LWS_MAJOR 0 //!< TensorRT LWS major version.
#define NV_TENSORRT_LWS_MINOR 0 //!< TensorRT LWS minor version.
#define NV_TENSORRT_LWS_PATCH 0 //!< TensorRT LWS patch version.

// This #define is deprecated in TensorRT 8.6. Use NV_TENSORRT_MAJOR.
#define NV_TENSORRT_SONAME_MAJOR 9 //!< Shared object library major version number.
// This #define is deprecated in TensorRT 8.6. Use NV_TENSORRT_MINOR.
#define NV_TENSORRT_SONAME_MINOR 2 //!< Shared object library minor version number.
// This #define is deprecated in TensorRT 8.6. Use NV_TENSORRT_PATCH.
#define NV_TENSORRT_SONAME_PATCH 0 //!< Shared object library patch version number.

#define NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS 0         //!< An early access release
#define NV_TENSORRT_RELEASE_TYPE_RELEASE_CANDIDATE 1    //!< A release candidate
#define NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY 2 //!< A final release

#define NV_TENSORRT_RELEASE_TYPE NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY //!< TensorRT release type

#endif // NV_INFER_VERSION_H
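
To rule out a header/library mismatch inside the container, the library that is actually loaded can also be queried directly. A minimal sketch (it only assumes a libnvinfer shared object is discoverable by the dynamic loader, as it should be in the Triton TRT-LLM containers):

# Sketch: ask libnvinfer itself which version it is, independent of the
# headers under /usr/local/tensorrt/include.
import ctypes

lib = None
for name in ("libnvinfer.so", "libnvinfer.so.9"):
    try:
        lib = ctypes.CDLL(name)
        break
    except OSError:
        continue

if lib is None:
    raise SystemExit("libnvinfer not found on the loader path")

lib.getInferLibVersion.restype = ctypes.c_int32
v = lib.getInferLibVersion()
# For TensorRT 9.x this should decode as major*1000 + minor*100 + patch,
# e.g. 9200 for a 9.2.0.x library (the build number is not encoded).
print("loaded TensorRT library reports:", v)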

Also, according to this, the dependent TensorRT version was updated to 9.3.

byshiue commented 2 months ago

It is likely because the TRT versions differ between the two docker images. Could you check that?

conway-abacus commented 2 months ago

Thanks @byshiue, do you mean the docker image used to build the engine?

>>> import tensorrt
>>> tensorrt.__version__
'9.3.0.post12.dev1'

I was following the guide; should I downgrade and rebuild, or upgrade TRT in the server docker?

byshiue commented 2 months ago

You should check the TRT version used to run Triton. You can either upgrade the TRT version in the Triton docker image to 9.3, or downgrade the TRT version used to build the engine to 9.2.
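For example, a check like the sketch below could be run in the engine-building environment before trtllm-build so a mismatch is caught early (the 9.2 expectation is taken from the NvInferVersion.h shown above and is otherwise an assumption):

# Sketch: fail fast if the build environment's TensorRT does not match the
# release line of the serving container (9.2.x here). Engines serialized
# with a newer TRT will not deserialize in an older runtime.
import tensorrt

EXPECTED_PREFIX = "9.2."  # assumed serving-container TRT release line
if not tensorrt.__version__.startswith(EXPECTED_PREFIX):
    raise SystemExit(
        f"build-env TensorRT is {tensorrt.__version__}, expected "
        f"{EXPECTED_PREFIX}x; the resulting engine plan will fail to load."
    )
print("TensorRT version check passed:", tensorrt.__version__)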

Graham1025 commented 2 months ago

> You can upgrade the TRT version of triton docker image to 9.3

Can you explain how to upgrade the TRT version of the Triton docker image to 9.3? Does that require building from source?

rmccorm4 commented 2 months ago

Hi @conway-abacus, could you try doing everything (both engine building and starting Triton) in this image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3? This should help align the versions for building and runtime to TRTLLM v0.9.0.