triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault). #206

Open zhaoxjmail opened 9 months ago

zhaoxjmail commented 9 months ago

I built the engine using 2-way tensor parallelism on BLOOM 7B, then I started the Docker container:

docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/project/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm:latest bash

Inside the container I ran:

python scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo

...
I1212 02:47:53.334754 401 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I1212 02:47:53.334943 401 server.cc:592] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1212 02:47:53.335131 401 server.cc:619] 
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                       |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.0000 |
|             |                                                                 | 00","default-max-batch-size":"4"}}                                                                                           |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+

I1212 02:47:53.335208 401 server.cc:662] 
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1       | READY  |
+--------------+---------+--------+

I1212 02:47:53.358979 401 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A800 80GB PCIe
I1212 02:47:53.359002 401 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A800 80GB PCIe
I1212 02:47:53.359319 401 metrics.cc:710] Collecting CPU metrics
I1212 02:47:53.359487 401 tritonserver.cc:2458] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                      |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                     |
| server_version                   | 2.39.0                                                                                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_dat |
|                                  | a parameters statistics trace logging                                                                                                                                      |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo                                                                                                                                     |
| model_control_mode               | MODE_NONE                                                                                                                                                                  |
| strict_model_config              | 1                                                                                                                                                                          |
| rate_limit                       | OFF                                                                                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                  |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                   |
| min_supported_compute_capability | 6.0                                                                                                                                                                        |
| strict_readiness                 | 1                                                                                                                                                                          |
| exit_timeout                     | 30                                                                                                                                                                         |
| cache_enabled                    | 0                                                                                                                                                                          |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1212 02:47:53.361105 401 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1212 02:47:53.361350 401 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
Signal (11) received.
I1212 02:47:53.424639 401 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
Signal (11) received.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
schetlur-nv commented 9 months ago

Can you share the triton log? You can generate this by adding --log to your launch_triton_server.py command line.
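For reference, a minimal sketch of such a launch command with logging enabled (the log file path here is arbitrary; the other arguments mirror the ones used above):

python3 scripts/launch_triton_server.py --world_size=2 \
    --model_repo=/tensorrtllm_backend/triton_model_repo \
    --log --log-file=/tensorrtllm_backend/triton_log.txt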

zhaoxjmail commented 9 months ago

Can you share the triton log? You can generate this by adding --log to your launch_triton_server.py command line.

log.txt

This is the console log:

root@ubuntu:/tensorrtllm_backend# I1212 06:51:19.925554 173 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fab4c000000' with size 268435456
I1212 06:51:19.943953 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1212 06:51:19.943969 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1212 06:51:19.943972 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1212 06:51:19.943975 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1212 06:51:20.764893 173 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6280, GPU 17973 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6280, GPU 17983 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5824, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 6281, GPU 19255 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6281, GPU 19263 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6327, GPU 19281 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6327, GPU 19291 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6373, GPU 19311 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6373, GPU 19321 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[ubuntu:172  :0:210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x99)
[ubuntu:00172] *** Process received signal ***
[ubuntu:00172] Signal: Segmentation fault (11)
[ubuntu:00172] Signal code: Address not mapped (1)
[ubuntu:00172] Failing at address: 0x80000440
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
root@ubuntu:/tensorrtllm_backend# I1212 06:57:37.139275 219 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f4dec000000' with size 268435456
I1212 06:57:37.154286 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1212 06:57:37.154300 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1212 06:57:37.154302 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1212 06:57:37.154305 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1212 06:57:37.948030 219 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6280, GPU 17973 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6280, GPU 17983 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5824, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 6281, GPU 19255 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6281, GPU 19263 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6327, GPU 19281 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6327, GPU 19291 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6373, GPU 19311 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6373, GPU 19321 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[ubuntu:218  :0:253] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x99)
==== backtrace (tid:    253) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000000628f mca_pml_ucx_isend()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/pml/ucx/pml_ucx.c:862
 2 0x000000000008e82b ompi_coll_base_bcast_intra_generic()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:89
 3 0x000000000008efc1 ompi_coll_base_bcast_intra_pipeline()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:300
 4 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
 5 0x0000000000069841 PMPI_Bcast()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:114
 6 0x0000000000069841 PMPI_Bcast()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:41
 7 0x00000000000a1b64 tensorrt_llm::mpi::bcast()  :0
 8 0x000000000004b4fc triton::backend::inflight_batcher_llm::ModelInstanceState::get_inference_requests()  :0
 9 0x000000000004bef7 std::_Function_handler<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int), triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*)::{lambda(int)#1}>::_M_invoke()  :0
10 0x0000000000061a64 tensorrt_llm::batch_manager::GptManager::fetchNewRequests()  :0
11 0x0000000000064ac6 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop()  :0
12 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
13 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
14 0x0000000000125bf4 clone()  ???:0
=================================
[ubuntu:00218] *** Process received signal ***
[ubuntu:00218] Signal: Segmentation fault (11)
[ubuntu:00218] Signal code:  (-6)
[ubuntu:00218] Failing at address: 0xda
[ubuntu:00218] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3885fa2520]
[ubuntu:00218] [ 1] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xbf)[0x7f387002628f]
[ubuntu:00218] [ 2] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7f38711bf82b]
[ubuntu:00218] [ 3] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7f38711bffc1]
[ubuntu:00218] [ 4] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f3822898840]
[ubuntu:00218] [ 5] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f387119a841]
[ubuntu:00218] [ 6] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa1b64)[0x7f3793bbfb64]
[ubuntu:00218] [ 7] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4b4fc)[0x7f3793b694fc]
[ubuntu:00218] [ 8] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4bef7)[0x7f3793b69ef7]
[ubuntu:00218] [ 9] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x61a64)[0x7f3793b7fa64]
[ubuntu:00218] [10] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64ac6)[0x7f3793b82ac6]
[ubuntu:00218] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f3886264253]
[ubuntu:00218] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f3885ff4ac3]
[ubuntu:00218] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f3886085bf4]
[ubuntu:00218] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
zhaoxjmail commented 9 months ago

Can you share the triton log? You can generate this by adding --log to your launch_triton_server.py command line.

@schetlur-nv What's the progress on this issue?

schetlur-nv commented 9 months ago

These error messages point to an incompatible engine. Can you share the steps you used to build the engine? Or it may be faster to try rebuilding the engine and deploying with the same version of TRT-LLM.
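A quick way to check for such a mismatch is to compare the TensorRT-LLM version in the environment that built the engine with the one inside the Triton container; a minimal sketch, assuming the tensorrt-llm Python package is installed in both:

# Run in the engine-build environment and inside the Triton container, then compare.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
pip3 show tensorrt-llm | grep -i version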

zhaoxjmail commented 9 months ago

These error messages point to an incompatible engine. Can you share the steps you used to build the engine? Or it may be faster to try rebuilding the engine and deploying with the same version of TRT-LLM.

@schetlur-nv thank you!

I am using v0.6.1.

tensorrt        9.2.0.post12.dev5
tensorrt-llm    0.6.1

1. I followed https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md to build TensorRT-LLM:

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

git clone --branch v0.6.1 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull

make -C docker release_build

2. I followed https://github.com/triton-inference-server/tensorrtllm_backend#option-3-build-via-docker to build tensorrtllm_backend:


git clone --branch v0.6.1 https://github.com/triton-inference-server/tensorrtllm_backend.git
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Additionally, this issue only occurs when using tensor parallelism.
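Since the crash appears only with tensor parallelism, it may also be worth confirming that the built engine actually recorded the expected 2-way configuration. A minimal sketch, assuming the engine files and their config.json sit under triton_model_repo/tensorrt_llm/1/ (adjust the path if gpt_model_path in config.pbtxt points elsewhere):

# Pretty-print the engine build config and look for the parallelism fields.
python3 -m json.tool /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/config.json | grep -iE 'parallel|world'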

achartier commented 8 months ago

Could you try v0.7.1 for TRT-LLM and tensorrtllm_backend? I was not able to reproduce the issue on this version with your configuration settings. If the error persists, could you share the steps you used to build the Bloom engine? This is the command I used:

trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/2-gpu/ --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --output_dir ./bloom/560M/trt_engines/fp16/2-gpu/ --paged_kv_cache --remove_input_padding

This is similar to the TRT-LLM example (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bloom#build-tensorrt-engines) but note the addition of the last 2 options.

zhaoxjmail commented 7 months ago

Could you try v0.7.1 for TRT-LLM and tensorrtllm_backend? I was not able to reproduce the issue on this version with your configuration settings. If the error persists, could you share the steps you used to build the Bloom engine? This is the command I used:

trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/2-gpu/ --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --output_dir ./bloom/560M/trt_engines/fp16/2-gpu/ --paged_kv_cache --remove_input_padding

This is similar to the TRT-LLM example (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bloom#build-tensorrt-engines) but note the addition of the last 2 options.

@achartier Thanks very much! I am using v0.6.1. The following is my build script:

# Use 2-way tensor parallelism on BLOOM
python build.py --model_dir /code/tensorrt_llm/saved_model_1201 \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --use_weight_only \
    --output_dir ./bloom/ \
    --world_size 2
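If staying on v0.6.1 rather than moving to v0.7.1, presumably the two options achartier pointed out would also need to be added to this build.py invocation; a hedged sketch, assuming the v0.6.1 BLOOM build.py accepts the same flag names as the trtllm-build command above:

# Same command as above, with the paged KV cache and padding removal options added (assumed flag names).
python build.py --model_dir /code/tensorrt_llm/saved_model_1201 \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --use_weight_only \
    --paged_kv_cache \
    --remove_input_padding \
    --output_dir ./bloom/ \
    --world_size 2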
achartier commented 7 months ago

Would you be able to try v0.7.1? It looks like the issue is fixed with that version.

zhaoxjmail commented 7 months ago

Would you be able to try v0.7.1? It looks like the issue is fixed with that version.

@achartier Thanks very much! I will try it

zhaoxjmail commented 7 months ago

@achartier I am using v0.7.1 but this error still exists. The following is the console log:

root@ubuntu:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo --log --log-file=./log.txt
root@ubuntu:/tensorrtllm_backend# I0129 02:15:21.739898 525 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f509e000000' with size 268435456
I0129 02:15:21.763659 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0129 02:15:21.763671 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0129 02:15:21.763674 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0129 02:15:21.763677 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0129 02:15:22.634696 525 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 8706 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8779, GPU 23263 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8781, GPU 23273 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 8706 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8779, GPU 9699 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8781, GPU 9709 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8702, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8702, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8935, GPU 10055 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 8935, GPU 10063 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8935, GPU 23617 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 8935, GPU 23625 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 8706 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8985, GPU 32827 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 8985, GPU 32837 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8702, now: CPU 0, GPU 17404 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8985, GPU 33021 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 8985, GPU 33029 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 17404 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[ubuntu:524  :0:562] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x99)
==== backtrace (tid:    562) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000000628f mca_pml_ucx_isend()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/pml/ucx/pml_ucx.c:862
 2 0x000000000008e82b ompi_coll_base_bcast_intra_generic()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:89
 3 0x000000000008efc1 ompi_coll_base_bcast_intra_pipeline()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:300
 4 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
 5 0x0000000000069841 PMPI_Bcast()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:114
 6 0x0000000000069841 PMPI_Bcast()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:41
 7 0x00000000000b6087 tensorrt_llm::mpi::MpiComm::bcast()  :0
 8 0x0000000000046ee3 triton::backend::inflight_batcher_llm::ModelInstanceState::get_inference_requests()  :0
 9 0x00000000000477c7 std::_Function_handler<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int), triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*)::{lambda(int)#1}>::_M_invoke()  model_instance_state.cc:0
10 0x000000000006c304 tensorrt_llm::batch_manager::GptManager::fetchNewRequests()  :0
11 0x000000000006e068 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop()  :0
12 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
13 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
14 0x0000000000125814 clone()  ???:0
=================================
[ubuntu:00524] *** Process received signal ***
[ubuntu:00524] Signal: Segmentation fault (11)
[ubuntu:00524] Signal code:  (-6)
[ubuntu:00524] Failing at address: 0x20c
[ubuntu:00524] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc184b8d520]
[ubuntu:00524] [ 1] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xbf)[0x7fc15803c28f]
[ubuntu:00524] [ 2] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7fc17023382b]
[ubuntu:00524] [ 3] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fc170233fc1]
[ubuntu:00524] [ 4] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fc15283e840]
[ubuntu:00524] [ 5] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7fc17020e841]
[ubuntu:00524] [ 6] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb6087)[0x7fc08c95f087]
[ubuntu:00524] [ 7] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x46ee3)[0x7fc08c8efee3]
[ubuntu:00524] [ 8] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x477c7)[0x7fc08c8f07c7]
[ubuntu:00524] [ 9] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6c304)[0x7fc08c915304]
[ubuntu:00524] [10] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6e068)[0x7fc08c917068]
[ubuntu:00524] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7fc184e4f253]
[ubuntu:00524] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7fc184bdfac3]
[ubuntu:00524] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7fc184c70814]
[ubuntu:00524] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

triton_server log: log.txt

ompi_info output:

                Package: Open MPI root@hpcx-builder02 Distribution
                Open MPI: 4.1.5rc2
  Open MPI repo revision: v4.1.5rc1-17-gdb10576f40
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.1.5rc2
  Open RTE repo revision: v4.1.5rc1-17-gdb10576f40
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.1.5rc2
      OPAL repo revision: v4.1.5rc1-17-gdb10576f40
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.1.5rc2
                  Prefix: /opt/hpcx/ompi
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: hpcx-builder02
           Configured by: root
           Configured on: Tue Aug 22 16:31:11 UTC 2023
          Configure host: hpcx-builder02
  Configure command line: '--prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi'
                          '--with-libevent=internal'
                          '--enable-mpi1-compatibility' '--without-xpmem'
                          '--with-cuda=/hpc/local/oss/cuda12.1.1'
                          '--with-slurm'
                          '--with-platform=contrib/platform/mellanox/optimized'
                          '--with-hcoll=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/hcoll'
                          '--with-ucx=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx'
                          '--with-ucc=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucc'
                Built by: 
                Built on: Tue Aug 22 16:41:49 UTC 2023
              Built host: hpcx-builder02
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 11.2.0
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /usr/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: never
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: yes
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.5)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.5)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.1.5)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.5)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.5)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.5)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.5)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.1.5)
            MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                           v4.1.5)

mpi4py version 3.1.5

achartier commented 7 months ago

Sorry, I have not been able to reproduce the issue yet. Meanwhile, could you try the following steps (a sketch of the adjusted build command follows the list):

  1. Add --paged_kv_cache --remove_input_padding to the engine build
  2. Remove --use_weight_only from the engine build
  3. Combination of 1 and 2
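
For reference, a minimal sketch of what a rebuild with steps 1 and 2 applied might look like; the checkpoint and output directories below are placeholders, and exact flag names can differ between TensorRT-LLM versions:

trtllm-build --checkpoint_dir <converted_checkpoint_dir> \
             --use_gemm_plugin float16 \
             --use_gpt_attention_plugin float16 \
             --paged_kv_cache \
             --remove_input_padding \
             --output_dir <engine_output_dir>
# --use_weight_only is deliberately omitted here (step 2); directories are placeholders
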
zhaoxjmail commented 7 months ago

> Sorry, I have not been able to reproduce the issue yet. Meanwhile, could you try the following steps:
>
>   1. Add --paged_kv_cache --remove_input_padding to the engine build
>   2. Remove --use_weight_only from the engine build
>   3. Combination of 1 and 2

OK, thanks. I will try it soon.

zhaoxjmail commented 7 months ago

@achartier this error still exists. Below are the script and log:

# Use 4-way tensor parallelism on BLOOM
python convert_checkpoint.py --model_dir ./saved_model_1201 \
                             --dtype float16 \
                             --output_dir ./bloom/trt_ckpt/fp16/4-gpu/ \
                             --world_size 4

trtllm-build --checkpoint_dir ./bloom/trt_ckpt/fp16/4-gpu/ \
             --use_gemm_plugin float16 \
             --use_gpt_attention_plugin float16 \
             --paged_kv_cache \
             --remove_input_padding \
             --output_dir ./bloom/trt_engines/fp16/4-gpu/

root@ubuntu:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo --log --log-file=./log.txt
root@ubuntu:/tensorrtllm_backend# I0131 00:16:48.740042 118 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f0e4e000000' with size 268435456
I0131 00:16:48.740084 119 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7ff4de000000' with size 268435456
I0131 00:16:48.741732 117 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fbb14000000' with size 268435456
I0131 00:16:48.761204 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0131 00:16:48.761215 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0131 00:16:48.761218 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0131 00:16:48.761221 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0131 00:16:48.762586 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0131 00:16:48.762600 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0131 00:16:48.762603 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0131 00:16:48.762606 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0131 00:16:48.768432 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0131 00:16:48.768446 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0131 00:16:48.768450 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0131 00:16:48.768453 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0131 00:16:51.370436 118 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0131 00:16:51.371312 119 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0131 00:16:51.393087 117 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5610, GPU 79052 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 5610, GPU 79062 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5332, now: CPU 0, GPU 10663 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5610, GPU 79182 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5610, GPU 79190 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 10663 (MiB)
[ubuntu:116  :0:148] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:    148) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000051862 ucs_list_del()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/list.h:105
 2 0x0000000000051862 ucs_arbiter_dispatch_nonempty()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/arbiter.c:284
 3 0x000000000001abfe ucs_arbiter_dispatch()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/arbiter.h:386
 4 0x000000000001abfe uct_mm_iface_progress()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/uct/sm/mm/base/mm_iface.c:388
 5 0x000000000004ea5a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/callbackq.h:211
 6 0x000000000004ea5a uct_worker_progress()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/uct/api/uct.h:2777
 7 0x000000000004ea5a ucp_worker_progress()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucp/core/ucp_worker.c:2885
 8 0x000000000003a8f4 opal_progress()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/opal/runtime/opal_progress.c:231
 9 0x00000000000412bd ompi_sync_wait_mt()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/opal/threads/wait_sync.c:85
10 0x000000000005463b ompi_request_wait_completion()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/../ompi/request/request.h:428
11 0x000000000005463b ompi_request_default_wait()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/request/req_wait.c:42
12 0x0000000000093c93 ompi_coll_base_sendrecv_actual()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_util.c:62
13 0x00000000000952c8 ompi_coll_base_allreduce_intra_recursivedoubling()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_allreduce.c:219
14 0x00000000000969d1 ompi_coll_base_allreduce_intra_ring()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_allreduce.c:373
15 0x000000000000608f ompi_coll_tuned_allreduce_intra_dec_fixed()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:216
16 0x0000000000068a13 PMPI_Allreduce()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:113
17 0x0000000000068a13 opal_obj_update()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/class/opal_object.h:534
18 0x0000000000068a13 PMPI_Allreduce()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:116
19 0x0000000000068a13 PMPI_Allreduce()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:46
20 0x00000000000b675e tensorrt_llm::mpi::MpiComm::allreduce()  :0
21 0x000000000009de98 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getMaxNumTokens()  :0
22 0x00000000000f6a26 tensorrt_llm::runtime::GptSession::createKvCacheManager()  :0
23 0x00000000000f8572 tensorrt_llm::runtime::GptSession::setup()  :0
24 0x00000000000f8ab5 tensorrt_llm::runtime::GptSession::GptSession()  :0
25 0x0000000000090c39 tensorrt_llm::batch_manager::TrtGptModelV1::TrtGptModelV1()  :0
26 0x000000000007107d tensorrt_llm::batch_manager::TrtGptModelFactory::create()  :0
27 0x00000000000663b0 tensorrt_llm::batch_manager::GptManager::GptManager()  :0
28 0x0000000000048fa1 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState()  :0
29 0x0000000000049fc2 triton::backend::inflight_batcher_llm::ModelInstanceState::Create()  :0
30 0x000000000003bad5 TRITONBACKEND_ModelInstanceInitialize()  ???:0
31 0x00000000001a7226 triton::core::TritonModelInstance::ConstructAndInitializeInstance()  :0
32 0x00000000001a8466 triton::core::TritonModelInstance::CreateInstance()  :0
33 0x000000000018b165 triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}::operator()()  backend_model.cc:0
34 0x000000000018b7a6 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<triton::core::Status>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}> >, triton::core::Status> >::_M_invoke()  backend_model.cc:0
35 0x0000000000197a1d std::__future_base::_State_baseV2::_M_do_set()  :0
36 0x0000000000099ee8 pthread_mutexattr_setkind_np()  ???:0
37 0x0000000000181feb std::__future_base::_Deferred_state<std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}> >, triton::core::Status>::_M_complete_async()  backend_model.cc:0
38 0x0000000000191dc5 triton::core::TritonModel::PrepareInstances()  :0
39 0x0000000000196d36 triton::core::TritonModel::Create()  :0
40 0x0000000000287330 triton::core::ModelLifeCycle::CreateModel()  :0
41 0x000000000028aa23 std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(triton::core::ModelIdentifier const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#2}>::_M_invoke()  model_lifecycle.cc:0
42 0x00000000003ded82 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()  thread_pool.cc:0
43 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
44 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
45 0x0000000000125814 clone()  ???:0
=================================
[ubuntu:00116] *** Process received signal ***
[ubuntu:00116] Signal: Segmentation fault (11)
[ubuntu:00116] Signal code:  (-6)
[ubuntu:00116] Failing at address: 0x74
[ubuntu:00116] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f287b38d520]
[ubuntu:00116] [ 1] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0x72)[0x7f284a553862]
[ubuntu:00116] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(+0x1abfe)[0x7f284a4d5bfe]
[ubuntu:00116] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x5a)[0x7f284a6d7a5a]
[ubuntu:00116] [ 4] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7f28606a08f4]
[ubuntu:00116] [ 5] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7f28606a72bd]
[ubuntu:00116] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait+0x24b)[0x7f286089063b]
[ubuntu:00116] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xd3)[0x7f28608cfc93]
[ubuntu:00116] [ 8] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x298)[0x7f28608d12c8]
[ubuntu:00116] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_ring+0x8a1)[0x7f28608d29d1]
[ubuntu:00116] [10] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x4f)[0x7f2811f9208f]
[ubuntu:00116] [11] /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Allreduce+0x73)[0x7f28608a4a13]
[ubuntu:00116] [12] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb675e)[0x7f27e495e75e]
[ubuntu:00116] [13] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9de98)[0x7f27e4945e98]
[ubuntu:00116] [14] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xf6a26)[0x7f27e499ea26]
[ubuntu:00116] [15] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xf8572)[0x7f27e49a0572]
[ubuntu:00116] [16] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xf8ab5)[0x7f27e49a0ab5]
[ubuntu:00116] [17] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x90c39)[0x7f27e4938c39]
[ubuntu:00116] [18] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7107d)[0x7f27e491907d]
[ubuntu:00116] [19] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x663b0)[0x7f27e490e3b0]
[ubuntu:00116] [20] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48fa1)[0x7f27e48f0fa1]
[ubuntu:00116] [21] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49fc2)[0x7f27e48f1fc2]
[ubuntu:00116] [22] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(TRITONBACKEND_ModelInstanceInitialize+0x65)[0x7f27e48e3ad5]
[ubuntu:00116] [23] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226)[0x7f287bd89226]
[ubuntu:00116] [24] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466)[0x7f287bd8a466]
[ubuntu:00116] [25] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165)[0x7f287bd6d165]
[ubuntu:00116] [26] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6)[0x7f287bd6d7a6]
[ubuntu:00116] [27] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d)[0x7f287bd79a1d]
[ubuntu:00116] [28] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x7f287b3e4ee8]
[ubuntu:00116] [29] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb)[0x7f287bd63feb]
[ubuntu:00116] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

triton server log: log.txt

My machine's OS version: Linux ubuntu 5.15.0-91-generic #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

GPU info:

Wed Jan 31 09:14:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:34:00.0 Off |                    0 |
| N/A   42C    P0              66W / 300W |  13195MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          Off | 00000000:35:00.0 Off |                    0 |
| N/A   34C    P0              45W / 300W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80GB PCIe          Off | 00000000:9D:00.0 Off |                    0 |
| N/A   33C    P0              43W / 300W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80GB PCIe          Off | 00000000:9E:00.0 Off |                    0 |
| N/A   33C    P0              43W / 300W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

pip list:

Package                  Version
------------------------ -----------------
absl-py                  2.1.0
accelerate               0.25.0
aiohttp                  3.9.1
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
blinker                  1.4
Brotli                   1.1.0
build                    1.0.3
certifi                  2023.11.17
cfgv                     3.4.0
charset-normalizer       3.3.2
click                    8.1.7
colored                  2.2.4
coloredlogs              15.0.1
coverage                 7.4.0
cryptography             3.4.8
cuda-python              12.3.0
datasets                 2.16.1
dbus-python              1.2.18
diffusers                0.15.0
dill                     0.3.7
distlib                  0.3.8
distro                   1.7.0
einops                   0.7.0
evaluate                 0.4.1
exceptiongroup           1.2.0
execnet                  2.0.2
filelock                 3.13.1
fire                     0.5.0
flatbuffers              23.5.26
frozenlist               1.4.1
fsspec                   2023.10.0
gevent                   23.9.1
geventhttpclient         2.0.2
graphviz                 0.20.1
greenlet                 3.0.3
grpcio                   1.60.0
httplib2                 0.20.2
huggingface-hub          0.20.3
humanfriendly            10.0
identify                 2.5.33
idna                     3.6
importlib-metadata       4.6.4
iniconfig                2.0.0
janus                    1.0.0
jeepney                  0.7.1
Jinja2                   3.1.3
joblib                   1.3.2
keyring                  23.5.0
lark                     1.1.9
launchpadlib             1.10.16
lazr.restfulclient       0.14.4
lazr.uri                 1.0.6
markdown-it-py           3.0.0
MarkupSafe               2.1.4
mdurl                    0.1.2
more-itertools           8.10.0
mpi4py                   3.1.5
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.15
mypy                     1.8.0
mypy-extensions          1.0.0
networkx                 3.2.1
ninja                    1.11.1.1
nltk                     3.8.1
nodeenv                  1.8.0
numpy                    1.26.2
nvidia-ammo              0.5.1
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.18.1
nvidia-nvjitlink-cu12    12.3.101
nvidia-nvtx-cu12         12.1.105
oauthlib                 3.2.0
onnx                     1.15.0
onnx-graphsurgeon        0.3.25
onnxruntime              1.16.3
onnxsim                  0.4.35
optimum                  1.16.2
packaging                23.2
pandas                   2.2.0
parameterized            0.9.0
pillow                   10.2.0
pip                      23.3.1
platformdirs             4.1.0
pluggy                   1.4.0
polygraphy               0.48.1
pre-commit               3.6.0
protobuf                 4.25.2
psutil                   5.9.8
py                       1.11.0
pyarrow                  15.0.0
pyarrow-hotfix           0.6
pybind11                 2.11.1
pybind11-stubgen         2.4.2
Pygments                 2.17.2
PyGObject                3.42.1
PyJWT                    2.3.0
pynvml                   11.5.0
pyparsing                2.4.7
pyproject_hooks          1.0.0
pytest                   7.4.4
pytest-cov               4.1.0
pytest-forked            1.6.0
pytest-xdist             3.5.0
python-apt               2.4.0+ubuntu2
python-dateutil          2.8.2
python-rapidjson         1.14
pytz                     2023.3.post1
PyYAML                   6.0.1
regex                    2023.12.25
requests                 2.31.0
responses                0.18.0
rich                     13.7.0
rouge-score              0.1.2
safetensors              0.4.2
scipy                    1.12.0
SecretStorage            3.3.1
sentencepiece            0.1.99
setuptools               69.0.2
six                      1.16.0
sympy                    1.12
tabulate                 0.9.0
tensorrt                 9.2.0.post12.dev5
tensorrt-llm             0.8.0.dev20240123
termcolor                2.4.0
tokenizers               0.15.1
tomli                    2.0.1
torch                    2.1.2+cu121
torchprofile             0.0.4
torchvision              0.16.2+cu121
tqdm                     4.66.1
transformers             4.36.1
triton                   2.1.0
tritonclient             2.41.1
typing_extensions        4.8.0
tzdata                   2023.4
urllib3                  2.1.0
virtualenv               20.25.0
wadllib                  1.3.6
wheel                    0.42.0
xxhash                   3.4.1
yarl                     1.9.4
zipp                     1.0.0
zope.event               5.0
zope.interface           6.1