zhaoxjmail opened this issue 9 months ago
Can you share the Triton log? You can generate it by adding --log to your launch_triton_server.py command line.
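For reference, a launch invocation with logging enabled might look like the sketch below. The model-repo path, world size, and the `--log-file` flag are assumptions for illustration; the `--log` flag is the one requested above. Check `python3 launch_triton_server.py --help` in your checkout for the exact options supported by your version.

```shell
# Hypothetical sketch: launch the Triton server via the tensorrtllm_backend
# helper script with verbose logging enabled. Paths and world size are
# placeholders for a 2-GPU tensor-parallel engine.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /tensorrtllm_backend/triton_model_repo \
    --log \
    --log-file triton_log.txt
```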
Here is the console log:
root@ubuntu:/tensorrtllm_backend# I1212 06:51:19.925554 173 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fab4c000000' with size 268435456
I1212 06:51:19.943953 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1212 06:51:19.943969 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1212 06:51:19.943972 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1212 06:51:19.943975 173 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1212 06:51:20.764893 173 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6280, GPU 17973 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6280, GPU 17983 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5824, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 6281, GPU 19255 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6281, GPU 19263 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6327, GPU 19281 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6327, GPU 19291 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6373, GPU 19311 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6373, GPU 19321 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[ubuntu:172 :0:210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x99)
[ubuntu:00172] *** Process received signal ***
[ubuntu:00172] Signal: Segmentation fault (11)
[ubuntu:00172] Signal code: Address not mapped (1)
[ubuntu:00172] Failing at address: 0x80000440
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
root@ubuntu:/tensorrtllm_backend# I1212 06:57:37.139275 219 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f4dec000000' with size 268435456
I1212 06:57:37.154286 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1212 06:57:37.154300 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1212 06:57:37.154302 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1212 06:57:37.154305 219 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1212 06:57:37.948030 219 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5897, GPU 6807 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 5898, GPU 6817 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5823, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6131, GPU 8215 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6131, GPU 8223 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6177, GPU 8245 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 6178, GPU 8255 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6224, GPU 8275 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6224, GPU 8285 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5823 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 5829 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6280, GPU 17973 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6280, GPU 17983 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5824, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 6281, GPU 19255 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6281, GPU 19263 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6327, GPU 19281 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6327, GPU 19291 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6373, GPU 19311 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6373, GPU 19321 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11647 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_num_tokens' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[ubuntu:218 :0:253] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x99)
==== backtrace (tid: 253) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000000628f mca_pml_ucx_isend() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/pml/ucx/pml_ucx.c:862
2 0x000000000008e82b ompi_coll_base_bcast_intra_generic() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:89
3 0x000000000008efc1 ompi_coll_base_bcast_intra_pipeline() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:300
4 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
5 0x0000000000069841 PMPI_Bcast() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:114
6 0x0000000000069841 PMPI_Bcast() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:41
7 0x00000000000a1b64 tensorrt_llm::mpi::bcast() :0
8 0x000000000004b4fc triton::backend::inflight_batcher_llm::ModelInstanceState::get_inference_requests() :0
9 0x000000000004bef7 std::_Function_handler<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int), triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*)::{lambda(int)#1}>::_M_invoke() :0
10 0x0000000000061a64 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() :0
11 0x0000000000064ac6 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() :0
12 0x00000000000dc253 std::error_code::default_error_condition() ???:0
13 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
14 0x0000000000125bf4 clone() ???:0
=================================
[ubuntu:00218] *** Process received signal ***
[ubuntu:00218] Signal: Segmentation fault (11)
[ubuntu:00218] Signal code: (-6)
[ubuntu:00218] Failing at address: 0xda
[ubuntu:00218] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3885fa2520]
[ubuntu:00218] [ 1] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xbf)[0x7f387002628f]
[ubuntu:00218] [ 2] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7f38711bf82b]
[ubuntu:00218] [ 3] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7f38711bffc1]
[ubuntu:00218] [ 4] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f3822898840]
[ubuntu:00218] [ 5] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f387119a841]
[ubuntu:00218] [ 6] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa1b64)[0x7f3793bbfb64]
[ubuntu:00218] [ 7] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4b4fc)[0x7f3793b694fc]
[ubuntu:00218] [ 8] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4bef7)[0x7f3793b69ef7]
[ubuntu:00218] [ 9] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x61a64)[0x7f3793b7fa64]
[ubuntu:00218] [10] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64ac6)[0x7f3793b82ac6]
[ubuntu:00218] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f3886264253]
[ubuntu:00218] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f3885ff4ac3]
[ubuntu:00218] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f3886085bf4]
[ubuntu:00218] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Can you share the triton log? You can generate this by adding `--log` to your `launch_triton_server.py` command line.
@schetlur-nv What's the progress on this issue?
These error messages point to an incompatible engine. Can you share the steps you used to build the engine? Or it may be faster to try rebuilding the engine and deploying with the same version of TRT-LLM.
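One quick way to confirm an engine/runtime mismatch like the one suspected above is to compare the TRT-LLM version recorded in the engine's `config.json` against the installed `tensorrt_llm` package. The sketch below is hypothetical, not official tooling: the key layouts probed (`version` at the top level, or `builder_config.tensorrt_llm_version`) are assumptions that may differ between releases.

```python
# Hypothetical sketch: read the TRT-LLM version stored in an engine directory's
# config.json. The key names probed here are assumptions; they have varied
# across TRT-LLM releases, so adjust to match your engine's config layout.
import json
from pathlib import Path


def engine_build_version(engine_dir):
    """Return the TRT-LLM version recorded in config.json, or None if absent."""
    cfg = json.loads((Path(engine_dir) / "config.json").read_text())
    # Probe a few key layouts seen in different releases (not exhaustive).
    for keys in (("version",), ("builder_config", "tensorrt_llm_version")):
        node = cfg
        try:
            for k in keys:
                node = node[k]
            return node
        except (KeyError, TypeError):
            continue
    return None


if __name__ == "__main__":
    import tempfile

    # Demo against a dummy engine dir; point this at your real engine directory.
    with tempfile.TemporaryDirectory() as d:
        (Path(d) / "config.json").write_text(json.dumps({"version": "0.7.1"}))
        built = engine_build_version(d)
        print(f"engine built with TRT-LLM {built}")
        # In a real deployment, compare against the installed package:
        # import tensorrt_llm; assert built == tensorrt_llm.__version__
```

If the recorded version differs from the installed `tensorrt_llm`, rebuild the engine with the same version you deploy with, as suggested above.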
@schetlur-nv thank you!
I am using v0.6.1:
tensorrt 9.2.0.post12.dev5, tensorrt-llm 0.6.1
1. I followed https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md to build TensorRT-LLM:
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git clone --branch v0.6.1 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull
make -C docker release_build
2. I followed https://github.com/triton-inference-server/tensorrtllm_backend#option-3-build-via-docker to build tensorrtllm_backend:
git clone --branch v0.6.1 https://github.com/triton-inference-server/tensorrtllm_backend.git
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
Additionally, this issue only occurs when using tensor parallelism.
Could you try v0.7.1 for TRT-LLM and tensorrtllm_backend? I was not able to reproduce the issue on this version with your configuration settings. If the error persists, could you share the steps you used to build the Bloom engine? This is the command I used:
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/2-gpu/ --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --output_dir ./bloom/560M/trt_engines/fp16/2-gpu/ --paged_kv_cache --remove_input_padding
This is similar to the TRT-LLM example (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bloom#build-tensorrt-engines) but note the addition of the last 2 options.
@achartier Thanks very much! I am using v0.6.1. Below is my build script:
# Use 2-way tensor parallelism on BLOOM
python build.py --model_dir /code/tensorrt_llm/saved_model_1201 \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--output_dir ./bloom/ \
--world_size 2
Would you be able to try v0.7.1? It looks like the issue is fixed with that version.
@achartier Thanks very much! I will try it.
@achartier I am using v0.7.1 but this error still exists. Below is the console log:
root@ubuntu:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo --log --log-file=./log.txt
root@ubuntu:/tensorrtllm_backend# I0129 02:15:21.739898 525 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f509e000000' with size 268435456
I0129 02:15:21.763659 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0129 02:15:21.763671 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0129 02:15:21.763674 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0129 02:15:21.763677 525 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0129 02:15:22.634696 525 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 8706 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8779, GPU 23263 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8781, GPU 23273 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 8706 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8779, GPU 9699 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8781, GPU 9709 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8702, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8702, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8935, GPU 10055 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 8935, GPU 10063 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8935, GPU 23617 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 8935, GPU 23625 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8702 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 8706 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8985, GPU 32827 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 8985, GPU 32837 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8702, now: CPU 0, GPU 17404 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8985, GPU 33021 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 8985, GPU 33029 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 17404 (MiB)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[ubuntu:524 :0:562] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x99)
==== backtrace (tid: 562) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000000628f mca_pml_ucx_isend() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/pml/ucx/pml_ucx.c:862
2 0x000000000008e82b ompi_coll_base_bcast_intra_generic() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:89
3 0x000000000008efc1 ompi_coll_base_bcast_intra_pipeline() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_bcast.c:300
4 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
5 0x0000000000069841 PMPI_Bcast() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:114
6 0x0000000000069841 PMPI_Bcast() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pbcast.c:41
7 0x00000000000b6087 tensorrt_llm::mpi::MpiComm::bcast() :0
8 0x0000000000046ee3 triton::backend::inflight_batcher_llm::ModelInstanceState::get_inference_requests() :0
9 0x00000000000477c7 std::_Function_handler<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int), triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*)::{lambda(int)#1}>::_M_invoke() model_instance_state.cc:0
10 0x000000000006c304 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() :0
11 0x000000000006e068 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() :0
12 0x00000000000dc253 std::error_code::default_error_condition() ???:0
13 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
14 0x0000000000125814 clone() ???:0
=================================
[ubuntu:00524] *** Process received signal ***
[ubuntu:00524] Signal: Segmentation fault (11)
[ubuntu:00524] Signal code: (-6)
[ubuntu:00524] Failing at address: 0x20c
[ubuntu:00524] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc184b8d520]
[ubuntu:00524] [ 1] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xbf)[0x7fc15803c28f]
[ubuntu:00524] [ 2] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7fc17023382b]
[ubuntu:00524] [ 3] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fc170233fc1]
[ubuntu:00524] [ 4] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fc15283e840]
[ubuntu:00524] [ 5] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7fc17020e841]
[ubuntu:00524] [ 6] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb6087)[0x7fc08c95f087]
[ubuntu:00524] [ 7] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x46ee3)[0x7fc08c8efee3]
[ubuntu:00524] [ 8] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x477c7)[0x7fc08c8f07c7]
[ubuntu:00524] [ 9] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6c304)[0x7fc08c915304]
[ubuntu:00524] [10] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6e068)[0x7fc08c917068]
[ubuntu:00524] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7fc184e4f253]
[ubuntu:00524] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7fc184bdfac3]
[ubuntu:00524] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7fc184c70814]
[ubuntu:00524] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Triton server log: log.txt
`ompi_info` shows:
Package: Open MPI root@hpcx-builder02 Distribution
Open MPI: 4.1.5rc2
Open MPI repo revision: v4.1.5rc1-17-gdb10576f40
Open MPI release date: Unreleased developer copy
Open RTE: 4.1.5rc2
Open RTE repo revision: v4.1.5rc1-17-gdb10576f40
Open RTE release date: Unreleased developer copy
OPAL: 4.1.5rc2
OPAL repo revision: v4.1.5rc1-17-gdb10576f40
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.1.5rc2
Prefix: /opt/hpcx/ompi
Configured architecture: x86_64-pc-linux-gnu
Configure host: hpcx-builder02
Configured by: root
Configured on: Tue Aug 22 16:31:11 UTC 2023
Configure host: hpcx-builder02
Configure command line: '--prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi'
'--with-libevent=internal'
'--enable-mpi1-compatibility' '--without-xpmem'
'--with-cuda=/hpc/local/oss/cuda12.1.1'
'--with-slurm'
'--with-platform=contrib/platform/mellanox/optimized'
'--with-hcoll=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/hcoll'
'--with-ucx=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx'
'--with-ucc=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucc'
Built by:
Built on: Tue Aug 22 16:41:49 UTC 2023
Built host: hpcx-builder02
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler and/or Open
MPI, does not support the following: array
subsections, direct passthru (where possible) to
underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 11.2.0
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fort compiler: gfortran
Fort compiler abs: /usr/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, ORTE progress: yes, Event lib:
yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: never
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: yes
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
v4.1.5)
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.5)
MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
v4.1.5)
MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
v4.1.5)
MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
v4.1.5)
MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
v4.1.5)
MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
v4.1.5)
MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.5)
MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
v4.1.5)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.5)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
v4.1.5)
              MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                             v4.1.5)
mpi4py version 3.1.5
Sorry, I have not been able to reproduce the issue yet. Meanwhile, could you try the following steps:
1. Add `--paged_kv_cache --remove_input_padding` to the engine build
2. Remove `--use_weight_only` from the engine build
3. Combination of 1 and 2
OK, thanks, I will try it soon.
@achartier This error still exists. Below are the script and log:
# Use 4-way tensor parallelism on BLOOM
python convert_checkpoint.py --model_dir ./saved_model_1201 \
--dtype float16 \
--output_dir ./bloom/trt_ckpt/fp16/4-gpu/ \
--world_size 4
trtllm-build --checkpoint_dir ./bloom/trt_ckpt/fp16/4-gpu/ \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--paged_kv_cache \
--remove_input_padding \
--output_dir ./bloom/trt_engines/fp16/4-gpu/
root@ubuntu:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo --log --log-file=./log.txt
root@ubuntu:/tensorrtllm_backend# I0131 00:16:48.740042 118 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f0e4e000000' with size 268435456
I0131 00:16:48.740084 119 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7ff4de000000' with size 268435456
I0131 00:16:48.741732 117 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fbb14000000' with size 268435456
I0131 00:16:48.761204 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0131 00:16:48.761215 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0131 00:16:48.761218 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0131 00:16:48.761221 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0131 00:16:48.762586 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0131 00:16:48.762600 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0131 00:16:48.762603 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0131 00:16:48.762606 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0131 00:16:48.768432 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0131 00:16:48.768446 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0131 00:16:48.768450 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0131 00:16:48.768453 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0131 00:16:51.370436 118 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0131 00:16:51.371312 119 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0131 00:16:51.393087 117 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5407, GPU 7296 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 5409, GPU 7306 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5331, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5563, GPU 7516 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 5564, GPU 7524 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5331 (MiB)
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 69394759680 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 564736 tokens in paged KV cache.
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 5335 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5610, GPU 79052 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 5610, GPU 79062 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5332, now: CPU 0, GPU 10663 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5610, GPU 79182 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5610, GPU 79190 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 10663 (MiB)
[ubuntu:116 :0:148] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 148) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000051862 ucs_list_del() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/list.h:105
2 0x0000000000051862 ucs_arbiter_dispatch_nonempty() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/arbiter.c:284
3 0x000000000001abfe ucs_arbiter_dispatch() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/arbiter.h:386
4 0x000000000001abfe uct_mm_iface_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/uct/sm/mm/base/mm_iface.c:388
5 0x000000000004ea5a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/callbackq.h:211
6 0x000000000004ea5a uct_worker_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/uct/api/uct.h:2777
7 0x000000000004ea5a ucp_worker_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucp/core/ucp_worker.c:2885
8 0x000000000003a8f4 opal_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/opal/runtime/opal_progress.c:231
9 0x00000000000412bd ompi_sync_wait_mt() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/opal/threads/wait_sync.c:85
10 0x000000000005463b ompi_request_wait_completion() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/../ompi/request/request.h:428
11 0x000000000005463b ompi_request_default_wait() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/request/req_wait.c:42
12 0x0000000000093c93 ompi_coll_base_sendrecv_actual() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_util.c:62
13 0x00000000000952c8 ompi_coll_base_allreduce_intra_recursivedoubling() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_allreduce.c:219
14 0x00000000000969d1 ompi_coll_base_allreduce_intra_ring() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_allreduce.c:373
15 0x000000000000608f ompi_coll_tuned_allreduce_intra_dec_fixed() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:216
16 0x0000000000068a13 PMPI_Allreduce() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:113
17 0x0000000000068a13 opal_obj_update() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/class/opal_object.h:534
18 0x0000000000068a13 PMPI_Allreduce() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:116
19 0x0000000000068a13 PMPI_Allreduce() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:46
20 0x00000000000b675e tensorrt_llm::mpi::MpiComm::allreduce() :0
21 0x000000000009de98 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getMaxNumTokens() :0
22 0x00000000000f6a26 tensorrt_llm::runtime::GptSession::createKvCacheManager() :0
23 0x00000000000f8572 tensorrt_llm::runtime::GptSession::setup() :0
24 0x00000000000f8ab5 tensorrt_llm::runtime::GptSession::GptSession() :0
25 0x0000000000090c39 tensorrt_llm::batch_manager::TrtGptModelV1::TrtGptModelV1() :0
26 0x000000000007107d tensorrt_llm::batch_manager::TrtGptModelFactory::create() :0
27 0x00000000000663b0 tensorrt_llm::batch_manager::GptManager::GptManager() :0
28 0x0000000000048fa1 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState() :0
29 0x0000000000049fc2 triton::backend::inflight_batcher_llm::ModelInstanceState::Create() :0
30 0x000000000003bad5 TRITONBACKEND_ModelInstanceInitialize() ???:0
31 0x00000000001a7226 triton::core::TritonModelInstance::ConstructAndInitializeInstance() :0
32 0x00000000001a8466 triton::core::TritonModelInstance::CreateInstance() :0
33 0x000000000018b165 triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}::operator()() backend_model.cc:0
34 0x000000000018b7a6 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<triton::core::Status>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}> >, triton::core::Status> >::_M_invoke() backend_model.cc:0
35 0x0000000000197a1d std::__future_base::_State_baseV2::_M_do_set() :0
36 0x0000000000099ee8 pthread_mutexattr_setkind_np() ???:0
37 0x0000000000181feb std::__future_base::_Deferred_state<std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}> >, triton::core::Status>::_M_complete_async() backend_model.cc:0
38 0x0000000000191dc5 triton::core::TritonModel::PrepareInstances() :0
39 0x0000000000196d36 triton::core::TritonModel::Create() :0
40 0x0000000000287330 triton::core::ModelLifeCycle::CreateModel() :0
41 0x000000000028aa23 std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(triton::core::ModelIdentifier const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#2}>::_M_invoke() model_lifecycle.cc:0
42 0x00000000003ded82 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() thread_pool.cc:0
43 0x00000000000dc253 std::error_code::default_error_condition() ???:0
44 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
45 0x0000000000125814 clone() ???:0
=================================
[ubuntu:00116] *** Process received signal ***
[ubuntu:00116] Signal: Segmentation fault (11)
[ubuntu:00116] Signal code: (-6)
[ubuntu:00116] Failing at address: 0x74
[ubuntu:00116] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f287b38d520]
[ubuntu:00116] [ 1] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0x72)[0x7f284a553862]
[ubuntu:00116] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(+0x1abfe)[0x7f284a4d5bfe]
[ubuntu:00116] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x5a)[0x7f284a6d7a5a]
[ubuntu:00116] [ 4] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7f28606a08f4]
[ubuntu:00116] [ 5] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7f28606a72bd]
[ubuntu:00116] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait+0x24b)[0x7f286089063b]
[ubuntu:00116] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xd3)[0x7f28608cfc93]
[ubuntu:00116] [ 8] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x298)[0x7f28608d12c8]
[ubuntu:00116] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_ring+0x8a1)[0x7f28608d29d1]
[ubuntu:00116] [10] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x4f)[0x7f2811f9208f]
[ubuntu:00116] [11] /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Allreduce+0x73)[0x7f28608a4a13]
[ubuntu:00116] [12] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb675e)[0x7f27e495e75e]
[ubuntu:00116] [13] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9de98)[0x7f27e4945e98]
[ubuntu:00116] [14] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xf6a26)[0x7f27e499ea26]
[ubuntu:00116] [15] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xf8572)[0x7f27e49a0572]
[ubuntu:00116] [16] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xf8ab5)[0x7f27e49a0ab5]
[ubuntu:00116] [17] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x90c39)[0x7f27e4938c39]
[ubuntu:00116] [18] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7107d)[0x7f27e491907d]
[ubuntu:00116] [19] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x663b0)[0x7f27e490e3b0]
[ubuntu:00116] [20] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48fa1)[0x7f27e48f0fa1]
[ubuntu:00116] [21] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49fc2)[0x7f27e48f1fc2]
[ubuntu:00116] [22] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(TRITONBACKEND_ModelInstanceInitialize+0x65)[0x7f27e48e3ad5]
[ubuntu:00116] [23] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226)[0x7f287bd89226]
[ubuntu:00116] [24] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466)[0x7f287bd8a466]
[ubuntu:00116] [25] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165)[0x7f287bd6d165]
[ubuntu:00116] [26] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6)[0x7f287bd6d7a6]
[ubuntu:00116] [27] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d)[0x7f287bd79a1d]
[ubuntu:00116] [28] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x7f287b3e4ee8]
[ubuntu:00116] [29] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb)[0x7f287bd63feb]
[ubuntu:00116] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
triton server log: log.txt
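Since the full log is long, a small filter helps isolate just the unique [TensorRT-LLM] warnings from log.txt (a throwaway helper, not part of any tool):

```python
def trtllm_warnings(log_text: str) -> list[str]:
    """Collect unique [TensorRT-LLM][WARNING] lines, preserving order."""
    seen = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("[TensorRT-LLM][WARNING]") and line not in seen:
            seen.append(line)
    return seen
```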
My OS version: Linux ubuntu 5.15.0-91-generic #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
GPU info:
Wed Jan 31 09:14:45 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A800 80GB PCIe Off | 00000000:34:00.0 Off | 0 |
| N/A 42C P0 66W / 300W | 13195MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80GB PCIe Off | 00000000:35:00.0 Off | 0 |
| N/A 34C P0 45W / 300W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80GB PCIe Off | 00000000:9D:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80GB PCIe Off | 00000000:9E:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
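As a side note, the k/v cache numbers in the log above are internally consistent. Assuming BLOOM-7B's dimensions (30 layers, hidden size 4096 — check the engine's config.json for the real values) split across 4 GPUs in fp16:

```python
# Per-token k/v cache footprint on one GPU, fp16, 4-way tensor parallelism.
# The model dimensions below are assumptions for BLOOM-7B, not read from
# the engine.
num_layers = 30
hidden_size = 4096
tp_size = 4
bytes_fp16 = 2
kv_factor = 2  # one K and one V vector per token per layer

bytes_per_token = kv_factor * bytes_fp16 * num_layers * hidden_size // tp_size
print(bytes_per_token)  # 122880

# The log reports "Allocate 69394759680 bytes" and "Using 564736 tokens":
print(69394759680 // 564736)  # 122880
```

So the allocation itself looks sane; the segfault happens later, inside an MPI allreduce during KV cache manager setup.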
pip list:
Package Version
------------------------ -----------------
absl-py 2.1.0
accelerate 0.25.0
aiohttp 3.9.1
aiosignal 1.3.1
async-timeout 4.0.3
attrs 23.2.0
blinker 1.4
Brotli 1.1.0
build 1.0.3
certifi 2023.11.17
cfgv 3.4.0
charset-normalizer 3.3.2
click 8.1.7
colored 2.2.4
coloredlogs 15.0.1
coverage 7.4.0
cryptography 3.4.8
cuda-python 12.3.0
datasets 2.16.1
dbus-python 1.2.18
diffusers 0.15.0
dill 0.3.7
distlib 0.3.8
distro 1.7.0
einops 0.7.0
evaluate 0.4.1
exceptiongroup 1.2.0
execnet 2.0.2
filelock 3.13.1
fire 0.5.0
flatbuffers 23.5.26
frozenlist 1.4.1
fsspec 2023.10.0
gevent 23.9.1
geventhttpclient 2.0.2
graphviz 0.20.1
greenlet 3.0.3
grpcio 1.60.0
httplib2 0.20.2
huggingface-hub 0.20.3
humanfriendly 10.0
identify 2.5.33
idna 3.6
importlib-metadata 4.6.4
iniconfig 2.0.0
janus 1.0.0
jeepney 0.7.1
Jinja2 3.1.3
joblib 1.3.2
keyring 23.5.0
lark 1.1.9
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
markdown-it-py 3.0.0
MarkupSafe 2.1.4
mdurl 0.1.2
more-itertools 8.10.0
mpi4py 3.1.5
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
mypy 1.8.0
mypy-extensions 1.0.0
networkx 3.2.1
ninja 1.11.1.1
nltk 3.8.1
nodeenv 1.8.0
numpy 1.26.2
nvidia-ammo 0.5.1
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.0
onnx 1.15.0
onnx-graphsurgeon 0.3.25
onnxruntime 1.16.3
onnxsim 0.4.35
optimum 1.16.2
packaging 23.2
pandas 2.2.0
parameterized 0.9.0
pillow 10.2.0
pip 23.3.1
platformdirs 4.1.0
pluggy 1.4.0
polygraphy 0.48.1
pre-commit 3.6.0
protobuf 4.25.2
psutil 5.9.8
py 1.11.0
pyarrow 15.0.0
pyarrow-hotfix 0.6
pybind11 2.11.1
pybind11-stubgen 2.4.2
Pygments 2.17.2
PyGObject 3.42.1
PyJWT 2.3.0
pynvml 11.5.0
pyparsing 2.4.7
pyproject_hooks 1.0.0
pytest 7.4.4
pytest-cov 4.1.0
pytest-forked 1.6.0
pytest-xdist 3.5.0
python-apt 2.4.0+ubuntu2
python-dateutil 2.8.2
python-rapidjson 1.14
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
responses 0.18.0
rich 13.7.0
rouge-score 0.1.2
safetensors 0.4.2
scipy 1.12.0
SecretStorage 3.3.1
sentencepiece 0.1.99
setuptools 69.0.2
six 1.16.0
sympy 1.12
tabulate 0.9.0
tensorrt 9.2.0.post12.dev5
tensorrt-llm 0.8.0.dev20240123
termcolor 2.4.0
tokenizers 0.15.1
tomli 2.0.1
torch 2.1.2+cu121
torchprofile 0.0.4
torchvision 0.16.2+cu121
tqdm 4.66.1
transformers 4.36.1
triton 2.1.0
tritonclient 2.41.1
typing_extensions 4.8.0
tzdata 2023.4
urllib3 2.1.0
virtualenv 20.25.0
wadllib 1.3.6
wheel 0.42.0
xxhash 3.4.1
yarl 1.9.4
zipp 1.0.0
zope.event 5.0
zope.interface 6.1
I built the engine using 2-way tensor parallelism on BLOOM-7B, then I started Docker:
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/project/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm:latest bash
Inside the container I run:
python scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo
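When switching between 2-way and 4-way builds, it is easy for --world_size to drift out of sync with the parallelism baked into the engine. A rough sketch of a consistency check (the 'mapping', 'tp_size', and 'pp_size' keys are what recent TensorRT-LLM versions write into config.json; older builds may use different keys, so treat them as placeholders):

```python
import json

def engine_world_size(config_path: str) -> int:
    """Read tp_size * pp_size from an engine config.json.

    Key names here are assumptions based on recent TensorRT-LLM
    output; verify them against your own config.json.
    """
    with open(config_path) as f:
        cfg = json.load(f)
    mapping = cfg.get("mapping", cfg)
    return mapping.get("tp_size", 1) * mapping.get("pp_size", 1)
```

If this value disagrees with the --world_size passed to launch_triton_server.py, the ranks will disagree during collectives, which matches the PMPI_Allreduce frame in the backtrace above.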