triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

How to deploy one model instance across multiple GPUs to tackle the OOM problem? #462

Open shil3754 opened 1 month ago

shil3754 commented 1 month ago

I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory. I have applied INT8 weight-only quantization, so the engine I get is about 8 GB. I have also set --world_size to 2 to use 2-way tensor parallelism.

But when I try to start the Triton server, I always get the Out of Memory error. It seems that one instance is launched on each GPU, but there is not enough memory on either of them. I know that 32 GB of memory combined is enough to deploy the model, as I have done that on another machine, but I don't know how to deploy the model across 2 GPUs.

Can anyone help?

byshiue commented 3 weeks ago

Have you referred to the document here about multi-GPU serving?
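For a tensor-parallel engine, serving on multiple GPUs means running one tritonserver process per MPI rank against the same model repository; launch_triton_server.py generates such a command for you. Roughly (illustrative and trimmed; the generated command also sets ports and a few other flags):

mpirun --allow-run-as-root \
    -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/tensorrtllm_backend/triton_model_repo \
        --backend-config=python,shm-region-prefix-name=prefix0 : \
    -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/tensorrtllm_backend/triton_model_repo \
        --backend-config=python,shm-region-prefix-name=prefix1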

shil3754 commented 2 weeks ago

Have you referred to the document here about multi-GPU serving?

Yes, I have. I just cannot launch the Triton server. This is how I build the engines:

python /tensorrtllm_backend/tensorrt_llm/examples/baichuan/build.py --model_version v2_7b \
                --model_dir $model_dir \
                --max_batch_size 2 \
                --max_input_len 512 \
                --dtype float16 \
                --use_inflight_batching \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --world_size 2 \
                --output_dir /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1
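With --world_size 2 this writes one engine file per rank (plus the generated config) into the output directory, so listing it is a quick sanity check that both rank engines are actually there (exact file names depend on the build):

ls -lh /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1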

The engines work perfectly fine when I run:

python /tensorrtllm_backend/tensorrt_llm/examples/run.py --input_text "Can you tell me about Euro 2024?" \
                 --max_output_len=500 \
                 --tokenizer_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1 \
                 --engine_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1

But I just cannot launch the Triton server with the following command:

python /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo

It always fails with the OOM error. I have decreased max_batch_size from 16 to 2 and max_input_len from 2048 to 512, but the problem persists. In theory I have enough GPU memory to deploy one model, but it seems that two models are launched (before crashing due to OOM) because there are two GPUs. Is there a way to launch just one model across two GPUs?

byshiue commented 2 weeks ago

Please check your max_tokens_in_paged_kv_cache and kv_cache_free_gpu_mem_fraction. They affect the memory usage, too. You can see more details in modify-the-model-configuration.
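Both are plain string parameters in the tensorrt_llm model's config.pbtxt inside your model repository. A minimal sketch of the relevant blocks (the values below are only examples, not recommendations):

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "2048"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.5"
  }
}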

Also, for

python /tensorrtllm_backend/tensorrt_llm/examples/run.py --input_text "Can you tell me about Euro 2024?" \
                 --max_output_len=500 \
                 --tokenizer_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1 \
                 --engine_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1

it looks like you ran the example with only a single GPU. Is that a typo?

Also, please share which branch/version of TRT-LLM you are using. We also encourage you to try the latest main branch, because we changed some default settings recently.
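For example, something like this should print the installed TensorRT-LLM wheel version (assuming the tensorrt_llm Python package in the container is the one the engines were built with), and the backend branch can be checked with git if it is a git checkout:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
git -C /tensorrtllm_backend branch --show-current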

shil3754 commented 2 weeks ago

I tried many different values for kv_cache_free_gpu_mem_fraction, but none worked. I haven't changed max_tokens_in_paged_kv_cache though. What is the suggested value for it?

You are right that I missed mpirun -n 2 --allow-run-as-root \ before python /tensorrtllm_backend/tensorrt_llm/examples/run.py in my previous post. I did use two GPUs to run the model, and there were indeed two engine files.

(image: the versions of tensorrtllm_backend and TensorRT-LLM I am using)
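For completeness, the command I actually used to run the example on both GPUs was of the form:

mpirun -n 2 --allow-run-as-root \
    python /tensorrtllm_backend/tensorrt_llm/examples/run.py --input_text "Can you tell me about Euro 2024?" \
                 --max_output_len=500 \
                 --tokenizer_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1 \
                 --engine_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1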

byshiue commented 3 days ago

I suggest you try a smaller problem size first, like batch size 1, input length 128, output length 128, and set max_tokens_in_paged_kv_cache to 256, ignoring kv_cache_free_gpu_mem_fraction for now.
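For example, rebuilding with your earlier flags but the smaller sizes would look roughly like this (assuming the Baichuan build.py accepts --max_output_len; the max_tokens_in_paged_kv_cache value of 256 then goes into the tensorrt_llm config.pbtxt, with kv_cache_free_gpu_mem_fraction left unset):

python /tensorrtllm_backend/tensorrt_llm/examples/baichuan/build.py --model_version v2_7b \
                --model_dir $model_dir \
                --max_batch_size 1 \
                --max_input_len 128 \
                --max_output_len 128 \
                --dtype float16 \
                --use_inflight_batching \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --world_size 2 \
                --output_dir /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1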

shil3754 commented 3 days ago

I tried the settings you suggested, but it still doesn't work. I now get a Segmentation fault instead. The following is the error log:

['mpirun', '--allow-run-as-root', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--grpc-port=8001', '--http-port=8000', '--metrics-port=8002', '--model-repository=/tensorrtllm_backend/triton_modelrepo', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix0', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--grpc-port=8001', '--http-port=8000', '--metrics-port=8002', '--model-repository=/tensorrtllm_backend/triton_modelrepo', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix1', ':']

root@VM-247-myos:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1#
I0626 08:53:25.348604 8324 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x14804c000000' with size 268435456
I0626 08:53:25.348775 8323 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x14c776000000' with size 268435456
I0626 08:53:25.359529 8324 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0626 08:53:25.359547 8324 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0626 08:53:25.359817 8323 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0626 08:53:25.359831 8323 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
W0626 08:53:25.659773 8323 server.cc:238] failed to enable peer access for some device pairs
W0626 08:53:25.662517 8324 server.cc:238] failed to enable peer access for some device pairs
I0626 08:53:25.662628 8323 model_lifecycle.cc:461] loading: postprocessing:1
I0626 08:53:25.662690 8323 model_lifecycle.cc:461] loading: preprocessing:1
W0626 08:53:25.662887 8323 model_lifecycle.cc:108] ignore version directory 'v2_7B_int8_bs1_input128' which fails to convert to integral number
W0626 08:53:25.662922 8323 model_lifecycle.cc:108] ignore version directory 'v2_7B_int8_bs2_input1024' which fails to convert to integral number
I0626 08:53:25.662961 8323 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0626 08:53:25.663023 8323 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I0626 08:53:25.665255 8324 model_lifecycle.cc:461] loading: postprocessing:1
I0626 08:53:25.665313 8324 model_lifecycle.cc:461] loading: preprocessing:1
W0626 08:53:25.665454 8324 model_lifecycle.cc:108] ignore version directory 'v2_7B_int8_bs1_input128' which fails to convert to integral number
W0626 08:53:25.665493 8324 model_lifecycle.cc:108] ignore version directory 'v2_7B_int8_bs2_input1024' which fails to convert to integral number
I0626 08:53:25.665528 8324 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0626 08:53:25.665591 8324 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I0626 08:53:25.683767 8323 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 1)
I0626 08:53:25.683770 8323 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 0)
I0626 08:53:25.683826 8323 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 0)
I0626 08:53:25.684297 8323 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 1)
I0626 08:53:25.686153 8324 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 0)
I0626 08:53:25.686189 8324 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 0)
I0626 08:53:25.686193 8324 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 1)
I0626 08:53:25.686244 8324 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 1)
I0626 08:53:25.766392 8324 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 0)
I0626 08:53:25.766453 8324 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 1)
I0626 08:53:25.770785 8323 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 0)
I0626 08:53:25.770844 8323 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 1)
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter version cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'version' not found
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter version cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'version' not found
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
I0626 08:53:26.207877 8324 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
I0626 08:53:26.246268 8323 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
I0626 08:53:26.593198 8324 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0626 08:53:26.595729 8323 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0626 08:53:26.599854 8323 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0626 08:53:26.626931 8324 model_lifecycle.cc:818] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 256
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 4567 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4620, GPU 5328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 4622, GPU 5338 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 256
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 4567 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4620, GPU 5328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 4622, GPU 5338 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +4563, now: CPU 0, GPU 4563 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +4563, now: CPU 0, GPU 4563 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4853, GPU 5470 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4853, GPU 5470 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 4854, GPU 5478 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 4854, GPU 5478 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 4563 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 4563 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4887, GPU 5498 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4887, GPU 5498 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 4887, GPU 5508 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 4887, GPU 5508 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 4563 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 4563 (MiB)
[TensorRT-LLM][INFO] Allocate 67108864 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 256 total tokens in paged KV cache, and 2 blocks per sequence
[TensorRT-LLM][INFO] Allocate 67108864 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 256 total tokens in paged KV cache, and 2 blocks per sequence
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter version cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'version' not found
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 256
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 4567 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4928, GPU 10156 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +12, now: CPU 4928, GPU 10168 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +4563, now: CPU 0, GPU 9126 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4928, GPU 10192 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 4929, GPU 10200 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 9126 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 4962, GPU 10220 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 4962, GPU 10230 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 9126 (MiB)
[VM-247-myos:8323 :0:8333] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 8333) ====
 0 0x0000000000042520 sigaction() ???:0
 1 0x0000000000051862 ucs_list_del() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/list.h:105
 2 0x0000000000051862 ucs_arbiter_dispatch_nonempty() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/arbiter.c:284
 3 0x000000000001abfe ucs_arbiter_dispatch() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/arbiter.h:386
 4 0x000000000001abfe uct_mm_iface_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/uct/sm/mm/base/mm_iface.c:388
 5 0x000000000004ea5a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucs/datastruct/callbackq.h:211
 6 0x000000000004ea5a uct_worker_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/uct/api/uct.h:2777
 7 0x000000000004ea5a ucp_worker_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx-bf8f1b66767231605126806d837cba26d1b12afa/src/ucp/core/ucp_worker.c:2885
 8 0x000000000003a8f4 opal_progress() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/opal/runtime/opal_progress.c:231
 9 0x00000000000412bd ompi_sync_wait_mt() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/opal/threads/wait_sync.c:85
10 0x000000000005463b ompi_request_wait_completion() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/../ompi/request/request.h:428
11 0x000000000005463b ompi_request_default_wait() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/request/req_wait.c:42
12 0x0000000000093c93 ompi_coll_base_sendrecv_actual() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_util.c:62
13 0x00000000000952c8 ompi_coll_base_allreduce_intra_recursivedoubling() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/base/coll_base_allreduce.c:219
14 0x000000000000608f ompi_coll_tuned_allreduce_intra_dec_fixed() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:216
15 0x0000000000068a13 PMPI_Allreduce() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:113
16 0x0000000000068a13 opal_obj_update() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/class/opal_object.h:534
17 0x0000000000068a13 PMPI_Allreduce() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:116
18 0x0000000000068a13 PMPI_Allreduce() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pallreduce.c:46
19 0x00000000000af57b tensorrt_llm::mpi::allreduce() :0
20 0x000000000009bf9e tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getMaxNumTokens() :0
21 0x000000000007f6b0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching() :0
22 0x000000000006f57f tensorrt_llm::batch_manager::TrtGptModelFactory::create() :0
23 0x0000000000066378 tensorrt_llm::batch_manager::GptManager::GptManager() :0
24 0x0000000000047edc triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState() :0
25 0x0000000000048f82 triton::backend::inflight_batcher_llm::ModelInstanceState::Create() :0
26 0x0000000000038dd5 TRITONBACKEND_ModelInstanceInitialize() ???:0
27 0x00000000001a4a86 triton::core::TritonModelInstance::ConstructAndInitializeInstance() :0
28 0x00000000001a5cc6 triton::core::TritonModelInstance::CreateInstance() :0
29 0x0000000000188c15 triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptrtriton::core::TritonModelInstance, std::allocator >, std::vector<std::shared_ptrtriton::core::TritonModelInstance, std::allocator >)::{lambda()#1}::operator()() backend_model.cc:0
30 0x0000000000189256 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::future_base::_Result_base::_Deleter> (), std::future_base::_Task_setter<std::unique_ptr<std::future_base::_Resulttriton::core::Status, std::future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptrtriton::core::TritonModelInstance, std::allocator >, std::vector<std::shared_ptrtriton::core::TritonModelInstance, std::allocator >)::{lambda()#1}> >, triton::core::Status> >::_M_invoke() backend_model.cc:0
31 0x000000000019527d std::future_base::_State_baseV2::_M_do_set() :0
32 0x0000000000099ee8 pthread_mutexattr_setkind_np() ???:0
33 0x000000000017f97b std::future_base::_Deferred_state<std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptrtriton::core::TritonModelInstance, std::allocator >, std::vector<std::shared_ptrtriton::core::TritonModelInstance, std::allocator >)::{lambda()#1}> >, triton::core::Status>::_M_complete_async() backend_model.cc:0
34 0x000000000018f695 triton::core::TritonModel::PrepareInstances() :0
35 0x000000000019450b triton::core::TritonModel::Create() :0
36 0x000000000027d610 triton::core::ModelLifeCycle::CreateModel() :0
37 0x0000000000280d03 std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(triton::core::ModelIdentifier const&, std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, inference::ModelConfig const&, bool, bool, std::shared_ptrtriton::core::TritonRepoAgentModelList const&, std::function<void (triton::core::Status)>&&)::{lambda()#2}>::_M_invoke() model_lifecycle.cc:0
38 0x00000000003cd8b2 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() thread_pool.cc:0
39 0x00000000000dc253 std::error_code::default_error_condition() ???:0
40 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
41 0x0000000000125bf4 clone() ???:0

[VM-247-myos:08323] Process received signal
[VM-247-myos:08323] Signal: Segmentation fault (11)
[VM-247-myos:08323] Signal code: (-6)
[VM-247-myos:08323] Failing at address: 0x2083
[VM-247-myos:08323] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x14c7bfba2520]
[VM-247-myos:08323] [ 1] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0x72)[0x14c6ad78b862]
[VM-247-myos:08323] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(+0x1abfe)[0x14c6e27c5bfe]
[VM-247-myos:08323] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x5a)[0x14c6ad90fa5a]
[VM-247-myos:08323] [ 4] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x14c7b0ad58f4]
[VM-247-myos:08323] [ 5] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x14c7b0adc2bd]
[VM-247-myos:08323] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait+0x24b)[0x14c7b0cc563b]
[VM-247-myos:08323] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xd3)[0x14c7b0d04c93]
[VM-247-myos:08323] [ 8] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x298)[0x14c7b0d062c8]
[VM-247-myos:08323] [ 9] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x4f)[0x14c6ac7d608f]
[VM-247-myos:08323] [10] /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Allreduce+0x73)[0x14c7b0cd9a13]
[VM-247-myos:08323] [11] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xaf57b)[0x14c73f53657b]
[VM-247-myos:08323] [12] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9bf9e)[0x14c73f522f9e]
[VM-247-myos:08323] [13] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7f6b0)[0x14c73f5066b0]
[VM-247-myos:08323] [14] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6f57f)[0x14c73f4f657f]
[VM-247-myos:08323] [15] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66378)[0x14c73f4ed378]
[VM-247-myos:08323] [16] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x47edc)[0x14c73f4ceedc]
[VM-247-myos:08323] [17] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48f82)[0x14c73f4cff82]
[VM-247-myos:08323] [18] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(TRITONBACKEND_ModelInstanceInitialize+0x65)[0x14c73f4bfdd5]
[VM-247-myos:08323] [19] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86)[0x14c7c059aa86]
[VM-247-myos:08323] [20] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6)[0x14c7c059bcc6]
[VM-247-myos:08323] [21] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15)[0x14c7c057ec15]
[VM-247-myos:08323] [22] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256)[0x14c7c057f256]
[VM-247-myos:08323] [23] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d)[0x14c7c058b27d]
[VM-247-myos:08323] [24] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x14c7bfbf9ee8]
[VM-247-myos:08323] [25] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b)[0x14c7c057597b]
[VM-247-myos:08323] [26] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695)[0x14c7c0585695]
[VM-247-myos:08323] [27] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b)[0x14c7c058a50b]
[VM-247-myos:08323] [28] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610)[0x14c7c0673610]
[VM-247-myos:08323] [29] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03)[0x14c7c0676d03]
[VM-247-myos:08323] End of error message

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
I0626 08:54:06.181069 8978 pb_stub.cc:1815] Non-graceful termination detected.
I0626 08:54:06.182021 8974 pb_stub.cc:1815] Non-graceful termination detected.
I0626 08:54:06.263976 8379 pb_stub.cc:1815] Non-graceful termination detected.
I0626 08:54:06.267357 8381 pb_stub.cc:1815] Non-graceful termination detected.
I0626 08:54:06.273681 8386 pb_stub.cc:1815] Non-graceful termination detected.
I0626 08:54:06.277705 8399 pb_stub.cc:1815] Non-graceful termination detected.
mpirun noticed that process rank 0 with PID 0 on node VM-247-myos exited on signal 11 (Segmentation fault).