triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Launching a multi-GPU Triton server fails with an error #524

Open dwq370 opened 3 months ago

dwq370 commented 3 months ago

System Info

4 × NVIDIA L20

Who can help?

No response


Reproduction

step 1: create a container

```bash
docker run -it --gpus '"device=0,1,2,3"' --ipc=host --network=host \
    --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /home/qxzg/workspace/tensorrt-llm-triton/workload:/workspace \
    -v /home/qxzg/workspace/tensorrt-llm-triton/data:/models \
    -v /home/qxzg/.cache/modelscope/hub/qwen/Qwen2-72B-Instruct:/models/hf_models/qwen2-instruct-72b \
    -p 19000:9000 \
    triton_tensorrtllm_backend:v0.10.0 bash
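Before converting, all four GPUs should be visible inside the container (illustrative check, not part of the original steps):

```python
# Illustrative check that all four L20s are visible inside the
# container before converting/building.
import subprocess

subprocess.run(["nvidia-smi", "-L"], check=True)
```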

step 2: set environment variables

```bash
export HF_MODEL_DIR=/models/hf_models/qwen2-instruct-72b
export TMP_CHECKPOINT_DIR=/models/trt_output/checkpoints/Qwen2-72B/
export TRT_ENGINE_DIR=/models/trt_output/engines/Qwen2-72B/trt_engines/fp16/tp4
export MAX_BATCH_SIZE=1
mkdir -p ${TRT_ENGINE_DIR} ${TMP_CHECKPOINT_DIR}
```

step 3: convert the model

```bash
python /workspace/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py \
    --model_dir ${HF_MODEL_DIR} \
    --dtype float16 \
    --output_dir ${TMP_CHECKPOINT_DIR} \
    --tp_size 4
```
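Listing the checkpoint directory should show one shard per tensor-parallel rank (illustrative check; exact file names depend on the TensorRT-LLM version):

```python
# Illustrative: confirm the conversion produced per-rank shards.
# Exact file names/layout may differ across TensorRT-LLM versions.
import os

print(sorted(os.listdir("/models/trt_output/checkpoints/Qwen2-72B/")))
```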

step 4: build the engine

```bash
trtllm-build --checkpoint_dir $TMP_CHECKPOINT_DIR \
    --output_dir $TRT_ENGINE_DIR \
    --gemm_plugin float16 \
    --strongly_typed \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --max_input_len=16384 \
    --max_output_len=512 \
    --max_batch_size 1 \
    --tp_size=4
```

The engine directory then looks like:

```
root@ubuntu-H3C-UniServer-R4900-G5:/app# ll /models/trt_output/engines/Qwen2-72B/trt_engines/fp16/tp4/
total 149414208
drwxr-xr-x 2 root root        4096 Jul 5 05:47 ./
drwxr-xr-x 3 root root        4096 Jul 5 03:02 ../
-rw-r--r-- 1 root root        5637 Jul 5 04:59 config.json
-rw-r--r-- 1 root root 38250028580 Jul 5 05:00 rank0.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:03 rank1.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:06 rank2.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:09 rank3.engine
```
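It can also help to sanity-check the build parameters recorded alongside the engines (illustrative snippet; the config.json schema varies across TensorRT-LLM versions, so this just dumps it rather than assuming specific keys):

```python
# Illustrative sanity check: print the engine build configuration.
import json

engine_dir = "/models/trt_output/engines/Qwen2-72B/trt_engines/fp16/tp4"
with open(f"{engine_dir}/config.json") as f:
    config = json.load(f)
print(json.dumps(config, indent=2))
```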

step 5: create the model repository

```bash
cp -rf /workspace/tensorrtllm_backend/all_models/inflight_batcher_llm/ /app/qwen_ifb
cp -rf /workspace/tensorrtllm_backend/tools/fill_template.py /app/
```

```bash
python3 /app/fill_template.py -i /app/qwen_ifb/preprocessing/config.pbtxt \
    tokenizer_dir:${HF_MODEL_DIR},tokenizer_type:auto,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 /app/fill_template.py -i /app/qwen_ifb/postprocessing/config.pbtxt \
    tokenizer_dir:${HF_MODEL_DIR},tokenizer_type:auto,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 /app/fill_template.py -i /app/qwen_ifb/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 /app/fill_template.py -i /app/qwen_ifb/ensemble/config.pbtxt \
    triton_max_batch_size:${MAX_BATCH_SIZE}
python3 /app/fill_template.py -i /app/qwen_ifb/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:${TRT_ENGINE_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
```
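For context, fill_template.py substitutes the comma-separated key:value pairs into the `${...}` placeholders of each config.pbtxt. A minimal sketch of that idea (my simplification, not the actual script, which parses more carefully):

```python
# Minimal sketch of the template-filling idea behind fill_template.py
# (not the actual script): replace ${key} placeholders in a pbtxt
# with the values given as "key1:value1,key2:value2".
import sys

def fill_template(pbtxt_path: str, assignments: str) -> None:
    with open(pbtxt_path) as f:
        text = f.read()
    # Naive split: breaks if a value itself contains a comma,
    # which none of the values above do.
    for pair in assignments.split(","):
        key, _, value = pair.partition(":")
        text = text.replace("${" + key + "}", value)
    with open(pbtxt_path, "w") as f:
        f.write(text)

if __name__ == "__main__":
    fill_template(sys.argv[1], sys.argv[2])
```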

step 6: launch the Triton server

```bash
python3 /app/scripts/launch_triton_server.py --world_size 4 --model_repo=/app/qwen_ifb/ \
    --http_port 18000 --grpc_port 18001 --metrics_port 18002 --log --log-file ./triton_log.txt
```
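For reference, launch_triton_server.py wraps tritonserver in a single mpirun invocation with world_size ranks. A simplified sketch of what I understand it to do (assumed behavior, not the verbatim script); if every rank receives the same endpoint ports, only one of them can bind, which matters for the error below:

```python
# Simplified sketch (assumed behavior, not the verbatim script):
# one mpirun invocation starting `world_size` tritonserver
# processes, one per MPI rank / GPU.
import subprocess

def launch(world_size: int, model_repo: str,
           http_port: int, grpc_port: int, metrics_port: int):
    cmd = ["mpirun", "--allow-run-as-root"]
    for rank in range(world_size):
        cmd += [
            "-n", "1", "tritonserver",
            f"--model-repository={model_repo}",
            f"--http-port={http_port}",       # same ports on every rank
            f"--grpc-port={grpc_port}",       # -> only one rank can bind
            f"--metrics-port={metrics_port}",
            ":",
        ]
    return subprocess.Popen(cmd[:-1])  # drop the trailing ":"

if __name__ == "__main__":
    launch(4, "/app/qwen_ifb/", 18000, 18001, 18002)
```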

Expected behavior

The Triton server starts successfully.

Actual behavior

The engines load on all four ranks, but startup then fails: several of the mpirun-spawned tritonserver processes try to bind the same gRPC port 18001, and the server exits with "failed to load all models":

```
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0705 08:07:40.119730 79387 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.119915 79385 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.120172 79387 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0705 08:07:40.120320 79385 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0705 08:07:40.120417 79387 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------+
| Backend     | Path                                                            | Config                                             |
+-------------+-----------------------------------------------------------------+----------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------+

I0705 08:07:40.120489 79385 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------+
| Backend     | Path                                                            | Config                                             |
+-------------+-----------------------------------------------------------------+----------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------+

I0705 08:07:40.120495 79387 server.cc:677]
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1       | READY  |
+--------------+---------+--------+

I0705 08:07:40.120545 79385 server.cc:677]
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1       | READY  |
+--------------+---------+--------+

I0705 08:07:40.121322 79386 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.121743 79386 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0705 08:07:40.121919 79386 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------+
| Backend     | Path                                                            | Config                                             |
+-------------+-----------------------------------------------------------------+----------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------+

I0705 08:07:40.122013 79386 server.cc:677]
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1       | READY  |
+--------------+---------+--------+

I0705 08:07:40.179571 79387 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.179605 79387 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.179612 79387 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.179618 79387 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.192136 79386 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.192164 79386 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.192172 79386 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.192178 79386 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.199299 79387 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.199513 79387 tritonserver.cc:2538]
+----------------------------------+----------------------------------------------------------+
| Option                           | Value                                                    |
+----------------------------------+----------------------------------------------------------+
| server_id                        | triton                                                   |
| server_version                   | 2.44.0                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /app/qwen_ifb/                                           |
| model_control_mode               | MODE_EXPLICIT                                            |
| startup_models_0                 | tensorrt_llm                                             |
| strict_model_config              | 1                                                        |
| rate_limit                       | OFF                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                 |
| cuda_memory_pool_byte_size{1}    | 67108864                                                 |
| cuda_memory_pool_byte_size{2}    | 67108864                                                 |
| cuda_memory_pool_byte_size{3}    | 67108864                                                 |
| min_supported_compute_capability | 6.0                                                      |
| strict_readiness                 | 1                                                        |
| exit_timeout                     | 30                                                       |
| cache_enabled                    | 0                                                        |
+----------------------------------+----------------------------------------------------------+

I0705 08:07:40.199965 79385 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.200012 79385 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.200023 79385 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.200033 79385 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.201625 79387 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:18001
I0705 08:07:40.201876 79387 http_server.cc:4636] Started HTTPService at 0.0.0.0:18000
I0705 08:07:40.217713 79386 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.217959 79386 tritonserver.cc:2538]
+----------------------------------+----------------------------------------------------------+
| Option                           | Value                                                    |
+----------------------------------+----------------------------------------------------------+
| server_id                        | triton                                                   |
| server_version                   | 2.44.0                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /app/qwen_ifb/                                           |
| model_control_mode               | MODE_EXPLICIT                                            |
| startup_models_0                 | tensorrt_llm                                             |
| strict_model_config              | 1                                                        |
| rate_limit                       | OFF                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                 |
| cuda_memory_pool_byte_size{1}    | 67108864                                                 |
| cuda_memory_pool_byte_size{2}    | 67108864                                                 |
| cuda_memory_pool_byte_size{3}    | 67108864                                                 |
| min_supported_compute_capability | 6.0                                                      |
| strict_readiness                 | 1                                                        |
| exit_timeout                     | 30                                                       |
| cache_enabled                    | 0                                                        |
+----------------------------------+----------------------------------------------------------+

E0705 08:07:40.219476055 79386 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:18001' {created_time:"2024-07-05T08:07:40.219384819+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-07-05T08:07:40.219362598+00:00", children:[UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.21932451+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.21928166+00:00"}]}, UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.219357544+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.219352483+00:00"}]}]}]}
E0705 08:07:40.219789 79386 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use
I0705 08:07:40.229760 79385 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.230178 79385 tritonserver.cc:2538]
+----------------------------------+----------------------------------------------------------+
| Option                           | Value                                                    |
+----------------------------------+----------------------------------------------------------+
| server_id                        | triton                                                   |
| server_version                   | 2.44.0                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /app/qwen_ifb/                                           |
| model_control_mode               | MODE_EXPLICIT                                            |
| startup_models_0                 | tensorrt_llm                                             |
| strict_model_config              | 1                                                        |
| rate_limit                       | OFF                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                 |
| cuda_memory_pool_byte_size{1}    | 67108864                                                 |
| cuda_memory_pool_byte_size{2}    | 67108864                                                 |
| cuda_memory_pool_byte_size{3}    | 67108864                                                 |
| min_supported_compute_capability | 6.0                                                      |
| strict_readiness                 | 1                                                        |
| exit_timeout                     | 30                                                       |
| cache_enabled                    | 0                                                        |
+----------------------------------+----------------------------------------------------------+

E0705 08:07:40.232606618 79385 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:18001' {created_time:"2024-07-05T08:07:40.232429669+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-07-05T08:07:40.232381555+00:00", children:[UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.232300483+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.232221534+00:00"}]}, UNKNOWN:Unable to configure socket {created_time:"2024-07-05T08:07:40.232372083+00:00", fd:180, children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.232360498+00:00"}]}]}]}
E0705 08:07:40.232992 79385 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use
I0705 08:07:40.243374 79387 http_server.cc:320] Started Metrics Service at 0.0.0.0:18002
Cleaning up...
Cleaning up...
Cleaning up...
error: creating server: Internal - failed to load all models
```

Additional notes

No process was using ports 18000, 18001, and 18002 before launching the Triton server in the container.
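A quick way to confirm the ports are free before launch (illustrative Python, not from the repo). Note that the bind errors above come from the mpirun-spawned tritonserver ranks colliding with each other, so all three ports can be free beforehand and the failure can still occur:

```python
# Illustrative check that the Triton ports are free before launch.
import socket

def port_free(port: int, host: str = "0.0.0.0") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR avoids false "in use" from TIME_WAIT sockets.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

for port in (18000, 18001, 18002):
    print(port, "free" if port_free(port) else "in use")
```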

chrjxj commented 3 months ago

I can't reproduce the above; it works okay in my test.

triton_log.txt