triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Server gets stuck when using pipeline parallelism across multiple nodes #355

Open hezeli123 opened 6 months ago

hezeli123 commented 6 months ago

System Info

2 nodes x 4 L40S GPUs loading LLaMA-2 70B, one model: tensorrt_llm. Using image: nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3

Who can help?

No response

Reproduction

1. Build the engine:
   python build.py --model_dir xxx --dtype float16 --remove_input_padding --enable_context_fmha --multi_block_mode --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --use_inflight_batching --output_dir ./engine.inflight.tp4pp2.70b --world_size 8 --tp_size 4 --pp_size 2 --max_input_len 8192 --max_output_len 16384 --vocab_size=49954

2. Launch the server across both nodes:
   mpirun -np 8 --allow-run-as-root --hostfile myhosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ucx tritonserver --model-repository=xxx --disable-auto-complete-config

3. Send a request from the client:
   python3 inflight_batcher_llm_client.py -u xxx:8001 --text "Hello, how " --tokenizer-dir=70b -S --request-output-len 40
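(The contents of the myhosts file are not shown in these steps. Going by the hostfile posted later in this thread, it is presumably something along these lines, with the addresses as placeholders and slots set to the 4 GPUs available on each node so that mpirun -np 8 can place all 8 ranks:)

# assumed contents of "myhosts": one line per node, slots = GPUs on that node
<node-A-ip> slots=4
<node-B-ip> slots=4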

Expected behavior

The engine loads successfully and inference completes successfully.

actual behavior

When running inference from the client, the server gets stuck (the 4 GPUs on one node sit at 100% utilization while the other node's 4 GPUs stay at 0%):

WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240227 08:25:18.989490 35588 grpc_server.cc:2495] Started GRPCInferenceService at 0.0.0.0:8001
I20240227 08:25:18.991181 35588 http_server.cc:4997] Started HTTPService at 0.0.0.0:8000
I20240227 08:25:19.032207 35588 http_server.cc:282] Started Metrics Service at 0.0.0.0:8002
dg11-train-prod001-node-10-224-96-171:13983:14088 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [receive] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-171:13983:14088 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [receive] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-171:13985:14089 [2] NCCL INFO Channel 00/1 : 0[2] -> 1[2] [receive] via NET/IBext/2/Shared
dg11-train-prod001-node-10-224-96-171:13985:14089 [2] NCCL INFO Channel 01/1 : 0[2] -> 1[2] [receive] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-171:13984:14090 [1] NCCL INFO Channel 00/1 : 0[1] -> 1[1] [receive] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-171:13984:14090 [1] NCCL INFO Channel 01/1 : 0[1] -> 1[1] [receive] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-171:13986:14091 [3] NCCL INFO Channel 00/1 : 0[3] -> 1[3] [receive] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-171:13986:14091 [3] NCCL INFO Channel 01/1 : 0[3] -> 1[3] [receive] via NET/IBext/2/Shared
dg11-train-prod001-node-10-224-96-174:35588:35724 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [send] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-174:35588:35724 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [send] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-174:35589:35725 [1] NCCL INFO Channel 00/1 : 0[1] -> 1[1] [send] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-174:35589:35725 [1] NCCL INFO Channel 01/1 : 0[1] -> 1[1] [send] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-174:35590:35726 [2] NCCL INFO Channel 00/1 : 0[2] -> 1[2] [send] via NET/IBext/2/Shared
dg11-train-prod001-node-10-224-96-174:35590:35726 [2] NCCL INFO Channel 01/1 : 0[2] -> 1[2] [send] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-174:35591:35727 [3] NCCL INFO Channel 00/1 : 0[3] -> 1[3] [send] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-174:35591:35727 [3] NCCL INFO Channel 01/1 : 0[3] -> 1[3] [send] via NET/IBext/2/Shared

additional notes

When not using pipeline parallelism (tp=8 only), the server works fine.

datdo-msft commented 2 months ago

Hi @hezeli123, you said this works for you when not using pipeline parallelism. I assume you just omitted --pp_size or set it to 1 when you built the engines?

Also, when you ran mpirun, what does your hostfile look like? Just the IP addresses of the two nodes? Do you specify any slots?

I'm asking because I was looking to do something similar, i.e. no pp (pp_size=1) and tp=8. My mpirun command looked like this:

mpirun -v -wd /workspace --allow-run-as-root --hostfile hostfile -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ucx /opt/tritonserver/bin/tritonserver --model-repository tensorrtllm_backend/all_models/inflight_batcher_llm

and the hostfile looked like this:

10.0.0.4 port=2255 slots=1
10.0.0.5 port=2255 slots=1

But I ran into this error:

UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/executor/executorImpl.cpp:356)
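(The assertion above points at a likely cause: the engine was built for tp*pp = 8 ranks, but with slots=1 on each of the two hosts and no explicit -np, Open MPI launches only one rank per host, so the MPI world size is 2 rather than 8. A hostfile/launch pairing that provides one rank per GPU is sketched below; it reuses the paths and flags from the command above and is an assumption, not a verified fix:)

# sketch: one MPI rank per GPU, 4 per node, 8 in total so worldSize == tp*pp
# hostfile:
10.0.0.4 port=2255 slots=4
10.0.0.5 port=2255 slots=4
# launch with an explicit rank count:
mpirun -np 8 -v -wd /workspace --allow-run-as-root --hostfile hostfile \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ucx \
  /opt/tritonserver/bin/tritonserver --model-repository tensorrtllm_backend/all_models/inflight_batcher_llm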

hezeli123 commented 1 month ago

(quoting @datdo-msft's comment above)

When using tp=8, I built the engine with --world_size 8 --tp_size 8 and did not set pp. My hostfile:

10.222.96.174 slots=4
10.222.96.171 slots=4
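(For completeness, a sketch of what that tp-only build presumably looks like, reusing the flags from step 1 of the reproduction with only the parallelism arguments changed; the exact flag set and the output directory name are assumptions based on the commands quoted in this thread:)

# assumed tp=8, pp=1 engine build; other flags copied from the tp4/pp2 build above
python build.py --model_dir xxx --dtype float16 --remove_input_padding --enable_context_fmha \
  --multi_block_mode --use_gemm_plugin float16 --use_gpt_attention_plugin float16 \
  --use_inflight_batching --output_dir ./engine.inflight.tp8.70b \
  --world_size 8 --tp_size 8 \
  --max_input_len 8192 --max_output_len 16384 --vocab_size=49954
# sanity check: MPI world size must equal tp_size * pp_size, here 8 * 1 = 8,
# which matches slots=4 on each of the two nodes.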