triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Unable to launch Triton Server Across Multi Nodes #283

Open jianqiylz opened 6 months ago

jianqiylz commented 6 months ago

Hello, when trying to run tritonserver on a setup with 4 nodes, I hit a failure that suggests a mismatch between the number of GPUs per node and the tensor parallel (TP) * pipeline parallel (PP) sizes. The error message is as follows:

+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                                                                             |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                                                                                                                              |
| preprocessing    | 1       | READY                                                                                                                                                                              |
| tensorrt_llm     | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Number of GPUs per node 1 must be at least as large as TP (4) *  |
|                  |         | PP (1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:74)                                                                                                             |
|                  |         | 1       0x7f5c8b4936fd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x166fd) [0x7f5c8b4936fd]                                                                  |
|                  |         | 2       0x7f5c8b4aaa82 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x2da82) [0x7f5c8b4aaa82]                                                                  |
|                  |         | 3       0x7f5c8b5ad528 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x130528) [0x7f5c8b5ad528]                                                                 |
|                  |         | 4       0x7f5c8b4eeac8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x71ac8) [0x7f5c8b4eeac8]                                                                  |
|                  |         | 5       0x7f5c8b4e4e08 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67e08) [0x7f5c8b4e4e08]                                                                  |
|                  |         | 6       0x7f5c8b4c614e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4914e) [0x7f5c8b4c614e]                                                                  |
|                  |         | 7       0x7f5c8b4c7242 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4a242) [0x7f5c8b4c7242]                                                                  |
|                  |         | 8       0x7f5c8b4b6dd5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                 |
|                  |         | 9       0x7f5d06d9aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f5d06d9aa86]                                                                                 |
|                  |         | 10      0x7f5d06d9bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f5d06d9bcc6]                                                                                 |
|                  |         | 11      0x7f5d06d7ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f5d06d7ec15]                                                                                 |
|                  |         | 12      0x7f5d06d7f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f5d06d7f256]                                                                                 |
|                  |         | 13      0x7f5d06d8b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f5d06d8b27d]                                                                                 |
|                  |         | 14      0x7f5d063f9ee8 /lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f5d063f9ee8]                                                                                                  |
|                  |         | 15      0x7f5d06d7597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f5d06d7597b]                                                                                 |
|                  |         | 16      0x7f5d06d85695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f5d06d85695]                                                                                 |
|                  |         | 17      0x7f5d06d8a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f5d06d8a50b]                                                                                 |
|                  |         | 18      0x7f5d06e73610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f5d06e73610]                                                                                 |
|                  |         | 19      0x7f5d06e76d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f5d06e76d03]                                                                                 |
|                  |         | 20      0x7f5d06fc38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f5d06fc38b2]                                                                                 |
|                  |         | 21      0x7f5d06664253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5d06664253]                                                                                             |
|                  |         | 22      0x7f5d063f4ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5d063f4ac3]                                                                                                  |
|                  |         | 23      0x7f5d06486a40 /lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f5d06486a40]                                                                                                 |
| tensorrt_llm_bls | 1       | READY                                                                                                                                                                              |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
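
For context, the assertion from worldConfig.cpp quoted above reduces to a simple invariant. Below is a minimal Python sketch of that check as I read the error text; it is an illustration only, not the actual TensorRT-LLM source:

    # Sketch of the invariant asserted by the backend (per the error message):
    # a single node must expose at least tp_size * pp_size GPUs.
    def check_world_config(gpus_per_node: int, tp_size: int, pp_size: int) -> None:
        required = tp_size * pp_size
        if gpus_per_node < required:
            raise AssertionError(
                f"Number of GPUs per node {gpus_per_node} must be at least as "
                f"large as TP ({tp_size}) * PP ({pp_size})"
            )

    # The failing configuration from this issue: 4 nodes with 1 GPU each, TP=4, PP=1.
    try:
        check_world_config(gpus_per_node=1, tp_size=4, pp_size=1)
    except AssertionError as err:
        print(err)  # matches the message seen in the Triton log above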

Environment Information

Steps to Reproduce

  1. Built the Docker image using the following commands:

    cd tensorrtllm_backend
    git lfs install
    git submodule update --init --recursive
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
  2. Compiled the TensorRT engine within a container launched from the above image:

    cd tensorrt_llm
    python build.py --model_dir ./Yi-34B-Chat/ --dtype bfloat16 \
        --remove_input_padding --use_gpt_attention_plugin bfloat16 --use_gemm_plugin bfloat16 \
        --output_dir ./Yi-34B-Chat-out/bf16/4-gpu/  --rotary_base 1000000 --vocab_size 64000 --world_size 4 --tp_size 4 --gpus_per_node=1 \
        --enable_context_fmha --max_input_len 15360 --max_output_len 2048 --max_batch_size 4
  3. Before running the tests, I applied a workaround in the mapping.py file to align with my hardware (a non-invasive alternative is sketched after these steps):

    vim /usr/local/lib/python3.10/dist-packages/tensorrt_llm/mapping.py
    # Changed line 38 from "gpus_per_node=8" to "gpus_per_node=1"
  4. Tested multi-node execution with run.py. The following command worked fine and inference results were returned successfully:

    mpirun -np 4 --allow-run-as-root \
        --hostfile /tensorrtllm_backend/hostfile \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ucx  \
        python ../run.py --log_level="info" --max_output_len=40 --max_input_length=256 --tokenizer_dir ./Yi-34B-Chat/ \
        --engine_dir ./Yi-34B-Chat-out/bf16/4-gpu/ --input_text "what is llama?"
  5. Encountered the issue when starting the Triton server with the following command:

    mpirun -np 4 --allow-run-as-root \
        --hostfile /tensorrtllm_backend/hostfile \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ucx  \
        /opt/tritonserver/bin/tritonserver --model-repo=/tensorrtllm_backend/triton_model_repo --disable-auto-complete-config
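
As a side note on the workaround in step 3 (referenced above): rather than patching mapping.py inside site-packages, the gpus_per_node default can in principle be overridden wherever your own scripts construct a Mapping. The sketch below is hypothetical; the module path and constructor signature may differ across TensorRT-LLM versions, so verify against your installed copy of mapping.py:

    # Hypothetical sketch: pass gpus_per_node explicitly instead of editing the
    # installed mapping.py. Keyword names mirror the defaults visible around
    # line 38 of the file referenced in step 3.
    from tensorrt_llm.mapping import Mapping

    mapping = Mapping(world_size=4, rank=0, tp_size=4, pp_size=1, gpus_per_node=1)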

From what I could observe, it seems that tritonserver may not yet support running across multiple nodes. If my understanding is correct, could you please consider adding support for this functionality, or provide guidance on how I can work around this limitation?

I sincerely appreciate any assistance you can offer on this matter.

jianqiylz commented 6 months ago

@juney-nvidia Hello, could you please take a look at this issue? If you need any further information, please let me know.

jianqiylz commented 6 months ago

@byshiue @juney-nvidia pls

jfpichlme commented 2 weeks ago

Any updates?

datdo-msft commented 2 weeks ago

Hi @byshiue @juney-nvidia , did you get the chance to look into this?

@jianqiylz , wanted to check in, were you able to resolve this?