triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Can't build python+onnx+tensorrtllm backends r24.04 #7236

Open gulldan opened 3 months ago

gulldan commented 3 months ago

I'm following https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/compose.md to build the onnx+python+tensorrtllm backends.

1) As mentioned in the doc, I run:

git clone --single-branch --depth=1 -b r24.04 https://github.com/triton-inference-server/server.git
python3 compose.py --backend onnxruntime --backend python --repoagent checksum --image min,nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 --image full,nvcr.io/nvidia/tritonserver:24.04-py3

It builds, but when I start Triton server I get:

E0517 12:18:34.314931 164 model_lifecycle.cc:638] failed to load 'llama3_tensorrt_llm' version 1: Invalid argument: unable to find backend library for backend 'tensorrtllm', try specifying runtime on the model configuration.

Models using the python and onnxruntime backends load correctly.
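
A quick way to see which backend libraries actually made it into the composed image is to list the backends directory (the tritonserver tag below assumes compose.py's default output name; adjust it if you passed --output-name):

docker run --rm tritonserver ls /opt/tritonserver/backends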

How can I build a combined Docker image that supports all of these backends?

2)

python3 compose.py --backend tensorrtllm --backend python --backend onnxruntime --repoagent checksum --container-version 24.04

It fails because the tensorrtllm backend cannot be found:

=> CACHED [stage-1 16/23] COPY --chown=1000:1000 --from=full /opt/tritonserver/include include/                                                                                                               0.0s
 => CACHED [stage-1 17/23] COPY --chown=1000:1000 --from=full /opt/tritonserver/backends/python /opt/tritonserver/backends/python                                                                              0.0s
 => CACHED [stage-1 18/23] COPY --chown=1000:1000 --from=full /opt/tritonserver/backends/onnxruntime /opt/tritonserver/backends/onnxruntime                                                                    0.0s
 => ERROR [stage-1 19/23] COPY --chown=1000:1000 --from=full /opt/tritonserver/backends/tensorrtllm /opt/tritonserver/backends/tensorrtllm
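
If I list the backends shipped in the full image, there is no tensorrtllm directory, which is presumably why the COPY step has nothing to pull from:

docker run --rm nvcr.io/nvidia/tritonserver:24.04-py3 ls /opt/tritonserver/backends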
statiraju commented 3 months ago

Tracking ticket: [DLIS-6397]

rmccorm4 commented 3 months ago

Hi @gulldan, compose.py doesn't currently support the TensorRT-LLM backend (DLIS-6397).

You should be able to achieve something similar by using build.py with:

--backend tensorrtllm:r24.04
--backend python:r24.04
--backend onnxruntime:r24.04
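
Put together (along with whatever other flags your setup needs, e.g. --enable-gpu), that would look roughly like:

./build.py --enable-gpu --backend tensorrtllm:r24.04 --backend python:r24.04 --backend onnxruntime:r24.04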

https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/customization_guide/build.html#building-with-docker

Let us know if this helps for your use case.

gulldan commented 3 months ago

Thank you.

I tried:

./build.py --backend tensorrtllm:r24.04 --backend python:r24.04 --backend onnxruntime:r24.04 --enable-gpu --build-type Release --target-platform linux --endpoint grpc --endpoint http

but it failed (build log attached: build_log.txt).

Host info:
Linux 6.5.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 26 11:23:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Docker version 26.1.2, build 211e74b
cmake version 3.28.4
python 3.11.6
GeForce RTX 4090, Driver Version: 550.54.15