chenchunhui97 opened this issue 2 months ago
I get the exact same error using the tritonserver:24.05-trtllm-python-py3 container on an A100.
Set triton_backend to 'tensorrtllm' in the config.pbtxt for tensorrt_llm and it should work.
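For anyone hitting this, here is a minimal sketch of what that change looks like in the tensorrt_llm model's config.pbtxt (assuming the v0.10.0+ template, where the backend field is filled in from the triton_backend variable; all other fields omitted):

```
# Sketch only: the relevant lines of tensorrt_llm/config.pbtxt after
# filling the template. The shipped file has backend: "${triton_backend}".
name: "tensorrt_llm"
backend: "tensorrtllm"  # C++ backend; "python" selects model.py instead
```

Passing triton_backend:tensorrtllm to tools/fill_template.py when preparing the model repository should produce the same result.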
I think this was introduced because there is now a model.py file in tensorrt_llm/1 as of v0.10.0, but I have not come across anything explaining why this file is there or what purpose it serves compared to tensorrt_llm_bls. Maybe someone could point us in the right direction regarding the need for this new parameter and the new model.py file.
Thank you for the comments, @here4dadata. Your comment is correct. Some additional comments: model.py is the Python backend for running tensorrt_llm. (In comparison, if you set triton_backend to tensorrtllm, the C++ Triton backend is used.)
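To make "Python backend" concrete: a Triton Python-backend model is a model.py that exposes a TritonPythonModel class. The sketch below shows only that generic interface, not the actual contents of tensorrt_llm/1/model.py, which wraps the TensorRT-LLM runtime inside these same entry points:

```python
# Generic sketch of the Triton Python-backend interface.
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the config.pbtxt serialized as JSON;
        # the real model.py loads the TensorRT-LLM engine here.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            # The real model.py reads input tensors, e.g.:
            #   ids = pb_utils.get_input_tensor_by_name(request, "input_ids")
            # runs generation, and attaches the output tensors here.
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        return responses

    def finalize(self):
        pass
```

Setting triton_backend to python loads this file; setting it to tensorrtllm bypasses it in favor of the in-process C++ backend, which appears to be why the error goes away.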
System Info
Who can help?
@byshiue @sc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Model name: Qwen1.5-14b-Chat
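A typical way to launch the service with tensorrtllm_backend is shown below (a sketch, not taken from this report; the model repository path is an assumption):

```bash
# Illustrative paths; adjust to your checkout and engine location.
python3 scripts/launch_triton_server.py --world_size=1 \
    --model_repo=all_models/inflight_batcher_llm
```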
Expected behavior
Launch the service successfully.
actual behavior
additional notes