triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Deployment of TensorRT-LLM Model on Triton Server #379

Closed jasonngap1 closed 3 months ago

jasonngap1 commented 3 months ago

Hi, I am trying to deploy a Mistral-7B-Instruct model on the Triton server, but have run into difficulties. I successfully built my Mistral model with trtllm-build, following the llama example in the TensorRT-LLM repo, but I am not sure how to deploy it on the Triton server. There seem to be many ways to do so; I have tried creating a tensorrt_llm backend and an ensemble backend, but neither works. Could you advise on what I should do? I would like to create an endpoint so that I can pass a prompt to the Mistral model on the Triton server and get generated text back.
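For reference, the kind of request I want to make looks roughly like this, using the generate endpoint described in the backend README (the model name ensemble and port 8000 are assumptions based on the default setup):

curl -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'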

Here are the steps I have taken. After pulling the Mistral model weights, I converted the raw weights into the TensorRT-LLM checkpoint format:

python convert_checkpoint.py --model_dir Mistral-7B-Instruct-v0.2 \
                             --output_dir Mistral-7B-Instruct-TensorRT/ \
                             --dtype float16 \
                             --weight_only_precision int8

Next, I built the engine (this produces a config.json and a rank0.engine file):

trtllm-build --checkpoint_dir Mistral-7B-Instruct-TensorRT/ \
             --output_dir Mistral-7B-Instruct-compiled/ \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --max_input_len 32256
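To create the Triton model repository I roughly followed the backend README: copy the inflight batcher templates, drop the engine in, and fill in the config templates. A sketch of that step (paths and parameter values here are illustrative, and the exact parameter set accepted by tools/fill_template.py varies by release):

# copy the model templates from the tensorrtllm_backend repo
cp -r all_models/inflight_batcher_llm/ triton_model_repo/
# place the built engine where the tensorrt_llm model expects it
cp Mistral-7B-Instruct-compiled/* triton_model_repo/tensorrt_llm/1/
# fill in the tensorrt_llm config template; the pre/postprocessing templates
# additionally need tokenizer_dir pointed at the Mistral tokenizer
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:8,decoupled_mode:False,batching_strategy:inflight_fused_batching,engine_dir:triton_model_repo/tensorrt_llm/1,max_queue_delay_microseconds:0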

I then pulled the latest Triton server release (24.02) and tried to deploy the tensorrt_llm model, but hit this error: UNAVAILABLE: Invalid argument: unable to find backend library for backend 'tensorrtllm', try specifying runtime on the model configuration
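The error refers to the backend shared library; a quick way to check whether it is present in the container (assuming Triton's default backends path):

ls /opt/tritonserver/backends/tensorrtllm/
# should list libtriton_tensorrtllm.so; if the directory is absent,
# the image does not ship the TensorRT-LLM backend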

byshiue commented 3 months ago

Could you share the Docker image you use? It looks like the server does not find the tensorrt_llm backend library.
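Note that the backend library ships in the trtllm flavor of the NGC Triton image, not the base tritonserver image. A sketch of launching with it (the mount paths are illustrative):

docker run --rm -it --gpus all \
    -v $(pwd)/triton_model_repo:/models \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 \
    tritonserver --model-repository=/models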

jasonngap1 commented 3 months ago

Hi, I managed to solve the issue by installing tensorrt-llm with pip instead of building from source. This issue can be closed.
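For anyone landing here with the same error, the install was along these lines (exact version pins omitted; the extra index is NVIDIA's PyPI, as documented for TensorRT-LLM):

pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com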