Closed: jasonngap1 closed this issue 1 week ago.
An update: I used tensorrt_llm==0.10.0 to convert the checkpoints and compile the model, but I am currently receiving the error "Assertion failed: Failed to deserialize cuda engine" when using Triton server version nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3.
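For context, a rough sketch of the convert-and-build flow, following the Llama example scripts in the TensorRT-LLM repository; the directory paths here are placeholders, not my actual layout:

```shell
# Sketch only: convert the Hugging Face checkpoint, then compile an engine
# with tensorrt_llm==0.10.0. Script and flag names follow the TensorRT-LLM
# Llama example; all paths are placeholders.
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Llama3-ChatQA-1.5-8B \
    --output_dir ./ckpt \
    --dtype float16

trtllm-build \
    --checkpoint_dir ./ckpt \
    --output_dir ./engines \
    --gemm_plugin float16
```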
The team is looking into the issue and will respond as soon as possible.
Hi @statiraju, sorry for not updating, but I have managed to solve the issue by aligning the tensorrt-llm versions used for compiling the model and inside the Triton server. Thanks!
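For anyone hitting the same deserialization failure: a quick way to confirm the versions are aligned is to print them in both places. A minimal sketch, assuming tensorrt_llm was installed via pip in the build environment:

```shell
# Run this in the environment where the engine was built, then again inside
# the nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 container; the
# reported versions must match, otherwise engine deserialization can fail.
python3 -c "import tensorrt_llm; print('tensorrt_llm', tensorrt_llm.__version__)"
python3 -c "import tensorrt; print('tensorrt', tensorrt.__version__)"
```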
Description
Unable to run Triton Inference Server with TensorRT-LLM for Llama3-ChatQA-1.5-8B.
Triton Information
v2.46.0
Are you using the Triton container or did you build it yourself?
Using the Triton container image nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3.
To Reproduce
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well).
For preprocessing:
For postprocessing:
For tensorrt_llm:
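The actual configuration files were not included above. As illustration only, a hypothetical minimal skeleton of the tensorrt_llm model's config.pbtxt, loosely following the templates in the tensorrtllm_backend repository; the values are placeholders, not the reporter's settings:

```
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 8

# Placeholder parameters; the tensorrtllm_backend templates normally fill
# these in via tools/fill_template.py.
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/path/to/engines" }
}
```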
Expected behavior
I would expect the Triton endpoints to be loaded. Instead, I got an error; here are my logs: