Open lzcchl opened 1 month ago
The multimodal example in the backend is still in progress because the multimodal backend is very different from a pure-decoder model (like GPT).
Have you found a demonstration program or solution? I want to deploy the quantized Baichuan large model using Triton, based on the example provided by TensorRT-LLM, but I'm still missing some clues.
If you only want to use the Baichuan model (decoder only), it should work; you can refer to documents like baichuan.md. If you want to use multimodal, it is still in progress.
As byshiue said, the multimodal example in the backend is still in progress. If you want to use a multimodal model within the Triton Inference Server framework right now, please use the Python backend as a transitional solution.
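For context, a Python-backend model in Triton is declared with a `config.pbtxt` that points the server at a `model.py` you write yourself. A minimal sketch of such a config is below; the model name `multimodal_py` and all tensor names are placeholders (assumptions, not from this thread), and you would adapt the shapes and data types to your actual multimodal pipeline:

```protobuf
# models/multimodal_py/config.pbtxt  (hypothetical layout)
name: "multimodal_py"
backend: "python"
max_batch_size: 8

input [
  {
    name: "IMAGE"
    data_type: TYPE_UINT8
    dims: [ -1 ]          # encoded image bytes, variable length
  },
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "TEXT_OUTPUT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

instance_group [ { kind: KIND_GPU } ]
```

The accompanying `model.py` would implement `initialize`/`execute`/`finalize` and call the TensorRT-LLM runtime (or the multimodal example code) directly, which is what makes this a workable stopgap until the dedicated backend example lands.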
Thank you very much. I have now completed quantization of the Baichuan2-13B model. However, I am still unclear about deploying the model with Triton and exposing its inference interface for external access. I noticed that the official Triton image has been updated, but I am still unsure how to use the image to expose the interface externally. Are there any relevant materials available? Best wishes.
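For what it's worth, once the Triton server container is running it exposes the inference interface over HTTP (port 8000 by default) and gRPC (port 8001) using the KServe v2 protocol, so any external client can POST to it. A minimal sketch of building a v2 request body in Python follows; the model name `ensemble` and the tensor names `text_input`/`max_tokens`/`text_output` follow the tensorrtllm_backend example layout and are assumptions here, so adjust them to match your own `config.pbtxt`:

```python
import json


def build_v2_request(prompt: str, max_tokens: int = 64) -> str:
    """Build a KServe v2 HTTP inference request body.

    Tensor names are assumed from the tensorrtllm_backend
    'ensemble' model; adjust to your deployment.
    """
    body = {
        "inputs": [
            {"name": "text_input", "datatype": "BYTES",
             "shape": [1, 1], "data": [prompt]},
            {"name": "max_tokens", "datatype": "INT32",
             "shape": [1, 1], "data": [max_tokens]},
        ],
        "outputs": [{"name": "text_output"}],
    }
    return json.dumps(body)


# POST this body to http://<host>:8000/v2/models/ensemble/infer
payload = build_v2_request("Hello, Baichuan!")
print(payload)
```

Remember to publish the ports when starting the container (e.g. `docker run ... -p 8000:8000 -p 8001:8001 -p 8002:8002 ...`) so the interface is reachable from outside the host.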
There is an example at https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwenvl , but I have no idea how to use this model in Triton server. Can you provide an example of a visual-language or multimodal model?