Open JamesBowerXanda opened 1 week ago
Hi @JamesBowerXanda, there is also a newer version available: 007439368137.dkr.ecr.us-east-2.amazonaws.com/sagemaker-tritonserver:24.03-py3 (see https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). Did you try it?
Description
I am using the SageMaker Triton Inference Server containers to run a multi-model endpoint. One of the models is an MT5 model. I am trying to optimise for latency and think I am losing time to data transfer: when I run the generation pipeline with onnxruntime in a notebook on an equivalent instance type it takes about 0.5 seconds, but when I send a request through the Triton Inference Server endpoint (with no other models loaded) the execution time is around 2.5 seconds.
The model is split into an encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx.
What is the best way to optimise this?
Happy to restructure if there is a better way of doing it, but I am running multiple models on the same SageMaker GPU instance.
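For context, a minimal sketch of the kind of notebook-side measurement behind the 0.5 second figure, assuming the exported ONNX files are driven through optimum's ORTModelForSeq2SeqLM on GPU (the prompt and generation settings here are placeholders, not my exact setup):

```python
# Minimal sketch of a notebook-side latency check (placeholder prompt/settings).
# Assumes the optimum export lives in ./onnx-model on a CUDA-capable instance.
import time

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = ORTModelForSeq2SeqLM.from_pretrained(
    "onnx-model", provider="CUDAExecutionProvider"
)

inputs = tokenizer("Some example input text.", return_tensors="pt").to("cuda")

# Warm-up run so CUDA/ORT initialisation is not counted.
model.generate(**inputs, max_new_tokens=64)

start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=64)
print(f"generation took {time.perf_counter() - start:.2f}s")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```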
Triton Information
23.08
Are you using the Triton container or did you build it yourself?
The SageMaker container, as mentioned here.
To Reproduce
Take a T5 or MT5 model and use optimum to export the constituent ONNX models:
optimum-cli export onnx --model google/mt5-small onnx-model --device cuda --optimize O4
Take the encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx files and add them to a Triton Inference Server model repository as ONNX models running on GPU. I will put the config.pbtxt files at the bottom.
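The resulting model repository looks roughly like this (the BLS model name `mt5_generate` is a placeholder for illustration):

```
model_repository/
├── encoder_model/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
├── decoder_model/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
├── decoder_with_past_model/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── mt5_generate/          # Python BLS model described below
    ├── 1/
    │   └── model.py
    └── config.pbtxt
```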
Create a Python BLS model with the model.py file:
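As a rough sketch of what such a BLS orchestration looks like (not the actual model.py): it assumes the optimum-exported tensor names (`input_ids`, `attention_mask`, `last_hidden_state`, `encoder_hidden_states`, `logits`), batch size 1, a greedy decode loop, and a hypothetical `output_ids` output. A production version would switch to decoder_with_past_model after the first step so the KV cache is reused.

```python
# Illustrative sketch only, not the actual model.py.
# Assumes optimum-exported tensor names and batch size 1; token ids are placeholders.
import json

import numpy as np
import triton_python_backend_utils as pb_utils

DECODER_START_TOKEN_ID = 0   # placeholder, check the MT5 tokenizer/config
EOS_TOKEN_ID = 1             # placeholder
MAX_NEW_TOKENS = 64


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids")
            attention_mask = pb_utils.get_input_tensor_by_name(request, "attention_mask")

            # 1) Run the encoder once per request via a BLS call.
            enc_req = pb_utils.InferenceRequest(
                model_name="encoder_model",
                requested_output_names=["last_hidden_state"],
                inputs=[input_ids, attention_mask],
            )
            enc_resp = enc_req.exec()
            if enc_resp.has_error():
                raise pb_utils.TritonModelException(enc_resp.error().message())
            encoder_hidden_states = pb_utils.get_output_tensor_by_name(
                enc_resp, "last_hidden_state"
            ).as_numpy()

            # 2) Greedy decode, calling decoder_model at every step.
            #    (The real pipeline would use decoder_with_past_model after the
            #    first step so the KV cache is reused instead of recomputed.)
            decoder_input_ids = np.array([[DECODER_START_TOKEN_ID]], dtype=np.int64)
            for _ in range(MAX_NEW_TOKENS):
                dec_req = pb_utils.InferenceRequest(
                    model_name="decoder_model",
                    requested_output_names=["logits"],
                    inputs=[
                        pb_utils.Tensor("input_ids", decoder_input_ids),
                        pb_utils.Tensor("encoder_attention_mask", attention_mask.as_numpy()),
                        pb_utils.Tensor("encoder_hidden_states", encoder_hidden_states),
                    ],
                )
                dec_resp = dec_req.exec()
                if dec_resp.has_error():
                    raise pb_utils.TritonModelException(dec_resp.error().message())
                logits = pb_utils.get_output_tensor_by_name(dec_resp, "logits").as_numpy()
                next_token = int(logits[:, -1, :].argmax(axis=-1)[0])
                decoder_input_ids = np.concatenate(
                    [decoder_input_ids, np.array([[next_token]], dtype=np.int64)], axis=1
                )
                if next_token == EOS_TOKEN_ID:
                    break

            out = pb_utils.Tensor("output_ids", decoder_input_ids)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```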
Run the inference with the model. Below are the relevant config.pbtxt files.
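Purely for illustration, an encoder config.pbtxt along these lines would match the optimum export (tensor names from the export, dimensions left dynamic; the decoder and BLS configs follow the same pattern):

```
name: "encoder_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]   # sequence length x hidden size
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
```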
Expected behavior
I expected a generation request through the endpoint to take approximately the same amount of time, rather than 5 times longer.