triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Issue on page /user_guide/model_configuration.html #7430

Open JamesBowerXanda opened 4 months ago

JamesBowerXanda commented 4 months ago

The page claims iterative sequences can be used, but I cannot find any examples of how to use them. I was hoping to use the feature to improve the latency of my mT5 decoder model with key-value caching, which runs using onnxruntime.

Can you confirm whether the onnxruntime backend supports iterative sequences?

If it does, are there any examples or documentation, or could you explain how the model would need to be set up?

I am currently using the SageMaker Triton image for tritonserver 23.07, but could move to a later version if that enables this.

sourabh-burnwal commented 4 months ago

Hi @JamesBowerXanda, I also couldn't find anything in the onnxruntime backend repo that confirms support for iterative sequences. However, since the feature is supported in the Python backend, I think you can use BLS with a Python backend model that acts as a wrapper around the onnxruntime model, as sketched below. That said, I am not 100% sure how efficient this setup would be.
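
Something along these lines is what I have in mind. This is only a rough sketch; the model name (`mt5_decoder`) and tensor names (`input_ids`, `logits`) are placeholders for whatever your onnxruntime model actually exposes:

```python
# model.py for a Python backend model that wraps the onnxruntime model via BLS.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Pull the decoder input from the incoming request.
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids")

            # Issue a BLS call to the onnxruntime model ("mt5_decoder" is a placeholder name).
            infer_request = pb_utils.InferenceRequest(
                model_name="mt5_decoder",
                requested_output_names=["logits"],
                inputs=[input_ids],
            )
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())

            # Relay the onnxruntime model's output back to the client.
            logits = pb_utils.get_output_tensor_by_name(infer_response, "logits")
            responses.append(pb_utils.InferenceResponse(output_tensors=[logits]))
        return responses
```

The iterative-sequence bookkeeping (feeding the previous key/value state back in at each step) would then live in this Python wrapper rather than in the onnxruntime backend itself.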

oandreeva-nv commented 3 months ago

Hi @JamesBowerXanda, would this tutorial answer your questions: https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_7-iterative_scheduling?
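
If it helps, the tutorial's model turns on iterative scheduling through the sequence batcher in its `config.pbtxt`, roughly along these lines (a sketch only; please check the exact fields against the tutorial's config):

```
model_transaction_policy {
  decoupled: true
}
sequence_batching {
  iterative_sequence: true
}
```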

JamesBowerXanda commented 3 months ago

@oandreeva-nv, I will take a look, but I was hoping it would be possible to do this using the onnxruntime backend, and it looks like the tutorial uses a Python backend.