Open JamesBowerXanda opened 4 months ago
Hi @JamesBowerXanda, I also couldn't find anything in the onnxruntime backend repo confirming that it supports iterative sequences. However, since they are supported in the Python backend, you could use BLS with a Python backend model acting as a wrapper around the onnxruntime model to achieve this. That said, I'm not 100% sure how efficient this setup would be.
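For reference, a minimal sketch of what such a BLS wrapper's `model.py` might look like. This is an assumption-laden example, not a tested setup: the model name `mt5_decoder` and the tensor names `input_ids`/`logits` are placeholders you would replace with your actual onnxruntime model's configuration, and this only runs inside the Triton server (where `triton_python_backend_utils` is available).

```python
# Hypothetical BLS wrapper (model.py) around an onnxruntime model.
# "mt5_decoder", "input_ids", and "logits" are assumed names, not
# taken from this thread -- adapt them to your model's config.pbtxt.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Pull the input tensor off the incoming request.
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids")

            # BLS call: synchronously invoke the onnxruntime-backed model.
            infer_request = pb_utils.InferenceRequest(
                model_name="mt5_decoder",
                requested_output_names=["logits"],
                inputs=[input_ids],
            )
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(
                    infer_response.error().message()
                )

            # Forward the onnxruntime model's output as this model's response.
            logits = pb_utils.get_output_tensor_by_name(infer_response, "logits")
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[logits])
            )
        return responses
```

The iterative/decoding loop logic (e.g. feeding cached key/values back in each step) would live in this wrapper, while the onnxruntime model stays a plain single-step model.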
Hi @JamesBowerXanda , would this tutorial: https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_7-iterative_scheduling answer your questions?
@oandreeva-nv, I will take a look, but I was hoping it would be possible to do this using the onnxruntime backend, and it looks like that tutorial uses the Python backend.
The documentation claims iterative sequences can be used, but I cannot find any examples of how to use them. I was hoping to use them to improve the latency of my mT5 decoder model with key-value caching, which runs using onnxruntime.
Can you confirm that the onnxruntime backend supports iterative sequences?
If it does, are there any examples or documentation, or could you explain how the model would need to be set up?
I am currently using the SageMaker Triton image for tritonserver 23.07, but could move to a later version if that enables this.
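For context on "how the model would need to be set up": based on the iterative scheduling conceptual guide linked above, iterative sequences are enabled in the model's `config.pbtxt` via the sequence batcher. A minimal sketch is below; note that whether the onnxruntime backend actually honors these settings is exactly the open question in this issue, so treat this as the Python-backend-style configuration, not a confirmed onnxruntime setup.

```
# Sketch of an iterative-scheduling model config (per the
# Conceptual_Guide/Part_7-iterative_scheduling tutorial).
model_transaction_policy {
  decoupled: True
}
sequence_batching {
  iterative_sequence: true
}
```

With this configuration the scheduler re-enqueues each in-flight request every batching cycle, so the model (or a wrapper around it) can emit one decoding step per iteration instead of looping internally.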