triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Is onnxruntime-genai supported? #7182

Open jackylu0124 opened 2 months ago

jackylu0124 commented 2 months ago

Hey all, I have a quick question: is onnxruntime-genai (https://onnxruntime.ai/docs/genai/api/python.html) supported in Triton Inference Server's ONNX Runtime backend? I couldn't find relevant sources in the documentation. Thanks!

nnshah1 commented 2 months ago

@jackylu0124 Support for onnxruntime-genai is currently work in progress - the python bindings should work within the python backend - but we haven't had a chance to test that ourselves yet.

That being said we are actively investigating support - can you share more about your use case / timeline needed for support?

jackylu0124 commented 2 months ago

> @jackylu0124 Support for onnxruntime-genai is currently work in progress - the python bindings should work within the python backend - but we haven't had a chance to test that ourselves yet.
>
> That being said we are actively investigating support - can you share more about your use case / timeline needed for support?

Hi @nnshah1, thank you very much for your fast reply! By "the python bindings should work within the python backend", you mean that I can `import onnxruntime_genai` and write the custom inference logic myself in the Python backend, as opposed to having Triton Inference Server automatically manage the .onnx model files (that use onnxruntime-genai) in the model repository for me (which is the feature still in development)? Is my understanding correct?
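
For concreteness, this is roughly what I have in mind for a Python backend `model.py` (just an untested sketch on my end; the folder layout, tensor names, and search options are placeholders, and the exact onnxruntime-genai generation calls may differ between releases):

```python
# Hypothetical model_repository/onnx_genai_llm/1/model.py -- untested sketch
import numpy as np
import onnxruntime_genai as og
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the onnxruntime-genai model files placed next to this model.py;
        # the "genai_model" folder name is a placeholder.
        model_dir = f'{args["model_repository"]}/{args["model_version"]}/genai_model'
        self.model = og.Model(model_dir)
        self.tokenizer = og.Tokenizer(self.model)

    def execute(self, requests):
        responses = []
        for request in requests:
            # "PROMPT"/"COMPLETION" are placeholder tensor names that would
            # have to match the model's config.pbtxt.
            prompt = (
                pb_utils.get_input_tensor_by_name(request, "PROMPT")
                .as_numpy()[0]
                .decode("utf-8")
            )

            # Generation loop following the onnxruntime-genai Python docs;
            # the exact calls vary somewhat between releases.
            params = og.GeneratorParams(self.model)
            params.set_search_options(max_length=256)
            generator = og.Generator(self.model, params)
            generator.append_tokens(self.tokenizer.encode(prompt))
            while not generator.is_done():
                generator.generate_next_token()
            completion = self.tokenizer.decode(generator.get_sequence(0))

            out = pb_utils.Tensor(
                "COMPLETION", np.array([completion.encode("utf-8")], dtype=np.object_)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```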

My use case is mainly serving LLMs, some of which are ONNX models that depend on onnxruntime_genai. I don't have a specific timeline; I am mainly interested in knowing whether this feature is on Triton Inference Server's development roadmap.

Also, a follow-up question: for serving LLMs, what would be the best backend for serving with token streaming, outside of the TensorRT-LLM backend?
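
The kind of per-token streaming I mean would look something like the Python backend's decoupled mode (again just a rough sketch; the `TOKEN` tensor name and `_generate_tokens` helper are placeholders I made up):

```python
# Untested sketch of a decoupled (streaming) Python backend execute();
# requires `model_transaction_policy { decoupled: true }` in config.pbtxt.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # _generate_tokens is a placeholder for a per-token generation loop
            # (e.g. the onnxruntime-genai loop above, yielding one token at a time).
            for token_text in self._generate_tokens(request):
                out = pb_utils.Tensor(
                    "TOKEN", np.array([token_text.encode("utf-8")], dtype=np.object_)
                )
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
            # Tell the client that no more responses are coming for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None from execute(); responses go through the sender.
        return None
```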

Thanks!