microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

Documentation for running inference or a pre-built inference server #410

Closed: Ben-Epstein closed this issue 4 months ago

Ben-Epstein commented 4 months ago

This library is great. I've been testing phi-3-mini-128k, and this is by far the fastest runtime I've found for it. For a non-ONNX model I'd use TGI, but presumably you have a more optimized setup for ONNX models?

Do you have documentation on best practices for deploying the model and handling things like batching, streaming, etc.? Or are you planning to build a RESTful server that can be deployed as a Docker image?
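For context, this is roughly how I'm running inference today with the Python package, based on the phi-3 examples in this repo. It's a sketch from memory (the model path, prompt template, and exact API names are my assumptions from the published samples), and it's this token-by-token streaming loop that I'd want a server to wrap with batching:

```python
import onnxruntime_genai as og

# Assumed local path to the exported phi-3-mini-128k ONNX model directory
model = og.Model("path/to/phi-3-mini-128k-instruct-onnx")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()  # incremental detokenizer for streaming output

# Phi-3 chat prompt format (as used in the repo's examples)
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)
params.input_ids = input_tokens

# Generate and print tokens one at a time as they are produced
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(stream.decode(new_token), end="", flush=True)
```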

This might be related to https://github.com/microsoft/onnxruntime-genai/issues/313, but it's not clear what that issue is asking for.

Thanks!

natke commented 4 months ago

Hi @Ben-Epstein, thank you for the feedback! Yes, we do have the RESTful server feature on our roadmap, and we are discussing the delivery timeframe for it. In the meantime, I am moving this issue into our Discussions forum so that we can continue the conversation there.