Documentation for running inference or a pre-built inference server

microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

MIT License

413 stars 94 forks source link

This library is great, I've been testing phi-3-mini-128k, and this by far the fastest runtime for it. For a non-onnx model, id use TGI but presumably you have a more optimized setup for onnx models?

Do you have documentation for best practices around how to deploy the model and handle things like batching, streaming etc? Or, are you planning on building some RESTful server that can be deployed through a docker image?

Might be related to https://github.com/microsoft/onnxruntime-genai/issues/313 but it's not clear what that issue is asking for

Thanks!

microsoft / onnxruntime-genai

Documentation for running inference or a pre-built inference server #410