triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Docs for multi-model serving with over-commit #5689

Open fahimkm opened 1 year ago

fahimkm commented 1 year ago

Is your feature request related to a problem? Please describe.
I read in the Seldon Core documentation that multi-model serving with overcommit is available out of the box on NVIDIA Triton: https://docs.seldon.io/projects/seldon-core/en/v2/contents/models/mms/mms.html?highlight=multi%20modal%20serving

Describe the solution you'd like
Could you please share documentation on how to configure and implement multi-model serving with overcommit using NVIDIA Triton?

dyastremsky commented 1 year ago

It looks like Seldon says that Triton supports multi-model serving, which it does. That works out of the box: you simply load multiple models onto Triton, and our basic documentation covers it.
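
For reference, a minimal sketch of what serving multiple models looks like from the client side, assuming a Triton server is already running at localhost:8000 with HTTP enabled and a model repository containing more than one model (the model names below are illustrative, not from this thread):

```python
# Minimal sketch (assumptions: Triton running on localhost:8000 with HTTP enabled,
# tritonclient installed via `pip install tritonclient[http]`; model names are hypothetical).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# List every model Triton found in the model repository and its current state.
for entry in client.get_model_repository_index():
    print(entry["name"], entry.get("state", ""))

# Check that two illustrative models are loaded and ready to serve requests.
for name in ["densenet_onnx", "resnet50"]:
    print(name, "ready:", client.is_model_ready(name))
```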

Overcommitting is not supported by Triton. You can see how model management works here. What you're describing could be accomplished using EXPLICIT model control mode if you create the logic for loading and unloading models as needed; a rough sketch follows.
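
To illustrate that EXPLICIT-mode approach, here is a rough sketch (not an official recipe) of the kind of load/unload logic you would write yourself. It assumes the server was started with `--model-control-mode=explicit` and an HTTP endpoint on localhost:8000; the model names, the MAX_LOADED limit, and the simple LRU-style eviction policy are made up for illustration:

```python
# Rough sketch of client-driven load/unload in EXPLICIT mode (assumptions:
# server started with --model-control-mode=explicit, HTTP endpoint on localhost:8000,
# model names and the LRU-style policy below are illustrative only).
from collections import OrderedDict
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
MAX_LOADED = 2          # pretend only two models fit in memory at once
loaded = OrderedDict()  # model name -> None, ordered by least recent use

def ensure_loaded(model_name: str) -> None:
    """Load model_name on demand, evicting the least recently used model if needed."""
    if model_name in loaded:
        loaded.move_to_end(model_name)
        return
    if len(loaded) >= MAX_LOADED:
        victim, _ = loaded.popitem(last=False)   # least recently used model
        client.unload_model(victim)              # ask Triton to free it
    client.load_model(model_name)                # ask Triton to load the requested model
    loaded[model_name] = None

# Example: requests arrive for three models, but only two can be resident at a time.
for name in ["model_a", "model_b", "model_c", "model_a"]:
    ensure_loaded(name)
    print("currently loaded:", list(loaded))
```

This is essentially the logic a user would have to build today; the enhancement request below is about providing (and documenting) such behavior in Triton itself.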

I'm going to mark this as an enhancement. We've filed a ticket to investigate adding this feature.