triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

How many instances can Triton support for parallel inference at most? #7641

Open · wwdok opened 2 months ago

wwdok commented 2 months ago

Suppose I have a GPU with 24 GB of memory (an A30) and my model is a 200 MB wav2lip model. If I choose TensorRT as the inference framework, can we estimate the maximum number of model instances that can run in parallel under Triton? For example, if the Triton context consumes x GB of GPU memory and each TensorRT workspace occupies 2 GB, is the maximum number of parallel instances (24 - x) / 2? If so, what is x? Currently I am using GPU virtualization to run concurrent model inference, but that solution supports at most 5 parallel instances. I would like to know whether Triton could be an alternative that supports more instances, as this is very important for cost savings.
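
For reference, a rough back-of-envelope version of the estimate described above might look like the sketch below. All of the memory figures (server/context overhead, per-instance workspace, engine weight size) are placeholder assumptions for illustration, not measured numbers:

```python
# Rough capacity estimate for parallel TensorRT model instances on one GPU.
# Every memory figure below is an assumed placeholder; real values have to be
# measured (e.g. by watching nvidia-smi while the server is loaded and serving).

GPU_MEMORY_GB = 24.0              # A30 total memory
TRITON_OVERHEAD_GB = 1.5          # assumed CUDA context + server overhead ("x" in the question)
ENGINE_WEIGHTS_GB = 0.2           # ~200 MB wav2lip engine weights per instance (assumed)
WORKSPACE_PER_INSTANCE_GB = 2.0   # assumed TensorRT workspace + activation memory per instance

usable = GPU_MEMORY_GB - TRITON_OVERHEAD_GB
per_instance = ENGINE_WEIGHTS_GB + WORKSPACE_PER_INSTANCE_GB

max_instances = int(usable // per_instance)
print(f"Estimated max parallel instances: {max_instances}")
```

In practice, activation memory scales with batch size and input resolution, so the per-instance cost usually has to be measured empirically rather than derived from the engine file size alone.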

My use case is to use wav2lip to generate a new lip-synced video from audio. During inference, I take 4 video frames as one batch per inference request, so multiple inferences are needed to produce the complete synthesized video. The simplified workflow is shown below:

[workflow diagram]

How do you think Triton could be integrated into this workflow? Which features of Triton could be used?
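
If it helps, here is a minimal sketch of how the per-batch inference calls in this workflow could be issued concurrently through Triton's Python HTTP client, so that several 4-frame batches are in flight at once and can be scheduled across multiple model instances and/or merged by dynamic batching. The model name, input/output tensor names, shapes, and dtypes below are placeholders that would need to match the actual wav2lip model configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder names/shapes -- these must match the deployed model's config.pbtxt.
MODEL_NAME = "wav2lip_trt"       # assumed model name
INPUT_NAME = "frames"            # assumed input tensor name
OUTPUT_NAME = "synced_frames"    # assumed output tensor name
BATCH = 4                        # 4 video frames per request, as in the workflow above

client = httpclient.InferenceServerClient(url="localhost:8000")

def submit_batch(frames: np.ndarray):
    """Build and submit one asynchronous inference request for a 4-frame batch."""
    inp = httpclient.InferInput(INPUT_NAME, list(frames.shape), "FP32")
    inp.set_data_from_numpy(frames)
    out = httpclient.InferRequestedOutput(OUTPUT_NAME)
    # async_infer returns immediately, so several batches can be in flight at
    # once and the server can spread them across its model instances.
    return client.async_infer(model_name=MODEL_NAME, inputs=[inp], outputs=[out])

# Pretend the video has already been split into 4-frame chunks (dummy data here).
chunks = [np.random.rand(BATCH, 3, 96, 96).astype(np.float32) for _ in range(8)]

pending = [submit_batch(chunk) for chunk in chunks]
results = [req.get_result().as_numpy(OUTPUT_NAME) for req in pending]
print(f"Received {len(results)} synthesized chunks")
```

On the server side, the relevant features would be instance_group (to run several copies of the TensorRT engine on the same GPU) and dynamic_batching in the model's config.pbtxt; how many instances actually fit depends on measured per-instance memory usage rather than a fixed formula.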