Open julienripoche opened 1 year ago
Thank you for the kind message and feature request. I've filed a ticket for us to investigate this potential enhancement.
Hi @dyastremsky, just checking on the status of the feature request, is there any news? Or can you give me any leads on how to implement it in a custom backend myself?
Not yet, it's still in our queue.
You'd need to see whether the TensorRT API has any way to support this kind of model swapping. You'd probably be able to build a custom version of the current TensorRT backend. You'd potentially need to update the server or core repo logic to recognize multiple versions with shared libraries (e.g. via model config loading), figure out how to do the file I/O mapping for these models, and then build a custom server to plug your backend into.
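For anyone exploring the custom-backend route: one TensorRT mechanism that looks relevant to this kind of weight swapping is the refit API, which lets a single engine keep its structure and activation memory while its parameters are replaced. Below is a minimal sketch, assuming a TensorRT 8.x C++ environment and an engine built with `BuilderFlag::kREFIT`; the function name and the weight map are illustrative, not part of the Triton backend API.

```cpp
// Hypothetical sketch of weight swapping via TensorRT's refit API.
// Assumes the engine was built with BuilderFlag::kREFIT and that the
// replacement weights for each label set are already in host memory.
#include <NvInfer.h>

#include <map>
#include <string>

using namespace nvinfer1;

// Swap a new set of weights into an already-deserialized engine.
// `weightsByName` maps weight names (as reported by IRefitter::getAllWeights)
// to host buffers holding the replacement values.
bool refitEngine(ICudaEngine& engine, ILogger& logger,
                 const std::map<std::string, Weights>& weightsByName)
{
    IRefitter* refitter = createInferRefitter(engine, logger);
    if (refitter == nullptr)
        return false;

    bool ok = true;
    for (const auto& [name, weights] : weightsByName)
    {
        // setNamedWeights is available in TensorRT 8.x; older releases use
        // setWeights(layerName, WeightsRole, Weights) instead.
        ok = refitter->setNamedWeights(name.c_str(), weights) && ok;
    }

    // Apply the new weights: the engine keeps its structure and its
    // activation-memory footprint, only the parameters change.
    ok = ok && refitter->refitCudaEngine();

    delete refitter;  // TensorRT 8+; earlier versions call destroy() instead
    return ok;
}
```

In a custom backend this would mean keeping one engine resident and refitting it whenever a request targets a different label set, at the cost of the refit latency when switching.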
To begin, I would like to thank the Triton Inference Server team! You provide us with a very convenient tool to deploy deep learning models :)
Is your feature request related to a problem? Please describe. I'm working on an object detection problem with a lot of labels. Because I need very high precision for my use case, I trained several object detection models on different sets of labels. These models are therefore exactly identical in structure but have different weights. I converted them to ONNX and then to TensorRT. When I load these TensorRT engines in Triton server, each engine reserves its own GPU memory, which is pretty standard. But given the large number of models in my case, this is not very convenient: it forces me to use a GPU with a lot of memory, and I would like to be able to use a GPU with much less memory.
Describe the solution you'd like Since these models are exactly identical in structure, I would like to be able to load the weights of all of them while reserving only once the GPU memory needed to run the model, i.e. all the memory that is not tied to the weights but reserved for the outputs of each layer. I wonder if something already exists in the TensorRT/Triton server ecosystem that can do what I described.
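One thing that might be worth checking on the plain TensorRT side: execution contexts can be created without their own activation memory and pointed at a single shared scratch buffer, so the per-layer output memory is allocated only once even across several engines. Each engine still keeps its own copy of the weights, so this only covers part of the request. A minimal sketch, assuming the TensorRT C++ API; the helper name and overall structure are illustrative.

```cpp
// Hypothetical sketch: share one activation/scratch buffer across several
// TensorRT engines that have the same structure but different weights.
// Each engine still holds its own copy of the weights; only the per-layer
// output (activation) memory is allocated once.
#include <NvInfer.h>
#include <cuda_runtime.h>

#include <algorithm>
#include <vector>

using namespace nvinfer1;

std::vector<IExecutionContext*> createContextsWithSharedScratch(
    const std::vector<ICudaEngine*>& engines, void** sharedScratch)
{
    // Size the shared buffer for the most demanding engine. With identical
    // structures the sizes should all be the same anyway.
    size_t scratchSize = 0;
    for (ICudaEngine* engine : engines)
        scratchSize = std::max(scratchSize, engine->getDeviceMemorySize());

    cudaMalloc(sharedScratch, scratchSize);

    std::vector<IExecutionContext*> contexts;
    for (ICudaEngine* engine : engines)
    {
        // Create the context without its own activation memory...
        IExecutionContext* ctx = engine->createExecutionContextWithoutDeviceMemory();
        // ...and point it at the shared buffer. Contexts sharing the buffer
        // must not execute concurrently.
        ctx->setDeviceMemory(*sharedScratch);
        contexts.push_back(ctx);
    }
    return contexts;
}
```

The contexts sharing the buffer cannot run concurrently, so this trades parallel execution of the models for the memory saving.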
Describe alternatives you've considered N/A
Additional context N/A