elmuz opened 5 months ago
There are a couple of challenges with the design of such a solution, which is why Triton only exposes certain hooks for handling models explicitly.
Is the available memory enough?
This is a hard value to determine. We might get an estimate of how much memory a model's weights will occupy on the GPU, but at runtime there is usually additional dynamic memory allocation that depends on the size of the tensors fed to the model. This is made worse by dynamic shapes and data-dependent dynamic output shapes. On top of that, certain DL frameworks allocate memory pools on the GPU which are never released.
Hence, it is difficult for a generic serving application to determine by heuristics whether a typical model will fit in memory. Moreover, the Least-Recently-Used (LRU) model eviction that you have proposed is not a generic solution, as it depends significantly on the workload, which can be specific to the service use case.
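For illustration only, here is a minimal sketch of what such a workload-specific LRU policy could look like when implemented outside of Triton, assuming the tritonclient Python package and a server started with explicit model control (e.g. `--model-control-mode=explicit`). The model budget `MAX_LOADED_MODELS` is a placeholder that the operator would have to choose, since (per the caveats above) Triton cannot derive it for you:

```python
# Sketch of an LRU model policy built on Triton's explicit model-control API.
# Assumes the server runs with --model-control-mode=explicit.
from collections import OrderedDict

import tritonclient.http as httpclient

MAX_LOADED_MODELS = 2  # hypothetical budget chosen by the operator, not derived from GPU memory

client = httpclient.InferenceServerClient(url="localhost:8000")
loaded = OrderedDict()  # model name -> None, ordered by recency of use


def ensure_loaded(model_name: str) -> None:
    """Load model_name if needed, evicting the least-recently-used model first."""
    if model_name in loaded:
        loaded.move_to_end(model_name)          # mark as most recently used
        return
    while len(loaded) >= MAX_LOADED_MODELS:
        victim, _ = loaded.popitem(last=False)  # least-recently-used entry
        client.unload_model(victim)
    client.load_model(model_name)
    loaded[model_name] = None
```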
We do understand the value in your proposal and will see whether it makes sense to support this feature. However, it is difficult for us to provide an optimal and robust solution to this problem.
This logic should be handled on the server side. The user (client) should only have to send a request and, if unlucky, wait some time until the output is received.
Meanwhile, you can use a sidecar service to manage model availability based on your expected workload. For your use case it would make even more sense to add a middleman service between the client and the server that ensures the model is available before trying to run inference on Triton.
Your middleman service can use its knowledge of the models in the repository to make these decisions.
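As a rough sketch of such a middleman (not a definitive implementation), assuming the tritonclient Python package, explicit model control on the server, and FP32 inputs; a single process-wide lock serializes load-and-infer so that two clients requesting opposite actions cannot leave things in an inconsistent state:

```python
# Sketch of a middleman that ensures model availability before inference.
import threading

import numpy as np
import tritonclient.http as httpclient

_lock = threading.Lock()
_client = httpclient.InferenceServerClient(url="localhost:8000")


def infer_ensuring_availability(model_name: str, input_name: str, data: np.ndarray):
    """Load the model if it is not ready, then run inference, all under one lock."""
    with _lock:
        if not _client.is_model_ready(model_name):
            _client.load_model(model_name)  # returns once the load attempt completes
        # "FP32" is an assumption about the model's input datatype
        infer_input = httpclient.InferInput(input_name, list(data.shape), "FP32")
        infer_input.set_data_from_numpy(data.astype(np.float32))
        return _client.infer(model_name, inputs=[infer_input])
```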
OK, I understand your point. For the moment, I will build the middleman logic for my specific need. Thank you.
EDIT: I had closed this, but it may be better to keep the issue open unless you decide it is out of scope. Anyway, triage is up to you.
Is your feature request related to a problem? Please describe. I am asking for the recommended way to achieve the following behavior.
SCENARIO: I have many different models. Consider them distinct and independent, not upgrades of one another. I would like to give users the possibility to test them all, but unfortunately the GPU memory is not enough to fit them all at once.
The current API allows me to explicitly load/unload models. However, load > inference > unload is not an atomic sequence. If two clients request opposite actions, this creates an inconsistent state.
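For illustration, a minimal sketch of the sequence I mean, assuming the tritonclient Python package and explicit model control; the model and tensor names are placeholders:

```python
# Illustration of the non-atomic load > inference > unload sequence.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("model_a")                                   # 1. load

inp = httpclient.InferInput("INPUT0", [1, 3], "FP32")          # 2. inference
inp.set_data_from_numpy(np.zeros((1, 3), dtype=np.float32))
result = client.infer("model_a", inputs=[inp])

client.unload_model("model_a")                                 # 3. unload
# Nothing prevents another client from calling unload_model("model_a"), or
# loading a different model that exhausts GPU memory, between steps 1 and 2.
```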
Describe the solution you'd like The ideal behavior (from the inference server's perspective) would be something like this:
This logic should be handled on the server side. The user (client) should only have to send a request and, if unlucky, wait some time until the output is received.
I am not sure whether there is already a way to obtain this. If not, what is the simplest way you would recommend for me to develop something on my end to achieve this behavior?