triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Automatically unload (oldest) models when memory is full #7279

Open elmuz opened 5 months ago

elmuz commented 5 months ago

Is your feature request related to a problem? Please describe. I am asking for the recommended way to achieve the following behavior.

SCENARIO: I have many different models. Consider them independent of one another, not one being an upgrade of another. I would like to give users the possibility to test them all, but unfortunately the GPU memory is not enough to fit them all at once.

The current API allows me to explicitly load/unload models. However, load > inference > unload is not an atomic sequence. If two clients ask for opposite actions, the server ends up in an inconsistent state.
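For concreteness, this is the kind of sequence I mean, as a minimal sketch assuming the Python `tritonclient` package and a server started with `--model-control-mode=explicit` (the model name `model_a` is just a placeholder):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Client A: load > inference > unload is three separate calls, not one atomic operation.
client.load_model("model_a")
# ... build httpclient.InferInput objects and call client.infer("model_a", inputs) ...
client.unload_model("model_a")

# Nothing stops client B from calling client.unload_model("model_a") between
# A's load and A's inference, so A can fail even though it "did everything right".
```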

Describe the solution you'd like The ideal behavior would be something like this (from the inference server's perspective):

  1. I receive an inference request. Do I have that model loaded and ready to execute inference?
    • If yes, run inference -> END
    • If no, try to load the model. -> Go to 2.
  2. Does the model exist in the repository?
    • If no, raise exception -> END
    • If yes, try to load it. -> Go to 3.
  3. Is the available memory enough?
    • If yes, load the model -> END
    • If no, go to 4.
  4. Sort the loaded models by the time of their last request, from oldest to most recent.
  5. Unload the oldest model in the list. -> Go to 3.

This logic should be handled on the server side. The user (client) is only supposed to send requests and, if unlucky, wait for some time until the output is received.

I am not sure if there is already a way to obtain this. If not, what is the simplest approach you would recommend for building something on my end to achieve this behavior?
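To make the request more concrete, here is a rough sketch of steps 1–5 implemented on the client side with the Python `tritonclient` package. The least-recently-used bookkeeping and the assumption that a failed load means "out of memory" are simplifications for illustration only; this is not something Triton provides today.

```python
import threading
from collections import OrderedDict

import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

client = httpclient.InferenceServerClient(url="localhost:8000")
_lock = threading.Lock()        # serialize load/unload decisions across callers
_last_used = OrderedDict()      # model name -> insertion order approximates recency


def infer_with_eviction(model_name, inputs, outputs=None):
    with _lock:
        # Step 2: does the model exist in the repository?
        known = {m["name"] for m in client.get_model_repository_index()}
        if model_name not in known:
            raise ValueError(f"{model_name} is not in the model repository")

        # Step 1: if the model is not ready, try to load it.
        while not client.is_model_ready(model_name):
            try:
                client.load_model(model_name)          # step 3: attempt the load
            except InferenceServerException:
                # Steps 4-5: treat the failure as "not enough memory", unload
                # the least recently used model and retry the load.
                if not _last_used:
                    raise
                oldest, _ = _last_used.popitem(last=False)
                client.unload_model(oldest)

        # Record recency for the LRU policy.
        _last_used.pop(model_name, None)
        _last_used[model_name] = True

    # NOTE: a real implementation would also need to pin a model while a request
    # is in flight, otherwise a concurrent eviction could still unload it.
    return client.infer(model_name, inputs, outputs=outputs)
```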

tanmayv25 commented 5 months ago

There are a couple of challenges with the design of such a solution, which is why Triton has only exposed certain hooks to handle models explicitly.

> Is the available memory enough?

This is a hard value to determine. We might get an estimate of how much memory a model's weights will occupy on the GPU, but at runtime there is usually dynamic memory allocation based on the size of the tensors fed to the model. This is made worse by dynamic shapes and data-dependent dynamic-shape outputs. On top of this, certain DL frameworks allocate memory pools on the GPU that never get released.

Hence, it is difficult to determine by heuristics, in a generic serving application, whether a typical model would fit in memory. Beyond this, the least-recently-used eviction policy you have proposed is not a generic solution, as it depends significantly on the workload, which can be specific to the service use case.

We do understand the value in your proposal and will see whether it makes sense to support this feature. However, it is difficult for us to provide an optimal and robust solution to this problem.

> This logic should be handled on the server side. The user (client) is only supposed to send requests and, if unlucky, wait for some time until the output is received.

Meanwhile, you can use a sidecar service to handle model availability based upon your expected workload. For your use case it would make even more sense to add a middleman service between the client and the server which can ensure the model is available before trying to run inference on Triton.

Your middleman service can use its knowledge of the models included in the repository when making these decisions.
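As a small illustration of that last point, the model repository index already reports which models exist and which are currently loaded, so a middleman can build its picture of the world from it. A sketch with the Python `tritonclient` package (the URL is a placeholder):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Each index entry describes one model in the repository; entries for loaded
# models carry a "state" field such as "READY" (or "UNAVAILABLE" after unload).
index = client.get_model_repository_index()
known = [m["name"] for m in index]
loaded = [m["name"] for m in index if m.get("state") == "READY"]

# A middleman can refresh this view periodically and decide, before forwarding
# a request to Triton, whether it must unload something and load the requested
# model first.
print("known models:", known)
print("currently loaded:", loaded)
```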

elmuz commented 5 months ago

Ok, I understand your point. For the moment, I will build middleman logic for my specific need. Thank you.

EDIT: I had closed this, but it's probably better to keep the issue open unless you decide it's out of scope. Anyway, triage is up to you.