We have something like this implemented for SMT models.
ClearML may already have this implemented:
We support that, but it is not dynamic loading; it is just removing and adding models, which does not keep them available after unloading them from GPU RAM. That's the main issue. When we unload a model, it is gone entirely. To do dynamic loading, we would need to be able to keep the model in system RAM while unloading it from GPU RAM; that's the feature that is missing on all Triton deployments. Does that make sense?
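To illustrate the distinction being described, here is a minimal PyTorch sketch (outside of Triton entirely; the layer, its size, and the availability of a CUDA device are all arbitrary assumptions for illustration):

```python
import torch

# "Unload" in today's Triton sense: the model is gone entirely and must be
# deserialized from disk again on the next load.
model = torch.nn.Linear(4096, 4096).cuda()  # weights live in GPU RAM
del model
torch.cuda.empty_cache()

# The missing "dynamic" behavior: keep the weights in system RAM so a later
# reload is a cheap host-to-device copy instead of a full deserialize.
model = torch.nn.Linear(4096, 4096).cuda()
model = model.cpu()        # move weights from GPU RAM to system RAM
torch.cuda.empty_cache()   # free the GPU memory
model = model.cuda()       # fast path back onto the GPU
```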
Just to confirm: yes, ClearML can do automatic loading/unloading, but each load/unload will take time. There is also CPU deserialization time (imagine unpickling a 20 GB file; this takes time... and is actually the main bottleneck, not just the IO).
@johnml1135 @ddaspit, hello, I have a question about this matter: will the loading/unloading happen automatically, or do we need to do something to enable it? Also, if you could provide me with some links/documentation, I would appreciate it.
@robosina - this is currently a wish-list item and a conceptual design; it has not been implemented in Serval. The core technology that would perform the loading/unloading is https://github.com/allegroai/clearml-serving, which is a layer on top of https://www.nvidia.com/en-us/ai-data-science/products/triton-management-service/. I would review those products for dynamic loading/unloading.
Assuming that we use clearml-serving for real-time inferencing, we may need to spin up our own dynamic loading/unloading algorithm, because the core Triton Inference Server from NVIDIA does not support it without buying the enterprise plan.
If we were to do this ourselves, we would need to make explicit calls to the model management API and implement a simple algorithm such as the sketch below:
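For illustration only, here is a minimal sketch of such an algorithm, assuming Triton is started with `--model-control-mode=explicit` so its model-repository extension (`POST /v2/repository/models/<name>/load` and `/unload`) is available. The LRU policy, the `MAX_LOADED` limit, the server URL, and the model name are all hypothetical choices, not a settled design:

```python
import requests
from collections import OrderedDict

# Assumptions: Triton runs with --model-control-mode=explicit, its HTTP
# endpoint is on localhost:8000, and GPU capacity is approximated as a
# fixed number of simultaneously loaded models.
TRITON_URL = "http://localhost:8000"
MAX_LOADED = 3


class LruModelManager:
    """Keep at most MAX_LOADED models loaded in Triton; evict the least recently used."""

    def __init__(self):
        self._loaded = OrderedDict()  # model name -> None; insertion order = recency

    def ensure_loaded(self, name):
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return
        # Evict until there is room, via Triton's model-repository extension.
        while len(self._loaded) >= MAX_LOADED:
            victim, _ = self._loaded.popitem(last=False)  # least recently used
            requests.post(f"{TRITON_URL}/v2/repository/models/{victim}/unload").raise_for_status()
        requests.post(f"{TRITON_URL}/v2/repository/models/{name}/load").raise_for_status()
        self._loaded[name] = None


# Usage: call before routing each inference request to its engine.
manager = LruModelManager()
manager.ensure_loaded("nllb_en_es")  # hypothetical model name
```

A real version would also need to handle concurrent requests and failed loads, but the eviction loop above is the core of it.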