sillsdev / serval

A REST API for natural language processing services
MIT License

Real-time Inference dynamic loading/unloading of models #211

Open johnml1135 opened 1 year ago

johnml1135 commented 1 year ago

Assuming that we use clearml-serving for real-time inferencing, we may need to write our own dynamic loading/unloading algorithm. The reason is that the core Triton Inference Server from NVIDIA does not do it without buying the enterprise plan.

If we were to do this ourselves, we would need to make the explicit calls to the management API and implement a simple algorithm such as:
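
One possible shape for such an algorithm, sketched under assumptions: a least-recently-used (LRU) policy that keeps at most a fixed number of models loaded. The LRU policy, the class name `LruModelManager`, and the `load`/`unload` callables are all hypothetical and not taken from Serval or clearml-serving. In a real deployment the callables would wrap whatever management API is in use.

```python
from collections import OrderedDict
from typing import Callable, List


class LruModelManager:
    """Keep at most `capacity` models loaded, evicting the least recently used.

    `load` and `unload` are placeholders for calls to the serving layer's
    management API. With Triton running in explicit model control mode they
    could wrap the model repository endpoints
    (POST v2/repository/models/<name>/load and .../unload); that wiring is an
    assumption and is not shown here.
    """

    def __init__(
        self,
        capacity: int,
        load: Callable[[str], None],
        unload: Callable[[str], None],
    ) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._load = load
        self._unload = unload
        # Keys are model names, ordered from least to most recently used.
        self._loaded: "OrderedDict[str, None]" = OrderedDict()

    def ensure_loaded(self, name: str) -> None:
        """Call before routing an inference request to model `name`."""
        if name in self._loaded:
            # Already loaded: just mark it as most recently used.
            self._loaded.move_to_end(name)
            return
        # Make room by unloading the least recently used models first.
        while len(self._loaded) >= self._capacity:
            victim, _ = self._loaded.popitem(last=False)
            self._unload(victim)
        self._load(name)
        self._loaded[name] = None

    def loaded_models(self) -> List[str]:
        """Model names from least to most recently used."""
        return list(self._loaded)
```

A caller would invoke `ensure_loaded(name)` before each request. This sketch ignores concurrency, load failures, and models of different sizes, all of which a production version would have to handle.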

ddaspit commented 1 year ago

We have something like this implemented for SMT models.

johnml1135 commented 1 year ago

ClearML may have it already implemented:

We support that, but this is not dynamically loaded; this is just removing and adding models, and it does not unload them from the GRAM (GPU RAM). That's the main issue. When we unload the model, it is unloaded; to do this dynamically, they need to be able to save it in RAM and unload it from GRAM. That's the feature that is missing on all Triton deployments. Does that make sense?

johnml1135 commented 1 year ago

Just to confirm, yes, ClearML can do automatic loading/unloading, but each load/unload will take time. There is also CPU time spent deserializing (imagine unpickling a 20GB file; this takes time, and it is actually the main bottleneck, not just IO).
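
To make the distinction in the comments above concrete, here is a purely conceptual sketch of a two-tier cache: models evicted from a small "GPU" tier are demoted to a host-RAM tier instead of being dropped, so bringing them back skips the expensive deserialization step. It is bookkeeping only and performs no real device transfers. The class `TwoTierModelCache`, the slot counts, and the `deserialize` callable are hypothetical; it does not describe how Triton or clearml-serving behave.

```python
from collections import OrderedDict
from typing import Any, Callable


class TwoTierModelCache:
    """Track models across a GPU tier and a host-RAM tier (conceptual only)."""

    def __init__(
        self,
        gpu_slots: int,
        ram_slots: int,
        deserialize: Callable[[str], Any],
    ) -> None:
        if gpu_slots < 1 or ram_slots < 0:
            raise ValueError("need gpu_slots >= 1 and ram_slots >= 0")
        self._gpu_slots = gpu_slots
        self._ram_slots = ram_slots
        # Slow path: stands in for reading and unpickling a model from disk.
        self._deserialize = deserialize
        self._gpu: "OrderedDict[str, Any]" = OrderedDict()
        self._ram: "OrderedDict[str, Any]" = OrderedDict()

    def acquire(self, name: str) -> Any:
        """Return the model, promoting it to the GPU tier if necessary."""
        if name in self._gpu:
            self._gpu.move_to_end(name)
            return self._gpu[name]
        if name in self._ram:
            # Cheap path: the model is already deserialized in host RAM.
            model = self._ram.pop(name)
        else:
            # Expensive path: pay the deserialization cost.
            model = self._deserialize(name)
        if len(self._gpu) >= self._gpu_slots:
            # Demote the least recently used GPU model to host RAM.
            old_name, old_model = self._gpu.popitem(last=False)
            self._ram[old_name] = old_model
            if len(self._ram) > self._ram_slots:
                # Host RAM is full too: drop its oldest entry entirely.
                self._ram.popitem(last=False)
        self._gpu[name] = model
        return model
```

In this sketch a model that bounces between the two tiers is deserialized only once; it pays the full cost again only after falling out of the RAM tier as well.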

robosina commented 10 months ago

@johnml1135 @ddaspit, hello, I have a question about this: will this loading/unloading happen automatically, or do we need to do something to enable it? Also, if you could provide me with some links/documentation, I would appreciate it.

johnml1135 commented 10 months ago

@robosina - this is currently a wish-list item and a conceptual design; it has not been implemented in Serval. The core technology that would perform the loading/unloading would be https://github.com/allegroai/clearml-serving, which is a layer on top of https://www.nvidia.com/en-us/ai-data-science/products/triton-management-service/. I would review those products for dynamic loading/unloading.