We have something like this implemented for SMT models.
ClearML may already have this implemented:
We support that, but it is not dynamic loading; it is just removing and adding models, which does not keep them available after unloading them from GPU RAM. That's the main issue. When we unload a model, it is gone entirely. To do dynamic loading, we would need to be able to keep the model in system RAM while unloading it from GPU RAM; that's the feature that is missing on all Triton deployments. Does that make sense?
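To illustrate the distinction being described, here is a minimal PyTorch sketch (outside of Triton entirely; the layer, its size, and the availability of a CUDA device are all arbitrary assumptions for illustration):

```python
import torch

# "Unload" in today's Triton sense: the model is gone entirely and must be
# deserialized from disk again on the next load.
model = torch.nn.Linear(4096, 4096).cuda()  # weights live in GPU RAM
del model
torch.cuda.empty_cache()

# The missing "dynamic" behavior: keep the weights in system RAM so a later
# reload is a cheap host-to-device copy instead of a full deserialize.
model = torch.nn.Linear(4096, 4096).cuda()
model = model.cpu()        # move weights from GPU RAM to system RAM
torch.cuda.empty_cache()   # free the GPU memory
model = model.cuda()       # fast path back onto the GPU
```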
Just to confirm: yes, ClearML can do automatic loading/unloading, but each load/unload will take time. There is also CPU deserialization time (imagine unpickling a 20 GB file; this takes time... and is actually the main bottleneck, not just the IO).
@johnml1135 @ddaspit, hello, I have a question about this matter: will the loading/unloading happen automatically, or do we need to do something to enable it? Also, if you could provide me with some links/documentation, I would appreciate it.
@robosina - this is currently a wish-list item and a conceptual design; it has not been implemented in Serval. The core technology that would perform the loading/unloading is https://github.com/allegroai/clearml-serving, which is a layer on top of https://www.nvidia.com/en-us/ai-data-science/products/triton-management-service/. I would review those products for dynamic loading/unloading.
Assuming that we use clearml-serving for real-time inferencing, we may need to spin up our own dynamic loading/unloading algorithm, because the core Triton Inference Server from NVIDIA does not support it without buying the enterprise plan.
If we were to do this ourselves, we would need to make explicit calls to the model management API and implement a simple algorithm such as the sketch below:
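For illustration only, here is a minimal sketch of such an algorithm, assuming Triton is started with `--model-control-mode=explicit` so its model-repository extension (`POST /v2/repository/models/<name>/load` and `/unload`) is available. The LRU policy, the `MAX_LOADED` limit, the server URL, and the model name are all hypothetical choices, not a settled design:

```python
import requests
from collections import OrderedDict

# Assumptions: Triton runs with --model-control-mode=explicit, its HTTP
# endpoint is on localhost:8000, and GPU capacity is approximated as a
# fixed number of simultaneously loaded models.
TRITON_URL = "http://localhost:8000"
MAX_LOADED = 3


class LruModelManager:
    """Keep at most MAX_LOADED models loaded in Triton; evict the least recently used."""

    def __init__(self):
        self._loaded = OrderedDict()  # model name -> None; insertion order = recency

    def ensure_loaded(self, name):
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return
        # Evict until there is room, via Triton's model-repository extension.
        while len(self._loaded) >= MAX_LOADED:
            victim, _ = self._loaded.popitem(last=False)  # least recently used
            requests.post(f"{TRITON_URL}/v2/repository/models/{victim}/unload").raise_for_status()
        requests.post(f"{TRITON_URL}/v2/repository/models/{name}/load").raise_for_status()
        self._loaded[name] = None


# Usage: call before routing each inference request to its engine.
manager = LruModelManager()
manager.ensure_loaded("nllb_en_es")  # hypothetical model name
```

A real version would also need to handle concurrent requests and failed loads, but the eviction loop above is the core of it.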