MLHafizur opened 1 year ago
@MLHafizur -- do I understand you correctly, that in contrast to the existing Scale to Zero capabilities in ModelMesh:
If a given ServingRuntime has no InferenceServices that it supports, the Deployment for that runtime can safely be scaled to 0 replicas to save on resources. By enabling ScaleToZero in the configuration, ModelMesh Serving will perform this scaling automatically. If an InferenceService is later added that requires the runtime, it will be scaled back up.
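For reference, the existing behavior is toggled through the controller's user ConfigMap. A minimal sketch, assuming the `scaleToZero` keys match the project's documented config defaults (the namespace, grace period value, and comment are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
  namespace: modelmesh-serving
data:
  config.yaml: |
    scaleToZero:
      enabled: true
      # How long to wait after a runtime's last InferenceService is
      # deleted before scaling its Deployment down to 0 replicas.
      gracePeriodSeconds: 60
```

Note this only reacts to InferenceServices being created or deleted, not to request traffic, which is exactly the gap being discussed here.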
You would like to trigger a scale to zero even if there still are models loaded, if those models have no traffic for a given length of time? So, the new capability would be "unload model after certain period of inactivity"?
@ckadner Yes, exactly. Is it doable?
@njhill @tjohnson31415 -- what do you think?
I agree strongly with this statement 😄 :
> I understand that implementing this feature may require careful consideration and planning, as well as potential changes to the underlying architecture
Hey guys, since we would like to achieve the same result, I was wondering if it would be possible to integrate with KEDA, leveraging the etcd scaler?
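Something along these lines, perhaps. A hypothetical `ScaledObject` using KEDA's etcd scaler; the Deployment name, etcd endpoint, watch key, and threshold are all placeholders, since the actual key layout ModelMesh uses in etcd would need to be confirmed:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: modelmesh-runtime-scaler
spec:
  scaleTargetRef:
    name: modelmesh-serving-triton-2.x   # hypothetical runtime Deployment
  minReplicaCount: 0                     # allow scale to zero
  triggers:
    - type: etcd
      metadata:
        endpoints: etcd.modelmesh-serving:2379   # hypothetical endpoint
        watchKey: /serving/modelmesh/registry    # hypothetical key to watch
        value: "1"
```

The open question would be whether a suitable etcd key exists that reflects model activity (not just model registration), since scaling on registration alone would duplicate the existing ScaleToZero behavior.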
@andreapairon -- let's use this thread to continue the conversation from issue #388
Writing to increase visibility for this post.
Triton metrics like `nv_inference_count` or `nv_inference_request_duration_us` certainly provide idle-model information, so writing an unloader that unloads models after X minutes of inactivity should be somewhat straightforward, no?
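The detection side of such an unloader could be quite small. A sketch, assuming a scraper elsewhere turns Triton's `nv_inference_count` metric (mentioned above) into per-model cumulative counts; the function name and dict-based interface are hypothetical, and the actual unload call would go through the runtime's management API:

```python
def find_idle_models(prev_counts, curr_counts, last_active, now, idle_seconds):
    """Return models whose cumulative inference count has not changed
    for at least idle_seconds.

    prev_counts / curr_counts: model name -> cumulative request count
        from two successive metric scrapes.
    last_active: model name -> timestamp of last observed activity;
        updated in place across calls.
    """
    idle = []
    for model, count in curr_counts.items():
        if count != prev_counts.get(model):
            # Count advanced since the previous scrape: traffic was seen.
            last_active[model] = now
        elif now - last_active.setdefault(model, now) >= idle_seconds:
            idle.append(model)
    return idle
```

A periodic loop would scrape the metrics endpoint, call this with the previous snapshot, and issue unload requests for whatever comes back.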
I am currently evaluating ModelMesh Serving for one of my projects and have come across a useful feature in KServe that I would like to see implemented in ModelMesh Serving as well. The feature in question is the ability to scale down to zero, which allows for efficient resource usage when there is no incoming traffic. More information about this feature is available in the KServe documentation.
Feature Description:
Motivation:
Implementing this feature in ModelMesh Serving would offer several benefits, including:
Improved resource efficiency, as idle model replicas would not consume resources when they are not in use.
Lower costs for users who deploy ModelMesh Serving on cloud infrastructure or other platforms where resources are billed based on usage.
Enhanced flexibility, as users can optimize their ModelMesh Serving deployments to fit the specific traffic patterns and requirements of their use case.
I understand that implementing this feature may require careful consideration and planning, as well as potential changes to the underlying architecture. However, I believe that it would be a valuable addition to ModelMesh Serving and could significantly improve the user experience and adoption of the project.