MLHafizur opened 1 year ago
@MLHafizur -- do I understand you correctly, that in contrast to the existing Scale to Zero capabilities in ModelMesh:
If a given ServingRuntime has no InferenceServices that it supports, the Deployment for that runtime can safely be scaled to 0 replicas to save on resources. By enabling ScaleToZero in the configuration, ModelMesh Serving will perform this scaling automatically. If an InferenceService is later added that requires the runtime, it will be scaled back up.
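For reference, the existing behavior is toggled through the controller's user ConfigMap. A minimal sketch, assuming the `scaleToZero` keys match the project's documented config defaults (the namespace, grace period value, and comment are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
  namespace: modelmesh-serving
data:
  config.yaml: |
    scaleToZero:
      enabled: true
      # How long to wait after a runtime's last InferenceService is
      # deleted before scaling its Deployment down to 0 replicas.
      gracePeriodSeconds: 60
```

Note this only reacts to InferenceServices being created or deleted, not to request traffic, which is exactly the gap being discussed here.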
You would like to trigger a scale to zero even if there still are models loaded, if those models have no traffic for a given length of time? So, the new capability would be "unload model after certain period of inactivity"?
@ckadner Yes, exactly. Is it doable?
@njhill @tjohnson31415 -- what do you think?
I agree strongly with this statement 😄 :
> I understand that implementing this feature may require careful consideration and planning, as well as potential changes to the underlying architecture
Hey guys, since we would like to achieve the same result, I was wondering if it would be possible to integrate with KEDA, leveraging the etcd scaler?
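Something along these lines, perhaps. A hypothetical `ScaledObject` using KEDA's etcd scaler; the Deployment name, etcd endpoint, watch key, and threshold are all placeholders, since the actual key layout ModelMesh uses in etcd would need to be confirmed:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: modelmesh-runtime-scaler
spec:
  scaleTargetRef:
    name: modelmesh-serving-triton-2.x   # hypothetical runtime Deployment
  minReplicaCount: 0                     # allow scale to zero
  triggers:
    - type: etcd
      metadata:
        endpoints: etcd.modelmesh-serving:2379   # hypothetical endpoint
        watchKey: /serving/modelmesh/registry    # hypothetical key to watch
        value: "1"
```

The open question would be whether a suitable etcd key exists that reflects model activity (not just model registration), since scaling on registration alone would duplicate the existing ScaleToZero behavior.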
@andreapairon -- let's use this thread to continue the conversation from issue #388
Writing to increase visibility for this post.
Triton metrics like `nv_inference_count` or `nv_inference_request_duration_us` certainly provide idle-model information, so writing an unloader that unloads models after X minutes of inactivity should be somewhat straightforward, no?
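The detection side of such an unloader could be quite small. A sketch, assuming a scraper elsewhere turns Triton's `nv_inference_count` metric (mentioned above) into per-model cumulative counts; the function name and dict-based interface are hypothetical, and the actual unload call would go through the runtime's management API:

```python
def find_idle_models(prev_counts, curr_counts, last_active, now, idle_seconds):
    """Return models whose cumulative inference count has not changed
    for at least idle_seconds.

    prev_counts / curr_counts: model name -> cumulative request count
        from two successive metric scrapes.
    last_active: model name -> timestamp of last observed activity;
        updated in place across calls.
    """
    idle = []
    for model, count in curr_counts.items():
        if count != prev_counts.get(model):
            # Count advanced since the previous scrape: traffic was seen.
            last_active[model] = now
        elif now - last_active.setdefault(model, now) >= idle_seconds:
            idle.append(model)
    return idle
```

A periodic loop would scrape the metrics endpoint, call this with the previous snapshot, and issue unload requests for whatever comes back.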
I am currently evaluating ModelMesh Serving for one of my projects and have come across a useful feature in KServe that I would like to see implemented in ModelMesh Serving as well. The feature in question is the ability to scale down to zero, which allows for efficient resource usage when there is no incoming traffic. More information about this feature is available in the KServe documentation.
Feature Description:
Motivation:
Implementing this feature in ModelMesh Serving would offer several benefits, including:
Improved resource efficiency, as idle model replicas would not consume resources when they are not in use.
Lower costs for users who deploy ModelMesh Serving on cloud infrastructure or other platforms where resources are billed based on usage.
Enhanced flexibility, as users can optimize their ModelMesh Serving deployments to fit the specific traffic patterns and requirements of their use case.
I understand that implementing this feature may require careful consideration and planning, as well as potential changes to the underlying architecture. However, I believe that it would be a valuable addition to ModelMesh Serving and could significantly improve the user experience and adoption of the project.