kserve / modelmesh-serving

Controller for ModelMesh

Auto Scaling Model Mesh Custom Runtime Service #331

Closed: sidharthkumarpradhan closed this issue 7 months ago

sidharthkumarpradhan commented 1 year ago

Issue-1: We are trying to autoscale the custom deployed runtime. We have tried specifying the annotation and predictor parameters in the InferenceService manifest, but the scaling is not happening.

Example:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo
  namespace: model-mesh
  annotations:
    autoscaling.knative.dev/target: "10"
spec:
  predictor:
    minReplicas: 0
    containerConcurrency: 1
    scaleTarget: 10
    scaleMetric: concurrency
    model:
      modelFormat:
        name: custom-grpc-01
        version: "1"
      storageUri: <>
      runtime: muc-en-adv-mlserver-grpc-1.x.1

Issue-2:

Then we tried to scale the custom runtime deployment by creating an HPA (HorizontalPodAutoscaler). Although the runtime pods do get spun up, load is not distributed across all of the pods (load balancing is not happening), and pods are terminated as soon as they come up. Below is the HPA manifest we are using.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-mesh-es-adv-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-muc-en-adv-mlserver-grpc-1.x.1
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Kindly help us figure out an autoscaling solution for the custom runtime on ModelMesh. Thank you.

njhill commented 1 year ago

Hi @sidharthkumarpradhan ... model-mesh does not work with Knative; regular kube Deployments are created/managed by the modelmesh-serving controller.

Because model-mesh was designed to manage large numbers of smallish models, autoscaling happens by loading/unloading copies of models within a static set of pods. You can configure how many pods each runtime gets, but that count is not dynamic currently. It will scale a runtime to zero, however, if no InferenceServices exist that need it.
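For example, the static pod count can be raised globally with podsPerRuntime in the model-serving-config ConfigMap (or per runtime via the replicas field on the ServingRuntime spec). A minimal sketch, assuming the controller runs in the modelmesh-serving namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
  namespace: modelmesh-serving  # namespace where the controller is installed
data:
  config.yaml: |
    # fixed number of pods per runtime deployment (default: 2)
    podsPerRuntime: 3

The controller picks this up and resizes the runtime Deployments accordingly; it just won't adjust the count based on load.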

So currently it is not compatible with HPA; the HPA would fight the mm-serving controller over the replica count. There are plans to add a config option to make this possible soon, however; see https://github.com/kserve/modelmesh-serving/issues/329

sidharthkumarpradhan commented 1 year ago

Thanks @njhill for your valuable input. Could you kindly tell us when the HPA support will be available? It would be really helpful for our case. In the meantime, could you suggest a way to scale, if it is possible by any means? Thank you.