kserve / modelmesh-serving

Controller for ModelMesh

How to Control the Number of Model Replicas in ModelMesh Serving #500

Open michael-nammi opened 7 months ago

michael-nammi commented 7 months ago

Description

I am working with ModelMesh Serving deployed on a Kubernetes cluster, and I am looking for a way to control the number of replicas for a specific model. My setup includes a Triton runtime with two pods, and I am serving a MobileNet model. I want to be able to set the number of replicas of this model to a specific value.

Cluster State:

The state of pods in my cluster is as follows:

NAME                                           READY   STATUS    RESTARTS   AGE
etcd-bcc445f46-gnmw6                           1/1     Running   0          2d21h
minio-67577699d-frm4s                          1/1     Running   0          2d21h
modelmesh-controller-5fd6b98c4f-h4njm          1/1     Running   0          65s
modelmesh-serving-triton-2.x-9849f97c6-54gh7   4/4     Running   0          18s
modelmesh-serving-triton-2.x-9849f97c6-qndvd   4/4     Running   0          18s
traefik-78db748568-cmn4x                       1/1     Running   0          2d21h

InferenceService status:

NAME                     URL                                               READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
example-mobilenet-isvc   grpc://modelmesh-serving.modelmesh-serving:8033   True                                                                  40s

The InferenceService for mobilenet (example-mobilenet-isvc) has minReplicas set to 2, as shown in the kubectl describe output below:

Name:         example-mobilenet-isvc
Namespace:    modelmesh-serving
Labels:       <none>
Annotations:  serving.kserve.io/deploymentMode: ModelMesh
API Version:  serving.kserve.io/v1beta1
Kind:         InferenceService
Metadata:
  Creation Timestamp:  2024-04-18T03:01:25Z
  Generation:          1
  Resource Version:    454691
  UID:                 f34abe33-606f-4fbd-95e4-a67829f7dac0
Spec:
  Predictor:
    Min Replicas:  2
    Model:
      Model Format:
        Name:   onnx
      Runtime:  triton-2.x
      Storage:
        Key:  minio
        Parameters:
          Bucket:  modelmesh-serving
        Path:      mobilenetv2-7.onnx
Status:
  Components:
    Predictor:
      Grpc URL:  grpc://modelmesh-serving.modelmesh-serving:8033
      Rest URL:  http://modelmesh-serving.modelmesh-serving:8008
      URL:       grpc://modelmesh-serving.modelmesh-serving:8033
  Conditions:
    Last Transition Time:  2024-04-18T03:01:40Z
    Status:                True
    Type:                  PredictorReady
    Last Transition Time:  2024-04-18T03:01:40Z
    Status:                True
    Type:                  Ready
  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   1
    States:
      Active Model State:  Loaded
      Target Model State:
    Transition Status:     UpToDate
  URL:                     grpc://modelmesh-serving.modelmesh-serving:8033
Events:                    <none>
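
For reference, the same InferenceService reconstructed from the describe output above as an applyable manifest (same names and values; nothing added beyond the fields shown):

kubectl apply -n modelmesh-serving -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-mobilenet-isvc
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    # the setting in question: request at least 2 copies of the model
    minReplicas: 2
    model:
      modelFormat:
        name: onnx
      runtime: triton-2.x
      storage:
        key: minio
        parameters:
          bucket: modelmesh-serving
        path: mobilenetv2-7.onnx
EOF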

ETCD Keys and Values:

Relevant data from etcd suggests that only one copy of the model is active, per the instanceIds and count fields:

{"hostname":"10.244.0.174","instanceId":"9f97c6-qndvd","port":8080,"version":"20230801-7b484","registrationTime":1713409288745,"connConfig":{"transport.tprotocol.factory":"org.apache.thrift.protocol.TCompactProtocol$Factory","transport.framed":"false","transport.ssl.enabled":"false","transport.extrainfo_supported":"true","service.class":"com.ibm.watson.modelmesh.thrift.ModelMeshService","methodinfo.applyModelMulti":"idp=t","methodinfo.applyModel":"idp=t","app.kv_store_type":"etcd"}}
/litelinks/modelmesh-serving/10.244.0.175_8080_18eef26ea30
{"hostname":"10.244.0.175","instanceId":"9f97c6-54gh7","port":8080,"version":"20230801-7b484","registrationTime":1713409288755,"connConfig":{"transport.tprotocol.factory":"org.apache.thrift.protocol.TCompactProtocol$Factory","transport.framed":"false","transport.ssl.enabled":"false","transport.extrainfo_supported":"true","service.class":"com.ibm.watson.modelmesh.thrift.ModelMeshService","methodinfo.applyModel":"idp=t","methodinfo.applyModelMulti":"idp=t","app.kv_store_type":"etcd"}}
/mm/modelmesh-serving/instances/9f97c6-54gh7
{"startTime":1713409287610,"loc":"172.18.0.2","labels":["mt:keras","mt:keras:2","mt:onnx","mt:onnx:1","mt:pytorch","mt:pytorch:1","mt:tensorflow","mt:tensorflow:1","mt:tensorflow:2","mt:tensorrt","mt:tensorrt:7","pv:grpc-v2","pv:v2","rt:triton-2.x"],"actionable":true,"lruTime":1713407522245,"count":1,"cap":48661,"used":123,"lThreads":2,"lInProg":1}
/mm/modelmesh-serving/instances/9f97c6-qndvd
{"startTime":1713409287621,"loc":"172.18.0.2","labels":["mt:keras","mt:keras:2","mt:onnx","mt:onnx:1","mt:pytorch","mt:pytorch:1","mt:tensorflow","mt:tensorflow:1","mt:tensorflow:2","mt:tensorrt","mt:tensorrt:7","pv:grpc-v2","pv:v2","rt:triton-2.x"],"actionable":true,"lruTime":1713407522368,"count":1,"cap":48661,"used":2174,"lThreads":2}
/mm/modelmesh-serving/leaderLatch/_9f97c6-54gh7
_9f97c6-54gh7
/mm/modelmesh-serving/leaderLatch/_9f97c6-qndvd
_9f97c6-qndvd
/mm/modelmesh-serving/registry/example-mobilenet-isvc__isvc-0b5941bbd0
{"type":"rt:triton-2.x","encKey":"{\"storage_key\":\"minio\",\"storage_params\":{\"bucket\":\"modelmesh-serving\"},\"model_type\":{\"name\":\"onnx\"}}","mPath":"mobilenetv2-7.onnx","autoDel":true,"instanceIds":{"9f97c6-qndvd":1713409297527},"refs":1,"lu":1713407522368}
/mm/modelmesh-serving/vmodels/example-mobilenet-isvc
{"o":"isvc","amid":"example-mobilenet-isvc__isvc-0b5941bbd0","tmid":"example-mobilenet-isvc__isvc-0b5941bbd0"}

Question:

How can one ensure that ModelMesh Serving adheres to the minReplicas configuration for a specific model? The documentation does not seem to cover scaling individual model replicas across the serving pods in any depth. Is there a way to control the number of model replicas in ModelMesh Serving?

haiminh2001 commented 4 months ago

Hi @michael-nammi, have you found a solution?

haiminh2001 commented 2 months ago

I found this doc in the model-mesh repository. Hope it helps.

mafs12 commented 1 month ago

I'm also trying to understand how to set replicas to a specific, fixed number. It seems model autoscaling is on by default, so I'm not sure if that is possible. Maybe it's only possible by setting minReplicas=maxReplicas?
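
For concreteness, pinning minReplicas equal to maxReplicas would be a patch like the one below; whether ModelMesh actually honors these predictor fields for per-model copies is exactly the open question in this thread:

# hypothetical workaround: pin min and max replicas to the same value
kubectl -n modelmesh-serving patch inferenceservice example-mobilenet-isvc \
  --type merge \
  -p '{"spec":{"predictor":{"minReplicas":1,"maxReplicas":1}}}'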

haiminh2001 commented 1 month ago

> I'm also trying to understand how to set replicas to a specific, fixed number. It seems model autoscaling is on by default, so I'm not sure if that is possible. Maybe it's only possible by setting minReplicas=maxReplicas?

Unfortunately, there is no way to set a fixed number of replicas for a particular model; you can only control it indirectly via the concurrency settings of the serving runtime servers. As far as I know, there are two mechanisms behind model scaling:

mafs12 commented 1 month ago

So, there's no way to stick with one replica!? I'm observing that most of the time I have 2 replicas of the model (MM v0.12.0). That's not great when deploying LLMs. Setting maxReplicas=1 has no effect, right?

mafs12 commented 1 month ago

> So, there's no way to stick with one replica!? I'm observing that most of the time I have 2 replicas of the model (MM v0.12.0). That's not great when deploying LLMs. Setting maxReplicas=1 has no effect, right?

It turns out I had two Prometheus jobs collecting the same metrics, so the aggregated results were duplicated. I actually have 7 models and 7 copies.

haiminh2001 commented 1 month ago

The idea here is that ModelMesh controls the number of replicas of the serving runtime pods, not of the models themselves. You can definitely set the maxReplicas of the serving runtimes.

The fact that your model scaled up means there was spare capacity in your serving runtimes, and scaling up in that case makes sense to me. If you want to prevent it, I think you can set maxReplicas or increase the model's scale-up threshold.
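
For the runtime-level knobs, a minimal sketch of what that looks like, assuming the model-serving-config ConfigMap described in the modelmesh-serving configuration docs (key names and defaults may differ between versions):

kubectl apply -n modelmesh-serving -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    # number of deployment replicas created for each ServingRuntime
    podsPerRuntime: 2
    # keep runtime pods up even when no models are assigned to them
    scaleToZero:
      enabled: false
EOF

Individual ServingRuntimes can also set spec.replicas to override the global podsPerRuntime, if your version exposes that field.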

By the way, I have only just deployed ModelMesh in my company's staging environment (we are going to roll it out to production soon), so there may be something I'm missing about how ModelMesh controls the number of replicas; that part baffled me quite a bit.