Open michael-nammi opened 7 months ago
Hi @michael-nammi, have you found a solution?
I found this doc in the model-mesh repository. Hope it helps.
I'm also trying to understand how to set replicas to a specific, fixed number. It seems model autoscaling is on by default, so I'm not sure if that is possible. Maybe it's only possible by setting minReplicas=maxReplicas?
Unfortunately, there is no way to set a fixed number of replicas for a given model; you can only control it indirectly via the concurrency settings of the serving runtime servers. As far as I know, there are two kinds of model-scaling logic:
So, there's no way to stick with one replica!? I'm observing that most of the time I have 2 replicas of the model (MM v0.12.0). That's not great when deploying LLMs. Setting maxReplicas=1 has no effect, right?
It seems I had 2 Prometheus jobs collecting the same metrics, so after aggregation I got duplicated results. I actually have 7 models and 7 copies.
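For anyone hitting the same thing, here is a minimal sketch of how two scrape jobs can double the apparent copy count, and how ignoring the job label before summing avoids it. The sample data and the label names (`job`, `model`, `pod`) are made up for illustration; they are not the actual ModelMesh metric labels.

```python
from collections import defaultdict

# Hypothetical scraped samples: (labels, value). Two jobs scrape the same pods,
# so every series appears twice, differing only in the "job" label.
samples = [
    ({"job": "scrape-a", "model": "m1", "pod": "runtime-0"}, 1),
    ({"job": "scrape-b", "model": "m1", "pod": "runtime-0"}, 1),
    ({"job": "scrape-a", "model": "m2", "pod": "runtime-1"}, 1),
    ({"job": "scrape-b", "model": "m2", "pod": "runtime-1"}, 1),
]


def naive_copies_per_model(samples):
    """Sum every sample per model; duplicates from a second job are counted twice."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[labels["model"]] += value
    return dict(totals)


def deduped_copies_per_model(samples, ignore=("job",)):
    """Sum per model, keeping only one sample per label set once `ignore` labels are dropped."""
    seen = set()
    totals = defaultdict(int)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in ignore))
        if key in seen:
            continue  # same series already counted via the other scrape job
        seen.add(key)
        totals[labels["model"]] += value
    return dict(totals)


naive = naive_copies_per_model(samples)
deduped = deduped_copies_per_model(samples)
```

With the sample data above, the naive sum reports two copies per model while the deduplicated sum reports one, which matches the "2 replicas" versus "7 models and 7 copies" confusion.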
The idea here is that ModelMesh controls the number of replicas of the serving runtimes, not of the models. You can definitely set the maxReplicas of the serving runtimes.
The fact that your model scaled up means there was available capacity in your serving runtimes. Scaling up in that case makes sense to me. If you want to prevent that, I think you can set the maxReplicas or increase the model's scale-up threshold.
By the way, I have only just deployed ModelMesh in my company's staging environment, and we are going to deploy it to production soon, so perhaps I am missing something about how ModelMesh controls the number of replicas (that baffled me a lot).
Description
I am working with ModelMesh Serving deployed on a Kubernetes cluster and I am looking for a way to control the number of replicas for a specific model. My setup includes a Triton runtime with two pods, and I'm serving a model named mobilenet. I want to be able to set the number of model replicas to a specific value.
Cluster State:
The state of pods in my cluster is as follows:
Inference service status
The InferenceService for mobilenet (example-mobilenet-isvc) has minReplicas set to 2, as shown in the description below:
ETCD Keys and Values:
Relevant data from ETCD suggests that only one replica of the model is active, judging by the instanceIds and count:
Question:
How can one ensure that ModelMesh Serving adheres to the minReplicas configuration for a specific model? The documentation does not seem to discuss in depth how individual model replicas are scaled across the serving pods. Is there a way to control the model replicas in ModelMesh Serving?
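The check described above (comparing the configured minReplicas with the number of entries under instanceIds) can be sketched roughly as follows. The record shape is an assumption made for illustration: it only presumes a JSON value with an `instanceIds` field holding one entry per pod that has the model loaded. The sample record and the helper names are hypothetical, not actual output from this cluster.

```python
import json


def loaded_copies(record_json: str) -> int:
    """Count entries under `instanceIds` in a JSON model record (assumed shape)."""
    record = json.loads(record_json)
    # len() works whether instanceIds is a mapping or a list; missing means zero copies.
    return len(record.get("instanceIds", {}))


def meets_min_replicas(record_json: str, min_replicas: int) -> bool:
    """True if the record shows at least `min_replicas` loaded copies."""
    return loaded_copies(record_json) >= min_replicas


# Hypothetical record with a single loaded copy (placeholder instance id and value).
example_record = json.dumps({"instanceIds": {"instance-a": 0}})

copies = loaded_copies(example_record)
satisfied = meets_min_replicas(example_record, 2)
```

With minReplicas set to 2 and a single entry under instanceIds, `satisfied` is false, which is the mismatch this issue is asking about.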