Closed: hanwgyu closed this issue 2 years ago
I'm not able to reproduce the issue; I see the pod scale down to zero when there are no requests, but otherwise it stays at 2. Can you share more about your service YAML and autoscaling config?
@nader-ziada I used KServe:

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-sample"
  annotations:
    autoscaling.knative.dev/target: "70"
    autoscaling.knative.dev/metric: "rps"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
      resources:
        limits:
          cpu: 10
          memory: 10Gi
          nvidia.com/gpu: 1
        requests:
          cpu: 7
          memory: 7Gi
          nvidia.com/gpu: 1
```
The only other change I made was setting `target-burst-capacity` to 0 in the autoscaling config. I'm using Knative Serving v0.25.2.
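For reference, a minimal sketch of how that setting is typically applied cluster-wide, via the `config-autoscaler` ConfigMap in the `knative-serving` namespace (exact keys assumed from the Knative Serving autoscaler config; verify against your installed version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # 0 disables burst capacity, so traffic can bypass the activator
  # once the revision has ready pods.
  target-burst-capacity: "0"
```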
I have run into a bunch of issues trying out KServe; can you try your scenario on Knative directly?
@nader-ziada I found out that it was due to a cold start of the TensorFlow Serving server; it wasn't a KServe issue. When I changed `scale-down-delay` to 5m, the replicas stayed at 2. Thank you for your help.
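For anyone hitting the same symptom, a sketch of setting the delay per revision via the Knative annotation (this is the per-revision form; a cluster-wide default can also be set in the autoscaler ConfigMap):

```yaml
metadata:
  annotations:
    # Keep pods around for 5 minutes after load drops before scaling down,
    # which absorbs cold-start flapping.
    autoscaling.knative.dev/scale-down-delay: "5m"
```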
@hanwgyu thanks for following up; good to know `scale-down-delay` resolved it.
I set `autoscaling.knative.dev/target: "70"`, so the effective per-pod target is 49 (70 * 0.7). After setting the client QPS to 60, the server scales up and down repeatedly. I expected the number of Pods to remain at 2. What configuration should I set for the expected behavior?
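The expectation of 2 pods follows from the autoscaler arithmetic. A rough sketch of that calculation (an illustration of the ratio, not Knative's actual implementation, which also averages over stable and panic windows):

```python
import math

def desired_replicas(observed_rps: float, target: float, utilization: float = 0.7) -> int:
    """Approximate desired pod count: observed load divided by the
    effective per-pod target (annotation target * utilization factor)."""
    effective_target = target * utilization  # 70 * 0.7 = 49 rps per pod
    return math.ceil(observed_rps / effective_target)

print(desired_replicas(60, 70))  # 60 / 49 ~= 1.22, rounds up to 2
```

Because 1.22 sits close to the boundary, small fluctuations in the measured request rate can flip the result between 1 and 2, which is consistent with the flapping described above.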