knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

When Autoscaling with rps, scale up and down repeatedly #12765

Closed hanwgyu closed 2 years ago

hanwgyu commented 2 years ago

Ask your question here:

I set autoscaling.knative.dev/target: "70", so the effective target is 49 rps (70 * 0.7 utilization).

After setting the client qps to 60, the server scales up and down repeatedly.

I expected the number of Pods to remain at 2.

What configuration should I set up for expected behavior?
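For context, the expectation of 2 pods follows from the autoscaler's basic arithmetic: desired pods are roughly the observed rps divided by the effective per-pod target (target × utilization). A minimal sketch of that calculation (names here are illustrative, not Knative's actual internals):

```python
import math

def desired_pods(observed_rps: float, target: float, utilization: float = 0.7) -> int:
    """Simplified model of the rps-based scaling decision."""
    effective_target = target * utilization  # e.g. 70 * 0.7 = 49 rps per pod
    return max(1, math.ceil(observed_rps / effective_target))

# At a steady 60 rps with target 70, ceil(60 / 49) = 2 pods.
print(desired_pods(60, 70))  # 2
```

Since 60 rps is only slightly above one pod's effective target of 49, small fluctuations in measured rps can push the desired count across the 1/2 boundary, which is consistent with the flapping described above.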


nader-ziada commented 2 years ago

I'm not able to reproduce the issue, I see the pod scale down to zero when there are no requests, but otherwise stays at 2. Can you share more about your service yaml and autoscaling config?

hanwgyu commented 2 years ago

@nader-ziada I used kserve

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-sample"
  annotations:
    autoscaling.knative.dev/target: "70"
    autoscaling.knative.dev/metric: "rps"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
      resources:
        limits:
          cpu: 10
          memory: 10Gi
          nvidia.com/gpu: 1
        requests:
          cpu: 7
          memory: 7Gi
          nvidia.com/gpu: 1
```

And in the autoscaling config I only changed target-burst-capacity to 0.
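For reference, that setting lives in the config-autoscaler ConfigMap in the knative-serving namespace; a minimal sketch of the change described above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # 0 disables burst capacity, so the activator is not kept in the request path
  target-burst-capacity: "0"
```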

I'm using knative serving v0.25.2

nader-ziada commented 2 years ago

I ran into a bunch of issues trying out kserve; can you try your scenario on knative directly?

hanwgyu commented 2 years ago

@nader-ziada I found out that it was due to the cold start of the tensorflow serving server; it wasn't a kserve issue. When I changed scale-down-delay to 5m, the replica count stayed at 2.
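For anyone hitting the same flapping: scale-down-delay can be set per revision via an annotation (the cluster-wide default is in the config-autoscaler ConfigMap). A sketch of the per-revision form:

```yaml
metadata:
  annotations:
    # Keep the current replica count for 5 minutes before scaling down
    autoscaling.knative.dev/scale-down-delay: "5m"
```

This keeps pods around through brief dips in traffic, so a slow-starting server (like tensorflow serving here) is not repeatedly cold-started.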

Thank you for your help.

ssilb4 commented 8 months ago

@hanwgyu thanks for the answer about scale-down-delay