knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

KPA and cluster autoscaler compatibility #14939

Open hyde404 opened 4 months ago

hyde404 commented 4 months ago

Ask your question here:

Hello,

I'm setting up an infrastructure based on scale-to-zero, and therefore scale-from-zero as well. For node scaling we're using the cluster autoscaler coupled with Cluster API (specifically a MachineDeployment resource with some annotations).
The node scaling is working fine.
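
For context, the MachineDeployment is annotated for scale-from-zero along these lines (a sketch with illustrative values rather than our exact manifest; the annotation names are the ones documented for the cluster autoscaler's Cluster API provider):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: gpu-nodes
  annotations:
    # Node group bounds for the cluster autoscaler
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "30"
    # Capacity hints so the autoscaler can simulate a node while the group is at zero
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/memory: "96Gi"
    capacity.cluster-autoscaler.kubernetes.io/gpu-count: "1"
    capacity.cluster-autoscaler.kubernetes.io/gpu-type: nvidia.com/gpu
    capacity.cluster-autoscaler.kubernetes.io/labels: nvidia.com/gpu.count=1,nvidia.com/gpu.product=NVIDIA-GeForce-RTX-2080-Ti
spec:
  # ... usual MachineDeployment spec (clusterName, replicas, selector, template)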

For the moment, I'm just trying to create an "autoscale-go" Knative Service on a cluster where no suitable node is available. The pod is then Pending, which is expected:

NAME                                             READY   STATUS    RESTARTS   AGE
user-service-00001-deployment-6f6d577c45-rtjvz   0/2     Pending   0          1m32s

Here is the configuration I used to create the service:

apiVersion: v1
kind: Namespace
metadata:
  name: 6d2ef157
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: user-service
  namespace: 6d2ef157
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/max-scale: "10"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/scale-down-delay: "15m"
        autoscaling.knative.dev/window: "240s"
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1800s"
      creationTimestamp: null
    spec:
      containerConcurrency: 50
      containers:
      - env:
        - name: TARGET
          value: Sample
        image: ghcr.io/knative/autoscale-go:latest
        name: app
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources:
          limits:
            cpu: "12"
            memory: 78Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "12"
            memory: 78Gi
            nvidia.com/gpu: "1"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - CAP_SYS_ADMIN
          runAsNonRoot: true
          runAsUser: 1000
          seccompProfile:
            type: RuntimeDefault
      enableServiceLinks: false
      nodeSelector:
        nvidia.com/gpu.count: "1"
        nvidia.com/gpu.product: NVIDIA-GeForce-RTX-2080-Ti
      runtimeClassName: nvidia
      timeoutSeconds: 1800
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
  traffic:
  - latestRevision: true
    percent: 100

After a few minutes, the pod is still Pending, but we get an event showing that the cluster autoscaler has been triggered:

Normal   TriggeredScaleUp  2m16s  cluster-autoscaler  pod triggered scale-up: [{MachineDeployment/gpu-nodes 0->1 (max: 30)}]
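
(This event was read from the pending pod's events, e.g. with:)

kubectl -n 6d2ef157 describe pod user-service-00001-deployment-6f6d577c45-rtjvz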

Once the node is available, the pod is scheduled and reaches Running:

NAME                                             READY   STATUS    RESTARTS   AGE
user-service-00001-deployment-6f6d577c45-rtjvz   2/2     Running   0          6m7s

However, the service is not ready, and the revision never becomes ready:

NAME           URL                               LATESTCREATED        LATESTREADY   READY   REASON
user-service   http://6d2ef157.some.domain.net   user-service-00001                 False   RevisionMissing
NAME                 CONFIG NAME    K8S SERVICE NAME   GENERATION   READY   REASON          ACTUAL REPLICAS   DESIRED REPLICAS
user-service-00001   user-service                      1            False   Unschedulable   1                 0
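
For completeness, the listings above come from the usual get commands in the service namespace, e.g.:

kubectl -n 6d2ef157 get ksvc user-service
kubectl -n 6d2ef157 get revision user-service-00001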

These are the events I get from the revision:

Warning  InternalError  7m29s  revision-controller  failed to update deployment "user-service-00001-deployment": Operation cannot be fulfilled on deployments.apps "user-service-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
Warning  InternalError  7m29s  revision-controller  failed to update PA "user-service-00001": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "user-service-00001": the object has been modified; please apply your changes to the latest version and try again  

The PodAutoscaler resource is not ready, and the DesiredScale is 0.

NAME                 DESIREDSCALE   ACTUALSCALE   READY   REASON
user-service-00001   0              1             False   NoTraffic

The status of the PodAutoscaler resource:

Status:
  Actual Scale:  1
  Conditions:
    Last Transition Time:  2024-02-23T16:32:02Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Status:                False
    Type:                  Active
    Last Transition Time:  2024-02-23T16:32:02Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-23T16:38:03Z
    Status:                True
    Type:                  SKSReady
    Last Transition Time:  2024-02-23T16:32:02Z
    Status:                True
    Type:                  ScaleTargetInitialized
  Desired Scale:           0
  Metrics Service Name:    user-service-00001-private
  Observed Generation:     2
  Service Name:            user-service-00001

I got error logs from the autoscaler pod
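
These were pulled from the autoscaler deployment in the knative-serving namespace, e.g.:

kubectl -n knative-serving logs deploy/autoscaler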

{"severity":"ERROR","timestamp":"2024-02-23T15:55:24.847361414Z","logger":"autoscaler","caller":"podautoscaler/reconciler.go:314","message":"Returned an error","commit":"239b73e","knative.dev/controller":"knative.dev.serving.pkg.reconciler.autoscaling.kpa.Reconciler","knative.dev/kind":"autoscaling.internal.knative.dev.PodAutoscaler","knative.dev/traceid":"2c39855d-329c-43a0-99a9-204f4944e4af","knative.dev/key":"3010eb09/user-service-00001","targetMethod":"ReconcileKind","error":"error scaling target: failed to get scale target {Deployment  user-service-00001-deployment  apps/v1  }: error fetching Pod Scalable 3010eb09/user-service-00001-deployment: deployments.apps \"user-service-00001-deployment\" not found","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/autoscaling/v1alpha1/podautoscaler.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/autoscaling/v1alpha1/podautoscaler/reconciler.go:314\nmain.(*leaderAware).Reconcile\n\tknative.dev/serving/cmd/autoscaler/leaderelection.go:44\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:491"}
{"severity":"ERROR","timestamp":"2024-02-23T15:55:24.847442144Z","logger":"autoscaler","caller":"controller/controller.go:566","message":"Reconcile error","commit":"239b73e","knative.dev/controller":"knative.dev.serving.pkg.reconciler.autoscaling.kpa.Reconciler","knative.dev/kind":"autoscaling.internal.knative.dev.PodAutoscaler","knative.dev/traceid":"2c39855d-329c-43a0-99a9-204f4944e4af","knative.dev/key":"3010eb09/user-service-00001","duration":"787.035µs","error":"error scaling target: failed to get scale target {Deployment  user-service-00001-deployment  apps/v1  }: error fetching Pod Scalable 3010eb09/user-service-00001-deployment: deployments.apps \"user-service-00001-deployment\" not found","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/controller/controller.go:491"}

The spec of the PodAutoscaler resource:

spec:
  containerConcurrency: 50
  protocolType: http1
  reachability: Unreachable
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service-00001-deployment

I then manually changed reachability from "Unreachable" to "" and desiredScale from 0 to 1, after which the revision and service became ready:

NAME                 CONFIG NAME    K8S SERVICE NAME   GENERATION   READY   REASON   ACTUAL REPLICAS   DESIRED REPLICAS
user-service-00001   user-service                      1            True             1                 1
NAME           URL                               LATESTCREATED        LATESTREADY          READY   REASON
user-service   http://6d2ef157.some.domain.net   user-service-00001   user-service-00001   True    
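
For the record, that manual change is roughly equivalent to the following patches (a sketch; editing the status field requires the status subresource, available via --subresource=status on recent kubectl versions):

kubectl -n 6d2ef157 patch podautoscaler user-service-00001 \
  --type merge -p '{"spec":{"reachability":""}}'
kubectl -n 6d2ef157 patch podautoscaler user-service-00001 --subresource=status \
  --type merge -p '{"status":{"desiredScale":1}}'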

The configuration I tried

I started playing with the configuration, trying to find the parameter that would unlock everything, but without success. Please note that the values are intentionally exaggerated in an attempt to highlight a pattern.

config-autoscaler:

apiVersion: v1
data:
  allow-zero-initial-scale: "true"
  enable-scale-to-zero: "true"
  initial-scale: "0"
  scale-down-delay: 15m
  scale-to-zero-grace-period: 1800s
  scale-to-zero-pod-retention-period: 1800s
  stable-window: 360s
  target-burst-capacity: "211"
  window: 240s
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving

config-deployment:

apiVersion: v1
data:
  progress-deadline: 3600s
  queue-sidecar-image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:d569f30abd31cbe105ba32b512a321dd82431b0a8e205bebf14538fddb4dfa54
  queueSidecarImage: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:9b8dad0630029dfcab124e6b4fa7c8e39b453249f0b31282c48e008bfc16faa3
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving

config-defaults:

apiVersion: v1
data:
  max-revision-timeout-seconds: "3600"
  revision-response-start-timeout-seconds: "1800"
  revision-timeout-seconds: "1800"
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving

The problem I'm facing

I'm not sure what I'm doing wrong. It looks as if the revision is never reconciled, but I'm not certain.
The pod is running and the service is created, but the revision never becomes ready, which is why the Service as a whole is not ready, and it's a bit of a mystery.

Could you please help me understand what is wrong with my configuration?

JunfeiZhang commented 3 months ago

Hi @hyde404, we are facing the same issue. Have you resolved it?

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

skonto commented 6 days ago

/remove-lifecycle stale

skonto commented 6 days ago

Hi @hyde404, "The object has been modified; please apply your changes to the latest version" is a transient error; you can ignore it.

Do you have a setup so I can try to reproduce the issue? It seems that the deployment is missing for some reason. Did you update your service somehow when the autoscaler kicked in?