External scaler connection errors ignored, the HPA is missing metrics

vrok commented 4 months ago

Report

When I install a Helm chart containing both an external scaler gRPC service and a ScaledObject, the resulting HPA has an empty list of metrics (K8s inserts the default 80% CPU utilization metric in that case). It then stays in that state even after the external scaler gRPC service has been initialized (I can manually force it to re-reconcile by editing the ScaledObject).

This happens because Helm installs the external scaler service and the ScaledObject at the same time. The external scaler's gRPC server isn't available immediately (the pod takes ~1 sec to start), so KEDA reconciles the ScaledObject before the external scaler is available and ignores the gRPC connection error.

Expected Behavior

In my opinion, it would be better if KEDA re-queued the reconciliation request in these situations. For example, Reconcile() in scaledobject_controller.go could return ctrl.Result{RequeueAfter: time.Minute} if a gRPC connection error was observed.

Actual Behavior

KEDA doesn't update the HPA even after the external scaler is available.

Steps to Reproduce the Problem

1. Install a ScaledObject resource using an external scaler
2. Install the external scaler's gRPC service (1 and 2 should happen at roughly the same time, e.g., by being installed as part of the same Helm chart)
3. Notice that the HorizontalPodAutoscaler created by KEDA is missing the metric specified in the ScaledObject

Logs from KEDA operator

No response

KEDA Version

None

Kubernetes Version

None

Platform

Any

Scaler Details

No response

Anything else?

No response

JorTurFer commented 4 months ago

Hello, what KEDA version are you using? This error shouldn't happen because KEDA tries to reconcile the ScaledObjects automatically. Do you see any errors in the KEDA operator logs?

vrok commented 4 months ago

@JorTurFer I'm on 2.14.0 (but I tested the main branch yesterday and the problem occurred too).

This is the ScaledObject definition:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
    scaledobject.keda.sh/name: scaledobject-workers
  name: scaledobject-workers
  namespace: default
spec:
  scaleTargetRef:
    kind: Deployment
    name: scheduler
  triggers:
  - metadata:
      scalerAddress: scheduler-scaler.default.svc.cluster.local:8080
    type: external-push

And this is the HPA that gets created - notice that the list of metrics only contains a CPU-based metric (this is the default one inserted by K8s):

apiVersion: v1
items:
- apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    annotations:
      meta.helm.sh/release-name: scheduler
      meta.helm.sh/release-namespace: default
    creationTimestamp: "2024-05-08T14:49:33Z"
    labels:
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: keda-hpa-scaledobject-workers
      app.kubernetes.io/part-of: scaledobject-workers
      app.kubernetes.io/version: 2.14.0
      scaledobject.keda.sh/name: scaledobject-workers
    name: keda-hpa-scaledobject-workers
    namespace: default
    ownerReferences:
    - apiVersion: keda.sh/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ScaledObject
      name: scaledobject-workers
      uid: 1c21176d-71bc-4de2-9740-9fe03f5f66d7
    resourceVersion: "2777064"
    uid: a272a347-f011-499f-92e5-fa08d650f985
  spec:
    maxReplicas: 100
    metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
    minReplicas: 1
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: scheduler
  status:
    conditions:
    - lastTransitionTime: "2024-05-08T14:49:48Z"
      message: the HPA controller was able to get the target's current scale
      reason: SucceededGetScale
      status: "True"
      type: AbleToScale
    - lastTransitionTime: "2024-05-08T14:49:48Z"
      message: 'the HPA was unable to compute the replica count: failed to get cpu
        utilization: unable to get metrics for resource cpu: unable to fetch metrics
        from resource metrics API: the server could not find the requested resource
        (get pods.metrics.k8s.io)'
      reason: FailedGetResourceMetric
      status: "False"
      type: ScalingActive
    currentMetrics: null
    currentReplicas: 1
    desiredReplicas: 0
kind: List
metadata:
  resourceVersion: ""

I'm also attaching logs from the operator pod:

keda-operator-logs.txt

If I edit the ScaledObject (for example, with kubectl edit scaledobject ...), KEDA's Reconcile() method in scaledobject_controller.go is re-run and updates the HPA resource with the expected changes. This seems to happen because the gRPC connection error is ignored while the gRPC service isn't available yet, and KEDA doesn't retry the gRPC call once the service becomes available.

JorTurFer commented 3 months ago

I'm going to try to reproduce this. From your example, I understand that I can deploy the ScaledObject and then, a few seconds later, the external gRPC server, and that would be close to your use case, right? I want to find where we are hiding the connection error.

vrok commented 3 months ago

@JorTurFer That's correct, the gRPC server with the external scaler should be down for some time after the ScaledObject is installed.

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
