kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event-driven scale for any container running in Kubernetes.
https://keda.sh

HPA does not update data as Keda Operator does #623

Closed APuertaSales closed 4 years ago

APuertaSales commented 4 years ago

What happened: I configured a ScaledObject for Kafka and it is not updating the HPA info. This is the ScaledObject configuration:

spec:
  cooldownPeriod: 300
  maxReplicaCount: 5
  minReplicaCount: 1
  pollingInterval: 30
  scaleTargetRef:
    deploymentName: absolutegrounds-helper-processors
  triggers:
  - metadata:
      brokerList: bootstrap.kafka11:9092
      consumerGroup: int.absolutegrounds.helper.processor.datapipeline
      lagThreshold: "500"
      topic: INT-AG_TASK_SOURCE_DP
    type: kafka

Adding debug log level shows that the consumer has a certain lag: {"level":"debug","ts":1581350124.0215647,"logger":"kafka_scaler","msg":"Group int.absolutegrounds.helper.processor.datapipeline has a lag of 7931 for topic INT-AG_TASK_SOURCE_DP and partition 2\n"}

But the HPA created shows this information:

NAME                                         REFERENCE                                       TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-absolutegrounds-helper-processors   Deployment/absolutegrounds-helper-processors   500/500 (avg)   1         5         5          3h35m

With this info:

kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-02-10T12:41:40Z","reason":"ReadyForNewScale","message":"recommended
      size matches current size"},{"type":"ScalingActive","status":"True","lastTransitionTime":"2020-02-10T12:41:40Z","reason":"ValidMetricFound","message":"the
      HPA was able to successfully calculate a replica count from external metric
      lagThreshold(\u0026LabelSelector{MatchLabels:map[string]string{deploymentName:
      absolutegrounds-helper-processors,},MatchExpressions:[],})"},{"type":"ScalingLimited","status":"False","lastTransitionTime":"2020-02-10T13:03:38Z","reason":"DesiredWithinRange","message":"the
      desired count is within the acceptable range"}]'
    autoscaling.alpha.kubernetes.io/current-metrics: '[{"type":"External","external":{"metricName":"lagThreshold","metricSelector":{"matchLabels":{"deploymentName":"absolutegrounds-helper-processors"}},"currentValue":"0","currentAverageValue":"500"}}]'
    autoscaling.alpha.kubernetes.io/metrics: '[{"type":"External","external":{"metricName":"lagThreshold","metricSelector":{"matchLabels":{"deploymentName":"absolutegrounds-helper-processors"}},"targetAverageValue":"500"}}]'
  creationTimestamp: "2020-02-10T12:23:13Z"
  labels:
    app.kubernetes.io/managed-by: keda-operator
    app.kubernetes.io/name: keda-hpa-absolutegrounds-helper-processors
    app.kubernetes.io/part-of: helpers-absolutegrounds-processor-intadaptive-lag
    app.kubernetes.io/version: 1.2.0
  name: keda-hpa-absolutegrounds-helper-processors
  namespace: intadaptive-cb
  ownerReferences:
  - apiVersion: keda.k8s.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: ScaledObject
    name: helpers-absolutegrounds-processor-intadaptive-lag
    uid: 1b5e600d-4c00-11ea-8a9e-005056a2317c
  resourceVersion: "388821268"
  selfLink: /apis/autoscaling/v1/namespaces/intadaptive-cb/horizontalpodautoscalers/keda-hpa-absolutegrounds-helper-processors
  uid: 1bb53f18-4c00-11ea-9fd9-ecebb8956d60
spec:
  maxReplicas: 5
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: absolutegrounds-helper-processors
status:
  currentReplicas: 5
  desiredReplicas: 5
  lastScaleTime: "2020-02-10T13:03:39Z"

What you expected to happen: Something like 7931/500 (avg) in the HPA. Instead it says that currentValue is 0 but currentAverageValue is 500, and it has stayed that way for a long time.

Anything else we need to know?: I noticed that the currentReplicas and desiredReplicas info is not updated:

keda-hpa-cancellation-helper-processors      Deployment/cancellation-helper-processors      0/500 (avg)     1         5         4          3h42m
cancellation-helper-processors-7f56f97c84-b6h2h                   1/1     Running            0          7d

Environment:

Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Keda version 1.2.0

zroubalik commented 4 years ago

@APuertaSales Could you please format the ScaledObject and HPA posted above so they are more readable? e.g. https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#fenced-code-blocks

APuertaSales commented 4 years ago

Of course, sorry!

zroubalik commented 4 years ago

@ppatierno might have an idea?

ppatierno commented 4 years ago

How many partitions does your topic have?

APuertaSales commented 4 years ago

The Keda log only shows 1 partition in the example given: "Group int.absolutegrounds.helper.processor.datapipeline has a lag of 7931 for topic INT-AG_TASK_SOURCE_DP and partition 2\n"

Kafka Tool shows 5 partitions (see attached screenshot).

And so does Kafka Manager (see attached screenshot).

ppatierno commented 4 years ago

Well, that log is not clear because it's emitted in a for loop that breaks as soon as it finds a lag higher than the lagThreshold. So it just says that you have a lag of 7931 on partition 2, but you could also have lag on the other partitions (which is not your case, judging from the Kafka Tool output). I think we should change this log somehow.
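
For reference, a simplified sketch of that loop shape (hypothetical names, not the actual kafka_scaler code):

package main

import "log"

// hasLagAboveThreshold is a simplified, hypothetical sketch of the loop
// described above (not the actual kafka_scaler code): it walks the partitions
// and returns as soon as one partition's lag exceeds the threshold, which is
// why the debug log only mentions a single partition even when others lag too.
func hasLagAboveThreshold(lagPerPartition map[int32]int64, threshold int64) bool {
	for partition, lag := range lagPerPartition {
		if lag > threshold {
			log.Printf("partition %d has a lag of %d (threshold %d)", partition, lag, threshold)
			return true
		}
	}
	return false
}

func main() {
	lags := map[int32]int64{0: 0, 1: 120, 2: 7931, 3: 0, 4: 0}
	log.Println("scaler active:", hasLagAboveThreshold(lags, 500))
}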

Regarding the 500/500 value shown by the HPA, I would rather expect 2500 because of this code snippet:

    // don't scale out beyond the number of partitions
    if (totalLag / s.metadata.lagThreshold) > int64(len(partitions)) {
        totalLag = int64(len(partitions)) * s.metadata.lagThreshold
    }

    metric := external_metrics.ExternalMetricValue{
        MetricName: metricName,
        Value:      *resource.NewQuantity(int64(totalLag), resource.DecimalSI),
        Timestamp:  metav1.Now(),
    }

It caps totalLag at the number of partitions times the lagThreshold (otherwise the extra consumers would be idle). So I would expect totalLag = 5 * 500 = 2500, and that value is passed as the external metric value to the HPA. It's strange that it reports 500/500 ...

Anyway, you have the correct number of consumer instances, which is 5 (because of your maxReplicaCount, but also because of the 5 partitions).
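
To make the numbers concrete, a small illustrative calculation (hypothetical helper; it assumes the HPA divides the external metric value by the current replica count to produce the average it displays):

package main

import "fmt"

// cappedLag mirrors the capping logic quoted above: the total lag passed to
// the HPA is never higher than lagThreshold * number of partitions.
func cappedLag(totalLag, lagThreshold, partitions int64) int64 {
	if totalLag/lagThreshold > partitions {
		return partitions * lagThreshold
	}
	return totalLag
}

func main() {
	// Illustrative numbers from this issue: threshold 500, 5 partitions, and
	// the 7931 lag reported for partition 2 used as a stand-in for total lag.
	total := cappedLag(7931, 500, 5) // capped to 5 * 500 = 2500
	replicas := int64(5)

	// Assumption: with a targetAverageValue the HPA displays
	// metricValue / currentReplicas, i.e. 2500 / 5 = 500, which would be
	// consistent with the 500/500 (avg) shown by kubectl get hpa.
	fmt.Printf("metric value: %d, average over %d replicas: %d\n", total, replicas, total/replicas)
}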

APuertaSales commented 4 years ago

Thanks @ppatierno, this explains everything. Sorry, I was expecting exact values, but as you say it makes no sense to run idle consumers. Do you have an explanation for the currentReplicas and desiredReplicas info not being aligned with the real number of replicas of the managed Deployment? This was the info about the HPA:

NAME                                      REFERENCE                                    TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-cancellation-helper-processors   Deployment/cancellation-helper-processors   0/500 (avg)   1         5         4          3h42m

And this was the list of pods, with only 1 deployed, not 4:

NAME                                                              READY   STATUS              RESTARTS   AGE
cancellation-helper-processors-7f56f97c84-b6h2h                   1/1     Running            0          7d

The HPA was working, the target was correct, and the number of replicas dropped to the minpods value, but the replica count shown was still the one reached when the lag was above the HPA target. It was not updated to match the real number of replicas for quite a long time. Thanks for your help!

ppatierno commented 4 years ago

@APuertaSales tbh Keda should not be involved in updating the HPA values; that is all handled by Kubernetes. I have no clue right now.

APuertaSales commented 4 years ago

Thanks @ppatierno, you are right, but it is strange because other HPAs we have configured do not show this misbehavior. We will continue using your application while monitoring the results. It is far easier to work with your solution than with our previous one. Regards, Alberto.