[Kafka]External metrics server not reporting the correct value to scale up

parjun8840 commented 2 years ago


Expected Behavior

I have defined lagThreshold: "10" in the ScaledObject:

`apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: kafka
  labels:
    deploymentName: kafka-consumer-deployment # Required Name of the deployment we want to scale.
spec:
  scaleTargetRef:
    name: kafka-ap
  pollingInterval: 5
  minReplicaCount: 1 #Optional Default 0
  maxReplicaCount: 3 #Optional Default 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: 10.100.117.76:9092
        consumerGroup: order-shipper
        topic: preorder
        lagThreshold: "10"`

I pushed around 33 messages, which caused a lag of 33. With a lag of 33, or any value above 10, on the Kafka consumer group, the Pods should scale up.

`sh-4.4$ ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group order-shipper

GROUP          TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                              HOST             CLIENT-ID
order-shipper  preorder  0          55              88              33   kafka-python-2.0.2-cc83bb42-afe7-4f34-b37b-d944859356b6  /192.168.171.85  kafka-python-2.0.2
sh-4.4$`
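As a quick cross-check of the LAG column, here is a minimal sketch of how consumer-group lag is usually derived from the two offset columns. The `rows` list and the `total_lag` helper are illustrative names, not part of Kafka or KEDA; the numbers are copied from the describe output above.

```python
# (partition, current_offset, log_end_offset) copied from the describe output.
rows = [(0, 55, 88)]


def total_lag(rows):
    """Sum (log end offset - committed offset) over all partitions.

    Negative differences are clamped to zero so a stale end offset
    cannot reduce the total.
    """
    return sum(max(end - current, 0) for _, current, end in rows)


print(total_lag(rows))  # 33
```

With a single partition the total equals that partition's lag, so 33 is the value one would expect the scaler to see here.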

The API request to "/apis/external.metrics.k8s.io/v1beta1/namespaces/kafka/s0-kafka-preorder" never reports a value greater than "10":

{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-kafka-preorder","metricLabels":null,"timestamp":"2022-08-25T22:59:38Z","value":"10"}]}

It should instead be {"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-kafka-preorder","metricLabels":null,"timestamp":"2022-08-25T22:59:38Z","value":"33"}]}
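For anyone comparing these responses repeatedly, a small sketch of pulling the metric value out of the response body may help. The helper names `parse_quantity` and `metric_values` are made up for illustration, and the quantity parsing only handles plain numbers and the milli suffix (`m`), which is an assumption about the values this endpoint returns here, not a full Kubernetes quantity parser.

```python
import json


def parse_quantity(q: str) -> float:
    """Convert a plain ("10") or milli-unit ("33500m") quantity to a float."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000
    return float(q)


def metric_values(payload: str) -> dict:
    """Map metricName -> numeric value for every item in the response."""
    doc = json.loads(payload)
    return {item["metricName"]: parse_quantity(item["value"]) for item in doc["items"]}


# Response body copied from the report above.
response = (
    '{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1",'
    '"metadata":{},"items":[{"metricName":"s0-kafka-preorder","metricLabels":null,'
    '"timestamp":"2022-08-25T22:59:38Z","value":"10"}]}'
)

print(metric_values(response))  # {'s0-kafka-preorder': 10.0}
```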

ScaledObject:

arjunpandey$ kubectl get scaledobject -nkafka
NAME                 SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS   AUTHENTICATION   READY   ACTIVE   FALLBACK   AGE
kafka-scaledobject   apps/v1.Deployment   kafka-ap          1     3     kafka                       True    True     False      27m

HPA:

`arjunpandey$ kubectl describe hpa -nkafka
Name:               keda-hpa-kafka-scaledobject
Namespace:          kafka
Labels:             app.kubernetes.io/managed-by=keda-operator
                    app.kubernetes.io/name=keda-hpa-kafka-scaledobject
                    app.kubernetes.io/part-of=kafka-scaledobject
                    app.kubernetes.io/version=2.7.1
                    deploymentName=kafka-consumer-deployment
                    scaledobject.keda.sh/name=kafka-scaledobject
Annotations:
CreationTimestamp:  Fri, 26 Aug 2022 07:32:51 +0900
Reference:          Deployment/kafka-ap
Metrics:            ( current / target )
  "s0-kafka-preorder" (target average value):  10 / 10
Min replicas:       1
Max replicas:       3
Deployment pods:    1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from external metric s0-kafka-preorder(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: kafka-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age  From                       Message
  ----    ------             ---  ----                       -------
  Normal  SuccessfulRescale  20m  horizontal-pod-autoscaler  New size: 2; reason: external metric s0-kafka-preorder(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: kafka-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  Normal  SuccessfulRescale  14m  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
arjunpandey$`

Actual Behavior

In the output of kubectl describe hpa -nkafka, the metric is shown as:

Metrics: ( current / target ) "s0-kafka-preorder" (target average value): 10 / 10

It should be

Metrics: ( current / target ) "s0-kafka-preorder" (target average value): 33 / 10

The Pod count should have scaled up from 1 to 2.
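To make the expectation concrete, here is a rough sketch of how an HPA with an average-value target on an external metric arrives at a replica count: the metric value divided by the target, rounded up, then clamped to the min/max range. The function name is illustrative, and the sketch deliberately ignores the HPA's tolerance band and stabilization windows.

```python
import math


def desired_replicas(metric_value, target_average_value, min_replicas, max_replicas):
    """ceil(metric / target), clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(metric_value / target_average_value)
    return max(min_replicas, min(max_replicas, raw))


# Reported value from the HPA status above: 10 / 10 keeps a single Pod.
print(desired_replicas(10, 10, min_replicas=1, max_replicas=3))  # 1

# If the real lag of 33 were reported, the same formula would ask for more Pods
# (ceil(33 / 10) = 4, clamped to maxReplicaCount 3).
print(desired_replicas(33, 10, min_replicas=1, max_replicas=3))  # 3
```

Either way, a reported value of 10 against a target of 10 cannot trigger a scale-up, which matches the "1 current / 1 desired" line in the HPA status.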

Steps to Reproduce the Problem

See the description above. Any help is highly appreciated :-)

It works for the scenarios:

lagThreshold: "10" and actual lag 19
lagThreshold: "30" and actual lag 47.

But it does not work for lagThreshold: "10" with an actual lag greater than 19 (I tested 23, 33, and 47; none of them scaled up). I have also tested the latest version of KEDA (2.8.1) and see the same problem.

Specifications

KEDA Version: 2.7.2
Platform & Version: eks.5
Kubernetes Version: 1.22
Scaler(s): keda.sh/v1alpha1

Big thanks for developing such a wonderful, much-awaited product :-)

raghvendrak-vn commented 1 year ago

Any update on this? We saw something similar: the metrics server was reporting the metric as 1.

It looked like the sarama client didn't query all partitions under the hood, just the first one?

tonetechnician commented 11 months ago

Yes, I'm also getting this issue in my production stack. Super strange. It seems that the lag threshold I set is just multiplied by the max number of replicas. Then I have my pods scaling up and down constantly. Very odd behaviour and a bit difficult to debug.

When I query the metrics server for the metric, I get a value equal to the threshold multiplied by the number of replicas.

My trigger config is

What's strange is that when I test this in a local k8s cluster with kind, the metric values are reported correctly. So the only difference right now is that my production stack is using AWS MSK, whilst kind is using a local deployment of Kafka.

tonetechnician commented 10 months ago

Just mentioning here that when I set allowIdleConsumers to true in the kafka trigger, the reported metric matches the lag. I'm still not quite sure whether the other behaviour is expected. Maybe there is a logic error somewhere, or I'm misunderstanding something in the implementation.
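One hypothesis that would fit every observation in this thread, offered as a sketch rather than a confirmed reading of the scaler source: when allowIdleConsumers is false, the scaler may cap the reported lag at (number of partitions × lagThreshold) so that it never asks for more consumers than there are partitions. The function below is a hypothetical model of that rule, with integer division assumed; `reported_lag` and its parameters are names made up for illustration.

```python
def reported_lag(total_lag, lag_threshold, partitions, allow_idle_consumers=False):
    """Hypothetical model of the value exposed to the HPA.

    Assumption: without idle consumers, lag is capped once the integer
    ratio lag // threshold exceeds the partition count.
    """
    if not allow_idle_consumers and total_lag // lag_threshold > partitions:
        return partitions * lag_threshold
    return total_lag


# Scenarios reported earlier in this thread, with the single partition
# shown in the kafka-consumer-groups output.
for threshold, lag in [(10, 19), (30, 47), (10, 23), (10, 33), (10, 47)]:
    print(threshold, lag, reported_lag(lag, threshold, partitions=1))

# With allowIdleConsumers set to true the cap would not apply.
print(reported_lag(33, 10, partitions=1, allow_idle_consumers=True))  # 33
```

Under this model the two working cases (threshold 10 with lag 19, threshold 30 with lag 47) pass through unchanged, while lags of 23, 33, and 47 against a threshold of 10 all collapse to 10. It would also explain a metric that looks like the threshold multiplied by a fixed count, if that count is the partition number rather than the replica number.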
