converge opened 4 months ago
Could you change the log level to debug in the operator and send the operator logs?
In addition to the logs, could you provide the ScaledObject YAML config too?
Sorry for the delay. I still have this issue and created a new GCP/GKE cluster specifically to debug it. I was able to reproduce the issue and got the logs from the exact moment when the current metric value in the HPA switched from <unknown> to -9223372036854775808m.
The logs from the controller when this switch happened:
2024-06-05T19:43:42Z DEBUG kafka_scaler Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60 {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:43:42Z DEBUG scale_handler Getting metrics and activity from scaler {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:43:42Z","value":"1"}], "activity": true, "scalerError": null}
2024-06-05T19:43:42Z DEBUG scale_handler Scaler for scaledObject is active {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:43:42Z INFO scaleexecutor Successfully updated ScaleTarget {"scaledobject.Name": "kafka-scaledobject", "scaledObject.Namespace": "app", "scaleTarget.Name": "kafka-source", "Original Replicas Count": 0, "New Replicas Count": 1}
2024-06-05T19:43:42Z DEBUG events Scaled sources.knative.dev/v1beta1.kafkasource app/kafka-source from 0 to 1 {"type": "Normal", "object": {"kind":"ScaledObject","namespace":"app","name":"kafka-scaledobject","uid":"01537a1d-1543-48ea-adce-e7c3069fff86","apiVersion":"keda.sh/v1alpha1","resourceVersion":"314645"}, "reason": "KEDAScaleTargetActivated"}
2024-06-05T19:44:12Z DEBUG kafka_scaler with topic name [knative-demo-topic] the list of topic metadata is [0xc000a1b400] {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:12Z DEBUG kafka_scaler Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60 {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:12Z DEBUG scale_handler Getting metrics and activity from scaler {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:12Z","value":"1"}], "activity": true, "scalerError": null}
2024-06-05T19:44:12Z DEBUG scale_handler Scaler for scaledObject is active {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:44:42Z DEBUG kafka_scaler with topic name [knative-demo-topic] the list of topic metadata is [0xc000ae9400] {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:42Z DEBUG kafka_scaler Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60 {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:42Z DEBUG scale_handler Getting metrics and activity from scaler {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:42Z","value":"1"}], "activity": true, "scalerError": null}
2024-06-05T19:44:42Z DEBUG scale_handler Scaler for scaledObject is active {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:44:52Z DEBUG kafka_scaler with topic name [knative-demo-topic] the list of topic metadata is [0xc000460050] {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:52Z DEBUG kafka_scaler Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60 {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:52Z DEBUG scale_handler Getting metrics from trigger {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "trigger": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:52Z","value":"1"}], "scalerError": null}
2024-06-05T19:44:52Z DEBUG grpc_server Providing metrics {"scaledObjectName": "kafka-scaledobject", "scaledObjectNamespace": "app", "metrics": "&ExternalMetricValueList{ListMeta:{ <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-knative-group-topics,MetricLabels:map[string]string{},Timestamp:2024-06-05 19:44:52.805581005 +0000 UTC m=+3545.120570672,WindowSeconds:nil,Value:{{1000 -3} {<nil>} DecimalSI},},},}"}
2024-06-05T19:45:08Z DEBUG kafka_scaler with topic name [knative-demo-topic] the list of topic metadata is [0xc000550c30] {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:45:08Z DEBUG kafka_scaler Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60 {"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:45:08Z DEBUG scale_handler Getting metrics from trigger {"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "trigger": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:45:08Z","value":"1"}], "scalerError": null}
2024-06-05T19:45:08Z DEBUG grpc_server Providing metrics {"scaledObjectName": "kafka-scaledobject", "scaledObjectNamespace": "app", "metrics": "&ExternalMetricValueList{ListMeta:{ <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-knative-group-topics,MetricLabels:map[string]string{},Timestamp:2024-06-05 19:45:08.16711765 +0000 UTC m=+3560.482107310,WindowSeconds:nil,Value:{{1000 -3} {<nil>} DecimalSI},},},}"}
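Side note on reading these logs: the Value:{{1000 -3} {<nil>} DecimalSI} in the grpc_server lines is the struct dump of a Kubernetes resource.Quantity holding 1000 at scale -3, i.e. 1, so the operator side is serving the expected lag of 1. A tiny sketch of that encoding, assuming the standard k8s.io/apimachinery/pkg/api/resource package (illustrative, not taken from KEDA's code):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// 1000 milli-units with DecimalSI formatting is the quantity "1",
	// matching the metric value KEDA reports in the logs above.
	q := resource.NewMilliQuantity(1000, resource.DecimalSI)
	fmt.Println(q.String()) // prints: 1
}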
The ScaledObject:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: app
  annotations:
    scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      name: notify-friends
  scaleTargetRef:
    apiVersion: sources.knative.dev/v1beta1
    name: kafka-source
    kind: kafkasource
  pollingInterval: 30
  minReplicaCount: 1
  maxReplicaCount: 11
  idleReplicaCount: 0
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: pkc-<...>.gcp.confluent.cloud:9092
        consumerGroup: knative-group
        topic:
        lagThreshold: '60'
        offsetResetPolicy: earliest
        scaleToZeroOnInvalidOffset: "false"
        tls: enable
        sasl: plaintext
      authenticationRef:
        name: keda-kafka-credentials
Found people with a similar issue: https://kubernetes.slack.com/archives/CKZJ36A5D/p1709761505122509
@SpiritZhou @dttung2905 yesterday I created an EKS cluster with the same setup/versions as the GCP one, and it works perfectly. Can you think of anything that could be different between GCP and AWS? Any kind of blocker or anything else that could be causing the issue?
I have also followed the k8s events and couldn't find anything wrong in GCP. I wanted to look at the horizontal pod autoscaler logs, but couldn't find them.
I did some more tests; I believe the Kafka connection is OK. I was able to produce and consume messages inside the pod using the Go Sarama library (the same library the Knative Kafka extension uses).
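For illustration, a minimal Sarama producer connectivity check of this kind could look like the sketch below; the broker address, credentials, topic, and the exact Sarama import path are placeholders rather than values from this issue:

package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// Confluent Cloud style connection: TLS plus SASL/PLAIN credentials.
	cfg.Net.TLS.Enable = true
	cfg.Net.SASL.Enable = true
	cfg.Net.SASL.Mechanism = sarama.SASLTypePlaintext
	cfg.Net.SASL.User = "<api-key>"
	cfg.Net.SASL.Password = "<api-secret>"
	// Required for SyncProducer.
	cfg.Producer.Return.Successes = true

	producer, err := sarama.NewSyncProducer([]string{"pkc-<...>.gcp.confluent.cloud:9092"}, cfg)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer producer.Close()

	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "knative-demo-topic",
		Value: sarama.StringEncoder("connectivity test"),
	})
	if err != nil {
		log.Fatalf("produce: %v", err)
	}
	log.Printf("produced to partition %d at offset %d", partition, offset)
}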
I created a Debian pod in Kubernetes/GCP/GKE, attached to it, and ran:
./kafka-consumer-groups.sh --bootstrap-server pkc-<...>.<...>.gcp.confluent.cloud:9092 --command-config config.properties --describe --group testing-group
Consumer group 'testing-group' has no active members.
# no messages produced
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
testing-group knative-demo-topic 1 1 1 0 - - -
testing-group knative-demo-topic 8 1 1 0 - - -
testing-group knative-demo-topic 22 15 15 0 - - -
testing-group knative-demo-topic 29 1 1 0 - - -
# no messages consumed
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
testing-group knative-demo-topic 1 1 30 29 - - -
testing-group knative-demo-topic 8 1 36 35 - - -
testing-group knative-demo-topic 22 15 54 39 - - -
testing-group knative-demo-topic 29 1 38 37 - - -
# after messages being consumed
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
testing-group knative-demo-topic 1 30 30 0 - - -
testing-group knative-demo-topic 8 36 36 0 - - -
testing-group knative-demo-topic 22 54 54 0 - - -
testing-group knative-demo-topic 29 38 38 0 - - -
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/app/s0-kafka-knative-group-topics?labelSelector=scaledobject.keda.sh%2Fname%3Dkafka-scaledobject"
{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-13T12:26:02Z",
---> "value":"1009"}]} <---
k get hpa -n app
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
notify-friends kafkasource/kafka-source -9223372036854775808m/60 (avg) 1 11 11 3m1s
Another thing I noticed (not sure if it's relevant) is that the metric is not listed here:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
{
"kind": "APIResourceList",
"apiVersion": "v1",
"groupVersion": "external.metrics.k8s.io/v1beta1",
"resources": [
{
"name": "externalmetrics",
"singularName": "",
"namespaced": true,
"kind": "ExternalMetricValueList",
"verbs": [
"get"
]
}
]
}
but I can query it, as shown above.
You should try to query the specific metric for the ScaledObject; see the examples here: https://keda.sh/docs/2.14/operate/metrics-server/#querying-metrics-exposed-by-keda-metrics-server
Thanks @zroubalik, I'm able to query the metric, and I can also see the metric value via OpenTelemetry/Datadog. The metric value is correct (the expected value), not the unexpected -9223372036854775808m value.
The -9223372036854775808m value is a result of converting +Inf to int64.
This code:
fmt.Println("+Inf", int64(math.Inf(1)))
fmt.Println("-Inf", int64(math.Inf(-1)))
fmt.Println("NaN", int64(math.NaN()))
prints
+Inf 9223372036854775807
-Inf -9223372036854775808
NaN 0
on Apple M1. On the same Apple M1, but when compiling with GOARCH=amd64, it prints:
+Inf 9223372036854775807
-Inf -9223372036854775808
NaN -9223372036854775808
There are places in the HPA controller where such a conversion happens. One case that I was investigating is when using KEDA to target a custom resource, with target type AverageValue. In that case, the following method is responsible for computing usage:
If statusReplicas (read from the /status subresource of the target resource) is 0, then the computed usage will be wrong (-9223372036854775808m): https://github.com/kubernetes/kubernetes/blob/a145f1508d2180c93ba292849f0bf4bf6120fae8/pkg/controller/podautoscaler/replica_calculator.go#L374 (notice the conversion from float64 to int64 there).
This usage value is then stored in the HPA's metric status: https://github.com/kubernetes/kubernetes/blob/985ef265511370655452157fb5c583d5bf7a45e4/pkg/controller/podautoscaler/horizontal.go#L687-L703
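A minimal sketch of the arithmetic from the first point, assuming the reported usage is essentially the total metric value divided by statusReplicas (perPodUsage, totalMetric, and statusReplicas are illustrative names, not the actual identifiers in replica_calculator.go):

package main

import (
	"fmt"
	"math"
)

// perPodUsage stands in for the per-pod usage computation linked above:
// the total metric value divided by the replica count read from the
// target's /status subresource, then converted back to int64.
func perPodUsage(totalMetric int64, statusReplicas int32) int64 {
	return int64(math.Ceil(float64(totalMetric) / float64(statusReplicas)))
}

func main() {
	fmt.Println(perPodUsage(1, 1)) // 1: the expected per-pod lag
	// With statusReplicas == 0 the division yields +Inf; converting that
	// float64 back to int64 is implementation-dependent, and in the HPA
	// status observed above it shows up as the -9223372036854775808 sentinel.
	fmt.Println(perPodUsage(1, 0))
}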
On EKS, when we fixed status.replicas on the target resource, things started to work correctly for us. On GKE, we still see the -9223372036854775808m value and don't understand why.
I've followed up on this issue.
We've seen that the actual calculation in the scaler is correct; however, the HPA displays the value wrong when queried.
@pstibrany As you pointed out, the issue seems to be how the usage is calculated. I still have pending to try using Value instead of AverageValue, since the former won't do any calculation with the metric obtained from the Kafka cluster. If the value is then displayed correctly, we will have found the reason.
Looking at the source code of that method, it turns out that even if the number of replicas is 0, the replica calculation is correct, because in the end statusReplicas is not used to calculate the replicas, only to prevent changing the replica count if the change in the new value is below the toleration. And as the toleration is calculated from 1-(-9223372036854775808), the result is a positive number that is always over the toleration threshold 🤣.
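A rough sketch of that decision path, again with illustrative names (desiredReplicas, totalMetric, targetPerPod, tolerance) rather than the exact identifiers from the HPA code: with statusReplicas == 0 and a positive metric, the ratio becomes +Inf, the toleration check always passes, and the replica count is recomputed from the raw metric and the target, so it stays correct:

package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches the behaviour described above: statusReplicas only
// gates the toleration check, while the new replica count itself is derived
// from the metric and the target.
func desiredReplicas(totalMetric, targetPerPod int64, statusReplicas int32, tolerance float64) int32 {
	replicaCount := statusReplicas
	// With statusReplicas == 0 and a positive metric this ratio is +Inf,
	// so |1 - usageRatio| always exceeds the toleration threshold.
	usageRatio := float64(totalMetric) / (float64(targetPerPod) * float64(statusReplicas))
	if math.Abs(1.0-usageRatio) > tolerance {
		replicaCount = int32(math.Ceil(float64(totalMetric) / float64(targetPerPod)))
	}
	return replicaCount
}

func main() {
	// Lag of 1, lagThreshold of 60, status.replicas of 0: despite the bogus
	// ratio, the desired replica count is still the expected 1.
	fmt.Println(desiredReplicas(1, 60, 0, 0.1))
}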
Knowing that this is the root cause gives me peace of mind, thank you!
Now the remaining question is why scale.Status.Replicas is 0 in GCP, but that's another path to investigate.
That's assuming GCP did not deploy their own custom version of the autoscaler that introduces something else that could be messing things up.
Report
KEDA is unable to retrieve metrics correctly from a ScaledObject/ScaleTarget using a Kafka trigger when deployed to a GKE cluster (it works locally).
Expected Behavior
When the HPA calculates the current metric value, it should not return -9223372036854775808m, but a valid Kafka lag.
Actual Behavior
When the Kafka ScaledObject is deployed to GKE:
Steps to Reproduce the Problem
Logs from KEDA operator
There are no errors or warnings in the KEDA operator logs.
KEDA Version
2.13.1
Kubernetes Version
1.27
Platform
Google Cloud
Scaler Details
Kafka
Anything else?
No response