Closed: duydo-ct closed this issue 8 months ago.
It seems that the KEDA metrics server can't reach the operator pod. Are you using network policies or anything else to manage networking within the cluster?
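As a side note, one quick way to check for such policies:
kubectl get networkpolicies --all-namespaces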
Hi @JorTurFer, KEDA is deployed via Helm on GKE, and I have allowed the traffic in the firewall.
k get apiservices
NAME                              SERVICE                                        AVAILABLE   AGE
...
v1beta1.external.metrics.k8s.io   keda-system/keda-operator-metrics-apiserver   True        40d
v1beta1.metrics.k8s.io            kube-system/metrics-server                     True        2y30d
...
And I described the external metrics APIService:
k describe apiservices v1beta1.external.metrics.k8s.io
Name:         v1beta1.external.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/component=operator
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=v1beta1.external.metrics.k8s.io
              app.kubernetes.io/part-of=keda-operator
              app.kubernetes.io/version=2.12.1
              helm.sh/chart=keda-2.12.1
Annotations:  meta.helm.sh/release-namespace: keda-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2024-01-16T10:47:44Z
  Resource Version:    819888502
  UID:                 77d55c9b-3a06-430f-972d-3aeacb7b70dc
Spec:
  Ca Bundle:               LS0tLSxxxxxxxxxxxxxxxxxxx
  Group:                   external.metrics.k8s.io
  Group Priority Minimum:  100
  Service:
    Name:       keda-operator-metrics-apiserver
    Namespace:  keda-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2024-02-22T03:12:07Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>
Can you give me a direction to troubleshoot this? I went through the troubleshooting guide on the website, but it didn't really apply to my case.
This issue happens because the KEDA pods can't communicate with each other. Do you have any network policy in the cluster blocking internal traffic? KEDA's metrics server pod can't reach KEDA's operator.
If you deploy a test pod in the keda-system namespace and run a curl from there to keda-operator.keda-system.svc.cluster.local:9666, does it work?
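For example, a one-shot way to run that check (a minimal sketch; the pod name curl-test and the curlimages/curl image are just placeholders):
kubectl run curl-test -n keda-system --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -v keda-operator.keda-system.svc.cluster.local:9666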
Hi @JorTurFer, so I tested two cases with the Helm chart:
- Case 1: keep the default cluster domain
# -- Kubernetes cluster domain
clusterDomain: cluster.local
Then I executed curl from a pod in the same keda-system namespace and got:
/workspace # nslookup keda-operator.keda-system.svc.cluster.local
Server: 169.254.169.254
Address: 169.254.169.254#53
** server can't find keda-operator.keda-system.svc.cluster.local: NXDOMAIN
/workspace # curl keda-operator.keda-system.svc.cluster.local:9666
curl: (6) Could not resolve host: keda-operator.keda-system.svc.cluster.local
- Case 2: change `clusterDomain:` to a new value
clusterDomain: ct.dev
because my GKE cluster uses GCP Cloud DNS. Then `curl` gave:
/workspace # curl keda-operator.keda-system.svc.ct.dev:9666
curl: (6) Could not resolve host: keda-operator.keda-system.svc.ct.dev
So I checked the logs of `keda-operator-metrics-apiserver`:
W0226 09:42:36.952229 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda-system.svc.ct.dev:9666", ServerName: "keda-operator.keda-system.svc.ct.dev:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup keda-operator.keda-system.svc.ct.dev on 169.254.169.254:53: no such host"
Thank you bro
So, is the service not available? What do you see as output from kubectl get svc -o wide -n keda-system?
Hi @JorTurFer, here is the output:
kubectl get svc -o wide -n keda-system
NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)            AGE   SELECTOR
keda-admission-webhooks           ClusterIP   10.99.193.232   <none>        443/TCP            40d   app=keda-admission-webhooks
keda-operator                     ClusterIP   10.99.194.245   <none>        9666/TCP           40d   app=keda-operator
keda-operator-metrics-apiserver   ClusterIP   10.99.202.86    <none>        443/TCP,8080/TCP   40d   app=keda-operator-metrics-apiserver
I can see the service there, so IDK why the host can't be resolved 🤔
Maybe it's something related to DNS resolution in GKE? Could you try this: curl keda-operator.keda-system:9666?
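As an extra check (not part of the suggestion above), curling the service's ClusterIP from the test pod separates a DNS problem from a connectivity problem, using the IP shown in the kubectl get svc output:
curl -v keda-operator.keda-system:9666
curl -v 10.99.194.245:9666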
The self-generated certificate has these configurations: https://github.com/kedacore/keda/blob/b3f554899d610cc9d7c5f6a8f94b404ce829876d/pkg/certificates/certificate_manager.go#L100-L108
Why do I mention it? Because if curl works using just service.namespace, you can override the value KEDA uses via the metrics-service-address argument: just set --metrics-service-address=keda-operator.keda-system:9666 on the metrics server and it will use the new host without the cluster domain.
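A rough sketch of what that override could look like on the metrics server Deployment (the surrounding fields and container name are assumed; only the extra argument comes from the comment above):
containers:
  - name: keda-operator-metrics-apiserver
    args:
      - --metrics-service-address=keda-operator.keda-system:9666   # bypasses the cluster-domain-based default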
Hi @JorTurFer, so I had misconfigured clusterDomain. I updated it again like this:
# -- Kubernetes cluster domain
clusterDomain: gke1.ct.dev
Now I checked again with curl, telnet, and nslookup:
nettools:/workspace# curl keda-operator.keda-system.svc.gke1.ct.dev:9666
curl: (52) Empty reply from server
nettools:/workspace# nslookup keda-operator.keda-system.svc.gke1.ct.dev
Server: 169.254.169.254
Address: 169.254.169.254#53
Non-authoritative answer:
Name: keda-operator.keda-system.svc.gke1.ct.dev
Address: 10.99.194.245
nettools:/workspace# telnet keda-operator.keda-system.svc.gke1.ct.dev 9666
Connected to keda-operator.keda-system.svc.gke1.ct.dev
I think DNS works, but I don't know why the curl reply came back empty. Any idea bro?
The gke1 part was missing in your previous message https://github.com/kedacore/keda/issues/5527#issuecomment-1963699347 and I bet that it's the root cause xD
Could you try updating KEDA to set the cluster domain to gke1.ct.dev? You might have to delete the secret kedaorg-certs within KEDA's namespace (and restart the KEDA components).
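Put together, those steps might look roughly like this (a sketch; the Helm release name keda and the deployment names are assumptions based on this thread):
helm upgrade keda kedacore/keda -n keda-system --reuse-values --set clusterDomain=gke1.ct.dev
kubectl delete secret kedaorg-certs -n keda-system
kubectl rollout restart deployment/keda-operator deployment/keda-operator-metrics-apiserver -n keda-system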
Hi @JorTurFer,
Nice, bro. I deleted the secret kedaorg-certs and restarted the deployments:
kubectl delete secret kedaorg-certs -n keda-system
secret "kedaorg-certs" deleted
The keda-operator-metrics-apiserver logs look okay now:
I0226 15:26:59.757886 1 provider.go:81] keda_metrics_adapter/provider "msg"="KEDA Metrics Server received request for external metrics" "metric name"="s0-prometheus" "metricSelector"="scaledobject.keda.sh /name=ct-logic-uni-ad-listing-consumer" "namespace"="default"
Then I applied the ScaledObject again and I don't see it scale. I checked the HPA again and there were no errors in the logs, but I'm still not sure why it won't scale my deployment.
I described the ScaledObject:
π ~> kubectl describe ScaledObject ct-logic-uni-ad-listing-consumer
Name:         ct-logic-uni-ad-listing-consumer
Namespace:    default
Labels:       app=ct-logic-uni-ad-listing-consumer
              app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: ct-logic-uni-ad-listing-consumer
              meta.helm.sh/release-namespace: default
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2024-02-26T15:29:23Z
  Finalizers:
    finalizer.keda.sh
  Generation:        1
  Resource Version:  827148808
  UID:               f80f6f14-3015-4e74-b0e3-69a83db30c61
Spec:
  Cooldown Period:    100
  Max Replica Count:  9
  Min Replica Count:  1
  Polling Interval:   100
  Scale Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         ct-logic-uni-ad-listing-consumer
  Triggers:
    Metadata:
      Ignore Null Values:  true
      Query:               sum(ad_listing_system_logic_priority_queue_tasks_counter{deployment="ct-logic-uni-ad-listing-metrics"}[2m])
      Server Address:      https://vmselect.domain/select/0/prometheus
      Threshold:           2
    Type:                  prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  No fallbacks are active on this scaled object
    Reason:   NoFallbackFound
    Status:   False
    Type:     Fallback
    Status:   Unknown
    Type:     Paused
  External Metric Names:
    s0-prometheus
  Health:
    s0-prometheus:
      Number Of Failures:  0
      Status:              Happy
  Hpa Name:                keda-hpa-ct-logic-uni-ad-listing-consumer
  Original Replica Count:  1
  Scale Target GVKR:
    Group:     apps
    Kind:      Deployment
    Resource:  deployments
    Version:   v1
  Scale Target Kind:       apps/v1.Deployment
Events:                    <none>
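For readability, the same ScaledObject expressed as a manifest looks roughly like this (reconstructed from the describe output above; nothing is added beyond what it shows):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ct-logic-uni-ad-listing-consumer
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ct-logic-uni-ad-listing-consumer
  minReplicaCount: 1
  maxReplicaCount: 9
  pollingInterval: 100
  cooldownPeriod: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://vmselect.domain/select/0/prometheus
        query: sum(ad_listing_system_logic_priority_queue_tasks_counter{deployment="ct-logic-uni-ad-listing-metrics"}[2m])
        threshold: "2"
        ignoreNullValues: "true"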
I described the HPA:
π ~> kubectl describe hpa keda-hpa-ct-logic-uni-ad-listing-consumer
Name: keda-hpa-ct-logic-uni-ad-listing-consumer
Namespace: default
Labels: app=ct-logic-uni-ad-listing-consumer
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=keda-hpa-ct-logic-uni-ad-listing-consumer
Annotations: meta.helm.sh/release-name: ct-logic-uni-ad-listing-consumer
meta.helm.sh/release-namespace: default
CreationTimestamp: Mon, 26 Feb 2024 22:29:53 +0700
Reference: Deployment/ct-logic-uni-ad-listing-consumer
Metrics: ( current / target )
"s0-prometheus" (target average value): 0 / 2
Min replicas: 1
Max replicas: 9
Deployment pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},})
ScalingLimited True TooFewReplicas the desired replica count is less than the minimum replica count
Events: <none>
I'm not sure if there is any missing config?
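As an extra check (not mentioned in the thread), the value KEDA serves for this metric can be queried directly from the external metrics API, using the metric name and label selector shown in the HPA output above:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-prometheus?labelSelector=scaledobject.keda.sh%2Fname%3Dct-logic-uni-ad-listing-consumer"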
If we don't have communication issues any more, we are moving forward indeed! 😄
Could you try copying the exact query into your Prometheus? sum(ad_listing_system_logic_priority_queue_tasks_counter{deployment="ct-logic-uni-ad-listing-metrics"}[2m])
The picture doesn't show the same query, as it has no filters. The reason I ask is that ignoreNullValues: true can hide querying errors by converting null values into 0, which could fit your case (you can just try removing the property temporarily and, if I'm right, you will see errors in the KEDA operator).
Hi @JorTurFer, I checked again and the query was indeed incorrect, bro:
sum(ad_listing_system_logic_priority_queue_tasks_counter{deployment="ct-logic-uni-ad-listing-metrics"}[2m])
I updated it to the new query:
sum(ad_listing_system_logic_priority_queue_tasks_counter{app="ct-logic-uni-ad-listing-metrics"}[1m])
I checked again and it worked as expected. The new HPA status and events:
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ScaleDownStabilized recent recommendations were higher than current one, applying the highest recent recommendation
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},})
ScalingLimited True TooManyReplicas the desired replica count is more than the maximum replica count
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 16m horizontal-pod-autoscaler New size: 2; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 7m53s horizontal-pod-autoscaler New size: 12; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) below target
Normal SuccessfulRescale 6m20s horizontal-pod-autoscaler New size: 8; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) below target
Normal SuccessfulRescale 6m6s horizontal-pod-autoscaler New size: 2; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) below target
Normal SuccessfulRescale 3m41s (x2 over 15m) horizontal-pod-autoscaler New size: 4; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 3m25s (x2 over 15m) horizontal-pod-autoscaler New size: 8; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 3m9s (x2 over 15m) horizontal-pod-autoscaler New size: 16; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 2m53s (x2 over 15m) horizontal-pod-autoscaler New size: 20; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Nice! I'll close the issue as it looks solved; let me know if there is any other issue and I'll reopen it.
Report
While setting up KEDA on GKE, the ScaledObject for VictoriaMetrics was not able to scale up the target deployment.
Expected Behavior
To scale up the target deployment depending upon the unacknowledged messages.
Actual Behavior
The HPA can't scale the deployment.
Steps to Reproduce the Problem
Logs from KEDA operator
keda-operator:
keda-operator-metrics-apiserver:
HPA:
KEDA Version
2.12.1
Kubernetes Version
1.26
Platform
Google Cloud
Scaler Details
VictoriaMetrics
Anything else?
No response