kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Unable to get external metric on GKE #5527

Closed duydo-ct closed 8 months ago

duydo-ct commented 8 months ago

Report

After setting up KEDA on GKE, a ScaledObject for VictoriaMetrics was unable to scale up the target deployment.

Expected Behavior

The target deployment should scale up based on the number of unacknowledged messages.

Actual Behavior

The HPA cannot scale the target deployment.

Steps to Reproduce the Problem

  1. Setting up KEDA on GKE
  2. Create a scaledObject
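
For reference, a minimal ScaledObject of the kind involved here might look like the sketch below. VictoriaMetrics is queried through KEDA's prometheus scaler (VictoriaMetrics exposes a Prometheus-compatible API); the namespace, target name, server address, and threshold are assumptions, while the ScaledObject name, query, `ignoreNullValues`, and max replica count are taken from later in this thread.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ct-logic-uni-ad-listing-consumer       # name taken from the HPA events in this thread
  namespace: default                           # assumption
spec:
  scaleTargetRef:
    name: ct-logic-uni-ad-listing-consumer     # target deployment name is an assumption
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://victoria-metrics.monitoring.svc:8428   # address is an assumption
        query: sum(ad_listing_system_logic_priority_queue_tasks_counter{app="ct-logic-uni-ad-listing-metrics"}[1m])
        threshold: "10"                        # assumption
        ignoreNullValues: "true"
```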

Logs from KEDA operator

KEDA Version

2.12.1

Kubernetes Version

1.26

Platform

Google Cloud

Scaler Details

VictoriaMetrics

Anything else?

No response

JorTurFer commented 8 months ago

It seems that the KEDA metrics server can't reach the operator pod. Are you using network policies or something similar to manage the networking within the cluster?

duydo-ct commented 8 months ago

Hi @JorTurFer, KEDA was deployed via Helm on GKE, and I have allowed it through the firewall.

k get apiservices
NAME                              SERVICE                                        AVAILABLE   AGE
...
v1beta1.external.metrics.k8s.io   keda-system/keda-operator-metrics-apiserver    True        40d
v1beta1.metrics.k8s.io            kube-system/metrics-server                     True        2y30d
...

And here is the describe output for that APIService:

k describe  apiservices v1beta1.external.metrics.k8s.io
Name:         v1beta1.external.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/component=operator
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=v1beta1.external.metrics.k8s.io
              app.kubernetes.io/part-of=keda-operator
              app.kubernetes.io/version=2.12.1
              helm.sh/chart=keda-2.12.1
Annotations:  meta.helm.sh/release-namespace: keda-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2024-01-16T10:47:44Z
  Resource Version:    819888502
  UID:                 77d55c9b-3a06-430f-972d-3aeacb7b70dc
Spec:
  Ca Bundle:               LS0tLSxxxxxxxxxxxxxxxxxxx
  Group:                   external.metrics.k8s.io
  Group Priority Minimum:  100
  Service:
    Name:            keda-operator-metrics-apiserver
    Namespace:       keda-system
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2024-02-22T03:12:07Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>

Can you give me a direction for troubleshooting this? I looked at the troubleshooting guide on the homepage, but it didn't really apply to my case.

JorTurFer commented 8 months ago

This issue happens because the KEDA pods can't communicate with each other. Do you have any network policy in the cluster blocking internal traffic? KEDA's metrics server pod must be able to reach KEDA's operator.

If you deploy a random pod in the keda-system namespace and execute a curl from there to `keda-operator.keda-system.svc.cluster.local:9666`, does it work?
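
A sketch of that check (the debug image is an assumption; the cluster-domain suffix below is the Kubernetes default, which this thread goes on to show needs adjusting for this cluster):

```shell
# Build the in-cluster DNS name of the KEDA operator service, then probe it
# from a throwaway pod (nicolaka/netshoot is an assumed debug image).
SVC=keda-operator
NS=keda-system
CLUSTER_DOMAIN=cluster.local   # default; GKE with Cloud DNS may use a custom one
FQDN="${SVC}.${NS}.svc.${CLUSTER_DOMAIN}"
echo "${FQDN}"
# kubectl run net-debug --rm -it --restart=Never -n keda-system \
#   --image=nicolaka/netshoot -- curl -v "${FQDN}:9666"
```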

duydo-ct commented 8 months ago

Hi @JorTurFer, so I tested two cases with the Helm chart:

- Case 1: with the default `clusterDomain`, DNS resolution fails:

** server can't find keda-operator.keda-system.svc.cluster.local: NXDOMAIN

/workspace # curl keda-operator.keda-system.svc.cluster.local:9666
curl: (6) Could not resolve host: keda-operator.keda-system.svc.cluster.local

- Case 2: I changed `clusterDomain:` to a new value, because my GKE cluster uses GCP Cloud DNS:

# -- Kubernetes cluster domain
clusterDomain: ct.dev

and `curl` got:

/workspace # curl keda-operator.keda-system.svc.ct.dev:9666
curl: (6) Could not resolve host: keda-operator.keda-system.svc.ct.dev

So I checked the logs of `keda-operator-metrics-apiserver`:

W0226 09:42:36.952229 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda-system.svc.ct.dev:9666", ServerName: "keda-operator.keda-system.svc.ct.dev:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup keda-operator.keda-system.svc.ct.dev on 169.254.169.254:53: no such host"
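
One way to find the cluster's real DNS suffix (which is what `clusterDomain` must match) is to read the search domains a pod inherits; on a live cluster that would be `kubectl exec <any-pod> -- cat /etc/resolv.conf`. A self-contained sketch, with sample resolv.conf content standing in for a real pod's file:

```shell
# Sample content below is an assumption standing in for a real pod's
# /etc/resolv.conf on this Cloud DNS-backed GKE cluster.
resolv_conf='search keda-system.svc.gke1.ct.dev svc.gke1.ct.dev ct.dev
nameserver 169.254.169.254'
# Extract the cluster domain from the 'svc.<domain>' search entry.
domain=$(printf '%s\n' "$resolv_conf" | awk '/^search/ {
  for (i = 2; i <= NF; i++)
    if ($i ~ /^svc\./) { sub(/^svc\./, "", $i); print $i; exit }
}')
echo "$domain"   # this is the value the Helm chart needs in clusterDomain
```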


Thank you, bro

JorTurFer commented 8 months ago

So, is the service not available? What do you see as output from `kubectl get svc -o wide -n keda-system`?

duydo-ct commented 8 months ago

Hi @JorTurFer, here is the output:

kubectl get svc -o wide -n keda-system
NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)            AGE   SELECTOR
keda-admission-webhooks           ClusterIP   10.99.193.232   <none>        443/TCP            40d   app=keda-admission-webhooks
keda-operator                     ClusterIP   10.99.194.245   <none>        9666/TCP           40d   app=keda-operator
keda-operator-metrics-apiserver   ClusterIP   10.99.202.86    <none>        443/TCP,8080/TCP   40d   app=keda-operator-metrics-apiserver

JorTurFer commented 8 months ago

I can see the service there, so I don't know why the host can't be resolved 🤔 Maybe it's something related to DNS resolution in GKE? Could you try this: `curl keda-operator.keda-system:9666`?

The self generated certificate has these configurations: https://github.com/kedacore/keda/blob/b3f554899d610cc9d7c5f6a8f94b404ce829876d/pkg/certificates/certificate_manager.go#L100-L108

Why do I say this? Because if curl works using just service.namespace, you can override the host KEDA uses via the `metrics-service-address` argument: just set `--metrics-service-address=keda-operator.keda-system:9666` on the metrics server and it will use the new host without relying on the cluster DNS suffix.

duydo-ct commented 8 months ago

Hi @JorTurFer, so I had misconfigured `clusterDomain`. I updated it again like this:

# -- Kubernetes cluster domain
clusterDomain: gke1.ct.dev

Now I checked again with curl, telnet, and nslookup:

nettools:/workspace# curl keda-operator.keda-system.svc.gke1.ct.dev:9666
curl: (52) Empty reply from server
nettools:/workspace# nslookup keda-operator.keda-system.svc.gke1.ct.dev
Server:     169.254.169.254
Address:    169.254.169.254#53

Non-authoritative answer:
Name:   keda-operator.keda-system.svc.gke1.ct.dev
Address: 10.99.194.245

nettools:/workspace# telnet keda-operator.keda-system.svc.gke1.ct.dev 9666
Connected to keda-operator.keda-system.svc.gke1.ct.dev

I think DNS works, but I don't know why curl gets an empty reply. Any idea, bro?
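
For what it's worth, an empty reply from curl here is arguably a good sign: the TCP connection succeeded, but the operator port speaks gRPC (as the earlier log shows) rather than HTTP, so it hangs up without an HTTP response. A local sketch of the distinction, with a silent listener standing in for the gRPC port (ports are arbitrary, not KEDA itself):

```shell
# A listener that accepts a connection, reads the request, and closes without
# answering reproduces curl's "(52) Empty reply from server"; a closed port
# gives "(7) Failed to connect". Exit 52 therefore proves TCP reachability.
python3 - <<'EOF' &
import socket
s = socket.socket()
s.bind(("127.0.0.1", 19666))
s.listen(1)
c, _ = s.accept()
c.recv(65536)   # read the HTTP request, then hang up with no response
c.close()
EOF
sleep 0.5
curl -s http://127.0.0.1:19666/; rc_open=$?
curl -s http://127.0.0.1:19667/; rc_closed=$?
echo "silent port: exit $rc_open, closed port: exit $rc_closed"
```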

JorTurFer commented 8 months ago

The gke1 part was missing in your previous message (https://github.com/kedacore/keda/issues/5527#issuecomment-1963699347) and I bet that it's the root cause xD

Could you try updating KEDA to set the cluster domain to gke1.ct.dev? You might have to delete the secret `kedaorg-certs` within KEDA's namespace (and restart the KEDA components).

duydo-ct commented 8 months ago

Hi @JorTurFer, nice, bro. I deleted the secret `kedaorg-certs` and restarted the deployments; I checked the logs and it's okay now, bro.

kubectl delete secret kedaorg-certs -n keda-system
secret "kedaorg-certs" deleted

I'm not sure if there is any other missing config?

JorTurFer commented 8 months ago

If we don't have the communication issues, we are moving forward indeed! 😄 Could you try copying the exact query into your Prometheus? sum(ad_listing_system_logic_priority_queue_tasks_counter{deployment="ct-logic-uni-ad-listing-metrics"}[2m])

The picture doesn't show the same query, as it has no filters. The reason I ask is that `ignoreNullValues: true` can hide query errors by converting null values into 0, which could fit your case (you can try removing the property temporarily and, if I'm right, you will see errors in the KEDA operator).

duydo-ct commented 8 months ago

Hi @JorTurFer, I checked again, and the query was incorrect, bro:

sum(ad_listing_system_logic_priority_queue_tasks_counter{deployment="ct-logic-uni-ad-listing-metrics"}[2m])

and I updated the query to:

sum(ad_listing_system_logic_priority_queue_tasks_counter{app="ct-logic-uni-ad-listing-metrics"}[1m])

and I checked again and it worked as expected:

Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooManyReplicas      the desired replica count is more than the maximum replica count
Events:
  Type    Reason             Age                  From                       Message
  ----    ------             ----                 ----                       -------
  Normal  SuccessfulRescale  16m                  horizontal-pod-autoscaler  New size: 2; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  Normal  SuccessfulRescale  7m53s                horizontal-pod-autoscaler  New size: 12; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) below target
  Normal  SuccessfulRescale  6m20s                horizontal-pod-autoscaler  New size: 8; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) below target
  Normal  SuccessfulRescale  6m6s                 horizontal-pod-autoscaler  New size: 2; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) below target
  Normal  SuccessfulRescale  3m41s (x2 over 15m)  horizontal-pod-autoscaler  New size: 4; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  Normal  SuccessfulRescale  3m25s (x2 over 15m)  horizontal-pod-autoscaler  New size: 8; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  Normal  SuccessfulRescale  3m9s (x2 over 15m)   horizontal-pod-autoscaler  New size: 16; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  Normal  SuccessfulRescale  2m53s (x2 over 15m)  horizontal-pod-autoscaler  New size: 20; reason: external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: ct-logic-uni-ad-listing-consumer,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
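
As a final end-to-end check, the external metrics API that the HPA consumes can be queried directly; the metric name and ScaledObject name below are taken from the HPA events above, while the namespace is an assumption:

```shell
# Build the external.metrics.k8s.io path the HPA queries for this ScaledObject.
NS=default                       # assumption: namespace of the ScaledObject
METRIC=s0-prometheus             # metric name from the HPA events
SO_NAME=ct-logic-uni-ad-listing-consumer
API_PATH="/apis/external.metrics.k8s.io/v1beta1/namespaces/${NS}/${METRIC}"
echo "${API_PATH}"
# Query it directly; a JSON metric value back proves the whole chain
# (kube-apiserver -> KEDA metrics server -> operator -> VictoriaMetrics) works:
# kubectl get --raw "${API_PATH}?labelSelector=scaledobject.keda.sh/name=${SO_NAME}"
```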

JorTurFer commented 8 months ago

Nice! I'll close the issue as it looks solved; let me know if there is any other issue and I'll reopen it.