kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Autoscaling stopped working because of TLS issue #5683

Closed monotek closed 1 week ago

monotek commented 4 months ago

Report

We use KEDA 2.13.1 and cert-manager to issue our TLS certs.

We recently noticed that autoscaling of our workloads, via the Prometheus trigger, stopped working.

The HPA created by KEDA had the following events:

the HPA was unable to compute the replica count: unable to get external
metric
postman/s0-prometheus/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:
postman,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to
fetch metrics from external metrics API: the server was unable to return
a response in the time allotted, but may still be processing the request
(get s0-prometheus.external.metrics.k8s.io)

The metric itself could be queried without problems.

The logs of all 3 keda-metrics-apiserver pods had errors like this:

W0410 11:34:19.738975 1 logging.go:59] [core] [Channel #1 SubChannel #5] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"x509: invalid signature: parent certificate cannot sign this kind of certificate\" while trying to verify candidate authority certificate \"keda-operator\")"

The certificate created by the cert-manager issuer was valid and had not been renewed recently.

After restarting keda-metrics-apiserver deployment it worked again.

Expected Behavior

Autoscaling works

Actual Behavior

Autoscaling stopped working

Steps to Reproduce the Problem

Not sure

Logs from KEDA operator

example

KEDA Version

2.13.1

Kubernetes Version

1.28

Platform

Microsoft Azure

Scaler Details

Prometheus

Anything else?

No response

JorTurFer commented 4 months ago

hum... It seems that the certificate isn't hot reloaded on changes. I think that we have to add a watcher for the cert files and restart the server in case of changes :/
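A rough sketch of the watcher idea, in Python for brevity. This is not KEDA's implementation (KEDA is written in Go); `CertFileWatcher` and the `on_change` callback are hypothetical names. It polls the certificate file and fires the callback whenever the content hash changes, which is where a server restart or TLS config reload would be triggered.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


class CertFileWatcher:
    """Polls a certificate file and calls `on_change` when its content changes.

    Hypothetical helper for illustration only.
    """

    def __init__(self, path: str, on_change: Callable[[bytes], None]) -> None:
        self._path = Path(path)
        self._on_change = on_change
        self._fingerprint: Optional[str] = None  # hash of the last content seen

    def check(self) -> bool:
        """Return True and invoke the callback if the content differs from the last check.

        The first call always reports a change, because nothing was loaded before.
        """
        data = self._path.read_bytes()
        fingerprint = hashlib.sha256(data).hexdigest()
        if fingerprint == self._fingerprint:
            return False
        self._fingerprint = fingerprint
        self._on_change(data)  # e.g. rebuild the TLS config or restart the server
        return True
```

Calling `check()` on a timer would be enough for a first version. Comparing content hashes rather than relying on file-system events also copes with the way Kubernetes swaps a symlink when it updates a mounted secret.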

monotek commented 4 months ago

But I'm pretty sure the cert did not change. Here is the status of the cert:

status:
  conditions:
  - lastTransitionTime: "2023-12-04T14:08:02Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 2
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2025-04-04T06:08:03Z"
  notBefore: "2024-04-04T06:08:03Z"
  renewalTime: "2024-08-03T22:08:03Z"
  revision: 2

So the only things I can think of are pod restarts because of node updates, or maybe a short k8s API downtime?

Nevertheless some sort of retry or restart might solve the issue anyway.

JorTurFer commented 4 months ago

The problem is that the log you sent shows a failure validating the cert, not just a connection failure. That's the weird part for me :/

JorTurFer commented 4 months ago

Could you check the secret that contains the cert to verify whether it was recreated? (Check the certificate's creation time, not just the secret's.) I see notBefore: "2024-04-04T06:08:03Z", so it could have been updated. If you want, you can share the tls.crt (it's not a secret, the secret is the key) and we can check the creation date.

monotek commented 4 months ago

It seems the certificate was indeed updated some days before the issue, and the renewalTime: "2024-08-03T22:08:03Z" in the certificate resource is misleading.

 k -n keda get secrets kedaorg-certs -o yaml | yq e '.data."tls.crt"' | base64 -d | openssl x509  -noout -text| grep Validity -A 2
        Validity
            Not Before: Apr  4 06:08:03 2024 GMT
            Not After : Apr  4 06:08:03 2025 GMT
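On the renewalTime confusion: in cert-manager that field is the time of the next scheduled renewal, not the last one. A quick check with the values from the status posted earlier in this thread shows how the three timestamps relate. The one-third ratio is inferred from the numbers, not from any configuration shown here, and treating lastTransitionTime as the original issue time is an assumption.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(value: str) -> datetime:
    """Parse a UTC timestamp in the form used by the Certificate status."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Values copied from the Certificate status earlier in this thread.
not_before = parse_rfc3339("2024-04-04T06:08:03Z")
not_after = parse_rfc3339("2025-04-04T06:08:03Z")
renewal_time = parse_rfc3339("2024-08-03T22:08:03Z")
last_transition = parse_rfc3339("2023-12-04T14:08:02Z")

lifetime = not_after - not_before
elapsed_at_renewal = renewal_time - not_before
ratio = elapsed_at_renewal / lifetime

print(lifetime)            # 365 days, 0:00:00
print(elapsed_at_renewal)  # 121 days, 16:00:00
print(round(ratio, 4))     # 0.3333

# Assumption: lastTransitionTime marks the original issuance. Adding the same
# interval to it lands within a second of the current notBefore.
predicted_reissue = last_transition + elapsed_at_renewal
print(predicted_reissue.isoformat())  # 2024-04-04T06:08:02+00:00
```

So the dates are consistent with a certificate first issued on 2023-12-04, re-issued on 2024-04-04 (revision 2), and due for its next renewal on 2024-08-03.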

So I guess the issue then popped up when only one of keda-operator or keda-metrics-apiserver was restarted?

At least we hit another issue like that in another cluster today: the keda-operator was OOM-killed (and likely picked up the new cert when restarting), and afterwards the same errors came up again in keda-metrics-apiserver. They were again fixed by restarting keda-metrics-apiserver.
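The hypothesis above can be written down as a small check. It assumes each component reads its certificate only at startup (the behaviour suspected in this thread, not confirmed) and that notBefore approximates the re-issue time. The start times in the usage lines are made-up examples, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_openssl_time(value: str) -> datetime:
    """Parse a timestamp as printed by `openssl x509 -text`, e.g. 'Apr  4 06:08:03 2024 GMT'."""
    normalized = " ".join(value.split())  # collapse the padding openssl adds before single-digit days
    return datetime.strptime(normalized, "%b %d %H:%M:%S %Y GMT").replace(tzinfo=timezone.utc)


def holds_stale_cert(started_at: datetime, not_before: datetime) -> bool:
    """True if the process started before the current certificate was issued."""
    return started_at < not_before


def certs_mismatch(operator_started: datetime, metrics_started: datetime, not_before: datetime) -> bool:
    """True if exactly one of the two components started before the re-issue."""
    return holds_stale_cert(operator_started, not_before) != holds_stale_cert(metrics_started, not_before)


not_before = parse_openssl_time("Apr  4 06:08:03 2024 GMT")

# Made-up start times: the operator restarted after the re-issue, the metrics server did not.
operator_started = datetime(2024, 4, 10, 11, 0, tzinfo=timezone.utc)
metrics_started = datetime(2024, 3, 20, 9, 0, tzinfo=timezone.utc)

print(certs_mismatch(operator_started, metrics_started, not_before))  # True
```

In a real cluster the start times would come from each pod's status, and a `True` result would point at the component that needs a restart.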

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 months ago

This issue has been automatically closed due to inactivity.

monotek commented 2 months ago

Not stale

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 week ago

This issue has been automatically closed due to inactivity.