[kube-prometheus-stack] TLS handshake error: client sent an HTTP request to an HTTPS server

kapishreshth commented 1 month ago

Describe the bug

I did a fresh checkout of the "kube-prometheus-stack" Helm chart and set it up on an AWS EKS cluster. All pods are running fine. I enabled agent mode with agentMode: true in the values.yaml file.

It scrapes pod metrics and they show up in Grafana. Everything works as expected except for one error I observed in the operator pod logs, shown in the screenshot. This TLS handshake error keeps recurring, and I'm not sure what that ip:port refers to.

There was another TLS error before this one. To fix it, I added the change below to values.yaml under the kubeEtcd ServiceMonitor component, and that worked.

serviceMonitor:
  tlsConfig:
    insecureSkipVerify: true

However, I still have no clue about the TLS error shown in the screenshot. It would be an immense help if someone could provide any input. Thank you!

Do let me know if the information is not sufficient. Please excuse the formatting, I am posting for the first time.

What's your helm version?

version.BuildInfo{Version:"v3.15.2", GitCommit:"1a500d5625419a524fdae4b33de351cc4f58ec35", GitTreeState:"clean", GoVersion:"go1.22.4"}

What's your kubectl version?

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.16-eks-a737599

Which chart?

kube-prometheus-stack in agent mode

What's the chart version?

63.1.0

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you execute and failing/misfunctioning.

helm install kube-prometheus-stack

Anything else we need to know?

No response

zeritti commented 1 month ago

> This tls handshake error keeps coming. Not sure what that ip:port is?

You'd have to determine which pod that IP belongs to, assuming it is a client on the pod network.

Prometheus operator is regularly accessed by only two client groups: Prometheus when scraping its metrics endpoint, and kube-apiserver when communicating with the webhook.

If you enable TLS in Prometheus operator, its service monitor is adjusted for TLS so that Prometheus scrapes it over TLS with an HTTPS client. As for the webhook, kube-apiserver refuses to communicate without TLS, so it is always an HTTPS client.

See whether you can find that client's IP address amongst pods' IP addresses, e.g. with a command like this:

kubectl get pod \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.status.podIP'

Depending on your permissions, you can run it against your monitoring namespace or cluster-wide (-A). I reckon that client runs outside of the monitoring stack.
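
The lookup described above could also be scripted. This is only a minimal sketch: it assumes the operator log line has the usual Go shape `http: TLS handshake error from <ip>:<port>: ...` with an IPv4 address, and the function names `extract_client_ip` and `find_pod_by_ip` are made up for illustration. It relies on `status.podIP` being usable as a pod field selector.

```shell
# Pull the client IPv4 address out of a Go "TLS handshake error from ip:port" log line.
# Prints nothing if the line does not match.
extract_client_ip() {
  printf '%s\n' "$1" |
    sed -n 's/.*TLS handshake error from \([0-9.]*\):[0-9]*:.*/\1/p'
}

# List pods in all namespaces whose pod IP equals the given address.
# Host-network pods report their node's IP, so more than one pod may match.
find_pod_by_ip() {
  kubectl get pod -A -o wide --field-selector "status.podIP=$1"
}
```

For example, `find_pod_by_ip "$(extract_client_ip "$line")"` with `$line` holding one of the error lines from the operator log. An empty result would suggest the client is not a pod on the pod network.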

erikschul commented 1 month ago

On my cluster it seems to be cilium's envoy proxy, as far as I can tell:

operator logs:

level=error caller=/opt/hostedtoolcache/go/1.23.1/x64/src/net/http/server.go:3487 msg="http: TLS handshake error from 10.10.1.63:55470: remote error: tls: bad certificate"

kubectl -n kube-system exec ds/cilium -- cilium status --all-controllers --all-health --all-redirects
...
Proxy Status:            OK, ip 10.10.1.63, 0 redirects active on ports 10000-20000, Envoy: external
...

erikschul commented 1 month ago

But there's also:

level=warn caller=/home/runner/work/prometheus-operator/prometheus-operator/pkg/server/server.go:164 msg="server TLS client verification disabled" client_ca_file=/etc/tls/private/tls-ca.crt err="stat /etc/tls/private/tls-ca.crt: no such file or directory"

Is it possible that the Helm chart doesn't configure the admission webhook correctly to use the cluster CA? The prometheus-operator has a detailed guide: https://prometheus-operator.dev/docs/platform/webhook/ but I don't see any Certificate resource being created by the chart. Perhaps cilium-envoy tries to contact the admission webhook and fails?
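
One way to check this might be to look at the `caBundle` that the API server uses to verify the webhook's serving certificate. The following is a rough sketch, not the chart's own tooling: the function names are hypothetical, the name of the ValidatingWebhookConfiguration depends on the release and is not given in this thread, and it assumes `base64` and `openssl` are available locally.

```shell
# Classify a caBundle value. When the field is empty, the API server
# falls back to its system trust roots to verify the webhook certificate.
ca_bundle_status() {
  if [ -z "$1" ]; then echo "missing"; else echo "present"; fi
}

# Show subject, issuer and expiry of the CA stored in the first webhook entry
# of a ValidatingWebhookConfiguration. "$1" is the configuration name.
show_webhook_ca() {
  bundle=$(kubectl get validatingwebhookconfiguration "$1" \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}') || return 1
  echo "caBundle: $(ca_bundle_status "$bundle")"
  [ -n "$bundle" ] || return 0
  # caBundle is base64-encoded PEM; decode it and let openssl summarise it.
  printf '%s' "$bundle" | base64 -d | openssl x509 -noout -subject -issuer -enddate
}
```

The available names can be listed with `kubectl get validatingwebhookconfigurations`, then passed to `show_webhook_ca`. A present bundle with a plausible issuer would point away from the webhook CA as the cause.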
