siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Regression: Kubeprism TLS validation error in Talos 1.6.3 #8254

Closed (typokign closed this issue 5 months ago)

typokign commented 5 months ago

Bug Report

Description

After upgrading a worker node from Talos 1.6.1 to 1.6.3, the node never became healthy, and talosctl logs kubelet showed TLS errors when the kubelet tried to reach the local KubePrism endpoint.

It appears the KubePrism certificate is only valid for the control plane node IP(s) and the kubernetes.default.svc.cluster.local service IP, and does not include 127.0.0.1.

Downgrading the node back to 1.6.1 resolved the error and the node became healthy again.

Logs

10.0.10.11: {"ts":1707002771809.609,"caller":"kubelet/kubelet_node_status.go:73","msg":"Attempting to register node","v":0,"node":{"name":"k8s-worker-1"}}
10.0.10.11: {"ts":1707002771811.513,"caller":"kubelet/kubelet_node_status.go:96","msg":"Unable to register node with API server","node":{"name":"k8s-worker-1"},"err":"Post \"https://127.0.0.1:7445/api/v1/nodes\": tls: failed to verify certificate: x509: certificate is valid for 10.0.10.10, 10.96.0.1, not 127.0.0.1"}

(10.0.10.10 is the IP of my single control plane node; 10.96.0.1 is the ClusterIP shown by kubectl -n default get svc kubernetes)
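To make the mismatch in the log above easier to spot, here is a minimal Python sketch that pulls the accepted names and the rejected name out of a kubelet log line. The helper name parse_san_mismatch and the regular expression are illustrative assumptions based only on the message format shown in this report; this is not part of Talos or the kubelet.

```python
import json
import re

# Kubelet log line copied from the report above.
LOG_LINE = r'''{"ts":1707002771811.513,"caller":"kubelet/kubelet_node_status.go:96","msg":"Unable to register node with API server","node":{"name":"k8s-worker-1"},"err":"Post \"https://127.0.0.1:7445/api/v1/nodes\": tls: failed to verify certificate: x509: certificate is valid for 10.0.10.10, 10.96.0.1, not 127.0.0.1"}'''

# Assumed message shape, taken from the single log line in this report.
SAN_MISMATCH = re.compile(
    r"x509: certificate is valid for (?P<valid>.+), not (?P<wanted>\S+)$"
)


def parse_san_mismatch(log_line):
    """Return (valid_names, requested_name), or None if the line is not a SAN mismatch."""
    entry = json.loads(log_line)
    match = SAN_MISMATCH.search(entry.get("err", ""))
    if match is None:
        return None
    valid = [name.strip() for name in match.group("valid").split(",")]
    return valid, match.group("wanted")


if __name__ == "__main__":
    valid, wanted = parse_san_mismatch(LOG_LINE)
    print(f"certificate covers {valid}, but the client dialed {wanted}")
```

For the line above, the helper reports that the certificate covers 10.0.10.10 and 10.96.0.1 while the kubelet dialed 127.0.0.1, the KubePrism address.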

Environment

My cluster is running the Cilium CNI, with kube-proxy disabled and KubePrism enabled (the default), listening on localhost:7445. Following the instructions in https://www.talos.dev/v1.6/kubernetes-guides/network/deploying-cilium/, I have this Talos config patch applied to the node:

### Cilium
- op: add
  path: /cluster/network/cni
  value:
    name: none
- op: add
  path: /cluster/proxy
  value:
    disabled: true

And these are the relevant values for my Cilium 1.15.0 Helm chart:

cilium:
  kubeProxyReplacement: true
  k8sServiceHost: localhost
  k8sServicePort: 7445

Hope this helps, happy to share any more details if needed :)

smira commented 5 months ago

Yes, you're right, but the recommended upgrade sequence always starts with the controlplane nodes, so upgrading workers first might not work (as in the case you described above).

Please always upgrade controlplanes first!
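As a rough illustration of that ordering, the sketch below sorts a node list so that controlplane nodes come before workers and prints one upgrade command per node. The node inventory format, the role labels, and the upgrade_order helper are hypothetical; the IPs come from the report above, and the talosctl invocation and installer image tag should be checked against talosctl upgrade --help and the Talos upgrade guide for your version.

```python
def upgrade_order(nodes):
    """Return nodes with controlplane entries first, workers after."""
    # sorted() is stable, so nodes keep their relative order within each role.
    # The key is False (sorts first) for controlplane and True for everything else.
    return sorted(nodes, key=lambda node: node["role"] != "controlplane")


# Hypothetical inventory using the two IPs from the report above.
NODES = [
    {"ip": "10.0.10.11", "role": "worker"},
    {"ip": "10.0.10.10", "role": "controlplane"},
]

if __name__ == "__main__":
    for node in upgrade_order(NODES):
        # Assumed command form; verify flags and image tag before running.
        print(
            f"talosctl upgrade --nodes {node['ip']} "
            "--image ghcr.io/siderolabs/installer:v1.6.3"
        )
```

With this inventory the controlplane node 10.0.10.10 is listed before the worker 10.0.10.11, matching the recommended sequence.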