Closed: typokign closed this issue 5 months ago
Yes, you're right, but the recommended upgrade sequence should always start with the control plane nodes, so upgrading workers first might not work (as in the case you described above).
Please always upgrade control plane nodes first!
Bug Report
Description
After upgrading a worker node from Talos 1.6.1 to 1.6.3, the node refused to become healthy and `talosctl logs kubelet` reported TLS errors trying to speak to the local KubePrism endpoint. It appears the KubePrism certificate is only valid for the control plane node IP(s) and the `kubernetes.default.svc.cluster.local` IP, and does not include 127.0.0.1. Downgrading the node back to 1.6.1 resolved the error and the node became healthy again.
Logs
(10.0.10.10 is my single control plane node IP, 10.96.0.1 is the IP of `kubectl -n default get svc kubernetes`)

Environment
My cluster is running the Cilium CNI, with kube-proxy disabled and KubePrism enabled (by default) to listen on localhost:7445. Following the instructions in https://www.talos.dev/v1.6/kubernetes-guides/network/deploying-cilium/, I have this Talos config patch applied to the node:
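(The exact patch isn't reproduced here; a minimal sketch of the shape the linked guide describes, not necessarily the precise file from this report, would look roughly like this:)

```yaml
# Assumed Talos machine config patch per the linked Cilium guide (sketch,
# not the exact patch from this report): disable the built-in CNI and
# kube-proxy so Cilium can replace them. KubePrism itself is enabled by
# default on port 7445 in Talos 1.6 and needs no explicit patch.
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
```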
And these relevant values in my Cilium 1.15.0 helm chart:
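(Again only a sketch, assuming the values the Talos Cilium guide recommends for a KubePrism-backed setup; the exact chart values in this report may differ:)

```yaml
# Assumed Cilium Helm values for kube-proxy replacement via KubePrism
# (sketch, not the exact values from this report):
kubeProxyReplacement: true
k8sServiceHost: localhost   # Cilium talks to the API server through KubePrism...
k8sServicePort: 7445        # ...which listens on localhost:7445 on every node
```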
Hope this helps, happy to share any more details if needed :)