siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Unable to change cluster.endpoint without downtime #9609

Open NikolaiBessonov opened 3 weeks ago

NikolaiBessonov commented 3 weeks ago

Bug Report

Unable to authenticate after changing cluster.controlPlane.endpoint in machineconfig

Description

After updating `cluster.controlPlane.endpoint` to point to controlplane-3, authentication from kube-apiserver fails. The issue seems related to the `--service-account-issuer` argument, which should contain the new VIP address. It is likely occurring because the nodes are still referencing the old load balancer endpoint, resulting in a mismatch. But there is no way to set an additional value for this parameter.

Steps to Reproduce

1.  Set up an external load balancer balancing traffic between three control plane nodes on port 6443.
2.  Point cluster.controlPlane.endpoint to this load balancer.
3.  Add a VIP and assign it to the interface via `network.interfaces.interface[0].vip.ip`.
4.  Change `cluster.controlPlane.endpoint` on one of the nodes.
5.  Check the kube-apiserver logs on the node where you changed it.
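The setup in the steps above might look roughly like this in the Talos machine config. This is a sketch with placeholder values: `lb.example.com`, `10.0.0.100`, and `eth0` are assumptions, not taken from this issue.

```yaml
# Sketch of the relevant machine config fields (placeholder addresses).
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
        vip:
          ip: 10.0.0.100            # step 3: shared VIP on the interface
cluster:
  controlPlane:
    # step 2: external load balancer in front of the three control planes;
    # step 4 changes this on one node, e.g. to https://10.0.0.100:6443
    endpoint: https://lb.example.com:6443
```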

Logs

kube-apiserver on the node where you changed `cluster.endpoint`: `authentication.go:73] "Unable to authenticate the request" err="invalid bearer token"`

Environment

Question

Until you fix it, is there any way to change cluster.endpoint?

smira commented 3 weeks ago

I'm not quite sure what the question is: service-account-issuer is set to the control plane endpoint in Talos.

By downtime, I guess you mean that service account tokens stop working? (These are used only by workload pods.)

Since no communication between components in the cluster breaks when you change the endpoint, a simple restart of the pods that use service accounts will be sufficient.

NikolaiBessonov commented 3 weeks ago

@smira not quite that.

For example, if I point the endpoint on one of the control plane nodes to the new address (the VIP), the API server stops working because it can't authenticate. I think that if I change the endpoints on all my control plane nodes, some components will require a restart (such as the Cilium CNI components, which also can't authenticate against a control plane with the new endpoint), and that leads to downtime until all components are restarted.

But if there were support for adding an additional service-account-issuer parameter, where I could specify the additional (old) load balancer on the nodes, this would work without any errors or downtime. Similar to point 8 of "Migration from kubeadm. Step-by-Step guide" in the docs.

smira commented 3 weeks ago

Yes, service-account-issuer handling might be done better, but I guess it has nothing to do with Cilium.

First of all, Cilium should be configured to use Talos KubePrism endpoint - that's way better than using actual cluster endpoint.
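Pointing Cilium at KubePrism usually means setting the local KubePrism endpoint as its Kubernetes API address. A hedged sketch of the Helm values, assuming the default KubePrism port of 7445 and Cilium's standard `k8sServiceHost`/`k8sServicePort` values:

```yaml
# values.yaml sketch: point Cilium at the local KubePrism endpoint
# (localhost:7445 is the KubePrism default) instead of the external
# cluster endpoint, so changing cluster.endpoint doesn't affect Cilium.
k8sServiceHost: localhost
k8sServicePort: 7445
```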

I guess what happens is that changing the endpoint re-rolls the kube-apiserver certificate, and the old/new certificates don't match for you. That can be solved by updating certSANs. Once again, service-account-issuer should be made configurable, but your issue is something else.
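Updating certSANs so both the old and the new address stay valid in the kube-apiserver certificate might look like this (a sketch; the addresses are placeholders, not from this issue):

```yaml
# Sketch: keep both the old load balancer name and the new VIP in the
# kube-apiserver certificate SANs during the transition (placeholder values).
cluster:
  apiServer:
    certSANs:
      - lb.example.com   # old load balancer endpoint
      - 10.0.0.100       # new VIP
```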

NikolaiBessonov commented 2 weeks ago

@smira Sorry for the delayed response. We tested it and applied it to the production cluster. Changing the endpoints (on all three control planes) involves restarting all pods that are connected in any way to the kube-apiserver, such as Cilium, cert-manager, the ingress controller, etc. If you have a simple script that finds all service accounts and their pods, it is faster and takes at least one minute of downtime. Unfortunately, this is the fastest way I found. I have no more questions. We can close the issue, but it would be better if you added the ability to do this without downtime and without restarting all pods.
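A script like the one mentioned above could be sketched as follows. This is an assumption, not the script actually used in the thread; it restarts every pod that references a non-default service account, so those pods come back with tokens issued for the new endpoint. It requires `kubectl` and `jq`.

```shell
#!/bin/sh
# Sketch: restart all pods that use a non-default service account so they
# pick up tokens issued for the new endpoint. Requires kubectl and jq.

# Pure filter: given `kubectl get pods -A -o json` on stdin, print
# "namespace name" for every pod with a non-default service account.
pods_with_service_accounts() {
  jq -r '.items[]
    | select((.spec.serviceAccountName // "default") != "default")
    | "\(.metadata.namespace) \(.metadata.name)"'
}

# Deleting the pods lets their controllers recreate them with fresh tokens:
# kubectl get pods -A -o json | pods_with_service_accounts | \
#   while read -r ns name; do kubectl delete pod -n "$ns" "$name"; done
```

The filter is kept separate from the `kubectl delete` loop so it can be reviewed (or dry-run) before anything is actually restarted.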

smira commented 2 weeks ago

I think the issue itself makes sense: what you would like is support for additional service-account-issuer values, that is, the previous control plane endpoint, so that existing tokens are still considered valid.
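For reference, kube-apiserver itself already supports this: the `--service-account-issuer` flag can be passed multiple times, where the first value is used to issue new tokens and all values are accepted during verification. If Talos exposed it, the transition could look like this (the URLs are placeholders):

```
# First value signs new tokens (new VIP); remaining values stay valid
# for verification (old load balancer endpoint).
kube-apiserver \
  --service-account-issuer=https://10.0.0.100:6443 \
  --service-account-issuer=https://lb.example.com:6443
```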