siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0
6.69k stars, 536 forks

Node errors when cluster.controlPlane.endpoint is set to https://kubernetes.default.svc.cluster.local #9029

Open varet80 opened 3 months ago

varet80 commented 3 months ago

I migrated from Kubeadm to Talos.

When I join a new control plane node, following the documentation and setting the service-account-issuer to match cluster.controlPlane.endpoint, I get errors in the console and the node cannot join the network.

On the other hand, everything else seems to work as it should.

If I set the endpoint to https:// and then join the cluster, the API server cannot authenticate and logs lines like:

 authentication.go:73] "Unable to authenticate the request" err="invalid bearer token"

The only way to success joining without any errors

In the screenshot, I joined a node with the endpoint set to an internal hostname, which is not resolvable.

This is a screenshot from the VM.

[Image: Screenshot 2024-07-17 at 2 02 09 PM]

smira commented 3 months ago

It doesn't make sense to set the controlplane endpoint to kubernetes.default.svc.cluster.local in any case, as this is the external way to access the controlplane (not from within a Kubernetes pod).

If your use case is to update service-account-issuer, let's make this field configurable in Talos.

varet80 commented 3 months ago

I can set the service account issuer to the internal one, but then there are a lot of other errors because it differs from the endpoint.

Also, should the service-account-issuer use the internal or the external endpoint in the case of kube-apiserver?

smira commented 3 months ago

You can find it in the Kubernetes documentation:

Identifier of the service account token issuer. The issuer will assert this identifier in "iss" claim of issued tokens. This value is a string or URI. If this option is not a valid URI per the OpenID Discovery 1.0 spec, the ServiceAccountIssuerDiscovery feature will remain disabled, even if the feature gate is set to true. It is highly recommended that this value comply with the OpenID spec: https://openid.net/specs/openid-connect-discovery-1_0.html. In practice, this means that service-account-issuer must be an https URL. It is also highly recommended that this URL be capable of serving OpenID discovery documents at {service-account-issuer}/.well-known/openid-configuration. When this flag is specified multiple times, the first is used to generate tokens and all are used to determine which issuers are accepted.
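The multiple-issuer rule in the quoted documentation can be illustrated with a minimal Python sketch. It is not kube-apiserver's implementation: it only reads the "iss" claim of a JWT without verifying the signature, and every function name, as well as the dummy-token helper, is hypothetical.

```python
import base64
import json


def jwt_issuer(token: str) -> str:
    """Return the "iss" claim of a JWT. The signature is NOT verified."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))["iss"]


def signing_issuer(issuers: list[str]) -> str:
    """The first configured issuer is the one used for newly issued tokens."""
    return issuers[0]


def issuer_accepted(token: str, issuers: list[str]) -> bool:
    """The issuer check passes if "iss" matches any configured issuer."""
    return jwt_issuer(token) in issuers


def make_unsigned_token(issuer: str) -> str:
    """Build a dummy, unsigned token (hypothetical helper for demonstration)."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{b64({'alg': 'none'})}.{b64({'iss': issuer})}."
```

Under this sketch, an issuer list ordered as load balancer endpoint first and previous issuer second would stamp new tokens with the load balancer URL while tokens carrying the previous issuer would still pass the issuer check.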

So it's not clear what you're trying to solve, but setting the controlplane endpoint to the Kubernetes service DNS name is certainly the wrong way.

varet80 commented 3 months ago

I agree. I am just confused about what the best action is here. The kubeadm migration instructions state:

Make sure that, on your current Kubeadm cluster, the first --service-account-issuer= parameter in /etc/kubernetes/manifests/kube-apiserver.yaml is equal to the value of .cluster.controlPlane.endpoint in controlplane.yaml. If it’s not, add a new --service-account-issuer= parameter with the correct value before your current one in /etc/kubernetes/manifests/kube-apiserver.yaml on all of your control planes nodes, and restart the kube-apiserver containers.

https://www.talos.dev/v1.7/advanced/migrating-from-kubeadm/#step-by-step-guide

For kubeadm, that value is the internal one (at least in many cases).

In contrast, bootstrapping a node with the right control plane endpoint (the load balancer endpoint) leads to the apiserver complaining about the token, as the URL is not the same as before: "Unable to authenticate the request" err="invalid bearer token". This happens because the apiserver parameter --service-account-issuer is also set to the LB endpoint. If this is also the best practice,

If I change the machine config to the internal URL after the node is ready, everything starts working.

Keeping the control plane on the public endpoint and adding an extra argument with the internal endpoint for that API server leads to more issues: it complains about a mismatch, as this way it registers both issuer endpoints.

Having the ability to override the parameter would probably help avoid these cases.

smira commented 3 months ago

I'm not quite sure how "the url is not the same as before" if you specify the loadbalancer endpoint, as all nodes will have the same URL for the controlplane endpoint and, transitively, for the service account issuer.

varet80 commented 3 months ago

It turns out that using the LB endpoint on all API servers stops the error. A migration from kubeadm could be: first update the API servers to your LB endpoint, and then begin the migration. Can I submit some documentation updates covering potential errors, to help people who are migrating?
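A pre-migration check along these lines could be sketched in Python as follows. This is only an illustration under stated assumptions: it scans the text of a kube-apiserver manifest for --service-account-issuer flags with a regular expression rather than parsing the YAML, and the function names are hypothetical, not part of Talos or kubeadm.

```python
import re


def service_account_issuers(manifest_text: str) -> list[str]:
    """Return every --service-account-issuer value, in order of appearance."""
    values = re.findall(r"--service-account-issuer=(\S+)", manifest_text)
    # Drop quotes that may wrap the whole flag in the manifest.
    return [value.strip("\"'") for value in values]


def first_issuer_matches(manifest_text: str, endpoint: str) -> bool:
    """True if the first issuer flag equals the control plane endpoint."""
    issuers = service_account_issuers(manifest_text)
    # Ignore a trailing slash so otherwise identical URLs compare equal.
    return bool(issuers) and issuers[0].rstrip("/") == endpoint.rstrip("/")
```

The manifest text would come from /etc/kubernetes/manifests/kube-apiserver.yaml on each control plane node, and the endpoint from .cluster.controlPlane.endpoint in controlplane.yaml; a False result on any node suggests that node's API server still needs the extra flag before migrating.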

steverfrancis commented 3 months ago

Yes, PRs are always appreciated! The file is at https://github.com/siderolabs/talos/blob/main/website/content/v1.8/advanced/migrating-from-kubeadm.md