siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Talos HA Installation not working with cilium #9128

Closed Syntax3rror404 closed 1 month ago

Syntax3rror404 commented 1 month ago

Bug Report

I'm coming from Rancher with the RKE2 engine. This is my first Talos installation and nothing works. I have been working with Kubernetes in enterprise environments for more than 5 years now. So what am I missing? Or is this a bug?

I basically want a simple HA Kubernetes setup without kube-proxy, using Cilium's kube-proxy replacement, which should be possible: https://www.talos.dev/v1.7/kubernetes-guides/network/deploying-cilium/

Description

OK, back to basics. I have a DNS entry pointing to the 3 master nodes. These master nodes should form the control plane with the API server.

So the gen command should be: talosctl gen config talos https://mgm.talos-ha.lab:6443 --config-patch "@patch.yaml"

# patch.yaml
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true

The next step is to push the configuration to the first master node: talosctl apply-config --insecure --nodes 192.168.35.60 --file controlplane.yaml

The master node does not get ready, but this should be OK because Cilium first needs to be installed on the cluster to get CoreDNS up and running.

Then I bootstrap with the HA endpoint:

talosctl bootstrap --nodes 192.168.35.60 --endpoints mgm.talos-ha.lab --talosconfig=./talosconfig
error executing bootstrap: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate is valid for talos-z5l-ka4, not mgm.talos-ha.lab"

Interesting, but the machine config for the control plane has a SAN:

apiServer:
    image: registry.k8s.io/kube-apiserver:v1.30.3 # The container image used in the API server manifest.
    # Extra certificate subject alternative names for the API server's certificate.
    certSANs:
        - mgm.talos-ha.lab

OK ... then with the IP of the first node instead? talosctl bootstrap --nodes 192.168.35.60

Oh, exit code 0. OK, cool.

OK Talos, can I please have the kubeconfig now so I can install Cilium?

talosctl -n mgm.talos-ha.lab kubeconfig
error constructing client: failed to determine endpoints

Hmm... OK, and again with the first control plane IP?

talosctl -n 192.168.35.60 kubeconfig
error constructing client: failed to determine **endpoints**

OK. So I have no idea what this is. Is this a bug?

I don't have a kubeconfig or anything. The cluster is just left hanging in the air.

I tried it several times, again and again, with the same result.

And I have another question: do I need KubePrism enabled? I want to use the Cilium L4 load balancer.

In this example, KubePrism is set to true.

Dashboard

 "kubelet" to be "up", service "machined" to be "up", service "syslogd" to be "up", service "trustd" to be "up", service "udevd" to be "up"                                                                                                      
 user: warning: [2024-08-08T00:39:53.602742813Z]: [talos] service[trustd](Running): Started task trustd (PID 2375) for container trustd                                                                                                          
 user: warning: [2024-08-08T00:39:53.603706813Z]: [talos] service[apid](Running): Started task apid (PID 2376) for container apid                                                                                                                
 user: warning: [2024-08-08T00:39:53.659478813Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up"                                                                                                                        
 user: warning: [2024-08-08T00:39:53.708865813Z]: [talos] service[cri](Running): Health check successful                                                                                                                                         
 user: warning: [2024-08-08T00:39:53.709439813Z]: [talos] service[etcd](Preparing): Running pre state                                                                                                                                            
 user: warning: [2024-08-08T00:39:53.709977813Z]: [talos] service[kubelet](Preparing): Running pre state                                                                                                                                         
 user: warning: [2024-08-08T00:39:53.715628813Z]: [talos] service[trustd](Running): Health check failed: dial tcp 127.0.0.1:50001: connect: connection refused                                                                                   
 user: warning: [2024-08-08T00:39:53.726596813Z]: [talos] service[etcd](Preparing): Creating service runner                                                                                                                                      
 user: warning: [2024-08-08T00:39:53.727739813Z]: [talos] service[kubelet](Preparing): Creating service runner                                                                                                                                   
 user: warning: [2024-08-08T00:39:53.842123813Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://mgm.talos.labza:     
 6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 192.168.35.60:6443: connect: connection refused"}                                                                     
 user: warning: [2024-08-08T00:39:53.868153813Z]: [talos] service[kubelet](Running): Started task kubelet (PID 2445) for container kubelet                                                                                                       
 user: warning: [2024-08-08T00:39:53.886537813Z]: [talos] service[etcd](Running): Started task etcd (PID 2461) for container etcd                                                                                                                
 user: warning: [2024-08-08T00:39:55.746212813Z]: [talos] service[kubelet](Running): Health check successful                                                                                                                                     
 user: warning: [2024-08-08T00:39:56.747117813Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://mgm.talos.labza:     
 6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 192.168.35.60:6443: connect: connection refused"}                                                                     
 user: warning: [2024-08-08T00:39:58.196113813Z]: [talos] service[apid](Running): Health check successful                                                                                                                                        
 user: warning: [2024-08-08T00:39:58.680323813Z]: [talos] service[trustd](Running): Health check successful                                                                                                                                      
 user: warning: [2024-08-08T00:39:58.735584813Z]: [talos] service[etcd](Running): Health check successful                                                                                                                                        
 user: warning: [2024-08-08T00:39:58.736203813Z]: [talos] task startAllServices (1/1): done, 6.068595451s                                                                                                                                        
 user: warning: [2024-08-08T00:39:58.737230813Z]: [talos] phase startEverything (16/16): done, 6.07011153s                                                                                                                                       
 user: warning: [2024-08-08T00:39:58.737782813Z]: [talos] boot sequence: done: 6.55773218s                                                                                                                                                       
 user: warning: [2024-08-08T00:39:58.739690813Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}                                                     
 user: warning: [2024-08-08T00:39:58.741965813Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}                                            
 user: warning: [2024-08-08T00:39:58.743259813Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}                                                     
 user: warning: [2024-08-08T00:39:59.986571813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on    
 the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}                                                                                   
 user: warning: [2024-08-08T00:40:01.517808813Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://mgm.talos.labza:     
 6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 192.168.35.60:6443: connect: connection refused"}                                                                     
 user: warning: [2024-08-08T00:40:02.297829813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\ttimeout"}                                             
 user: warning: [2024-08-08T00:40:12.048897813Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://mgm.talos.labza:     
 6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 192.168.35.60:6443: connect: connection refused"}                                                                     
 user: warning: [2024-08-08T00:40:12.742748813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\ttimeout"}                                             
 user: warning: [2024-08-08T00:40:15.651758813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on    
 the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}                                                                                   
 user: warning: [2024-08-08T00:40:23.341334813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\ttimeout"}                                             
 user: warning: [2024-08-08T00:40:31.343130813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on    
 the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}                                                                                   
 user: warning: [2024-08-08T00:40:34.382612813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\ttimeout"}                                             
 user: warning: [2024-08-08T00:40:45.415612813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\ttimeout"}                                             
 user: warning: [2024-08-08T00:40:46.781869813Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on    
 the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}                                                                                   
 user: warning: [2024-08-08T00:40:52.245607813Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?           
 fieldSelector=metadata.name%3Dtalos-z5l-ka4&limit=500&resourceVersion=0\": EOF", "error_count": 0} 

Environment

Syntax3rror404 commented 1 month ago

Got it, the SAN is for the cluster and not for the Talos machine endpoint.
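I guess if I wanted talosctl to accept the DNS name too, the place for that would be machine.certSANs rather than the apiServer certSANs, something like this (just my reading of the docs, untested):

machine:
  certSANs:
    - mgm.talos-ha.lab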

Is KubePrism needed? Or is it only needed for HA of the Talos API endpoint?

Is the kube API made HA via DNS?

ccben87 commented 1 month ago

@Syntax3rror404

My views:

I suggest you look at setting up your cluster with a floating VIP, which is controlled with etcd elections: https://www.talos.dev/v1.7/talos-guides/network/vip/
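As a minimal sketch (assuming eth0 is the interface and 192.168.35.50 is a free address in your subnet; both are placeholders), the control plane machine config patch for that looks something like:

# vip-patch.yaml
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
        vip:
          ip: 192.168.35.50

The VIP then floats between the control plane nodes via etcd elections, and you can point your cluster endpoint at it instead of round-robin DNS.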

Yes, you should use KubePrism; it's enabled by default and I wouldn't disable it if I were you. It's recommended for Cilium in particular. See https://www.talos.dev/v1.7/kubernetes-guides/configuration/kubeprism/
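If you want to pin it down explicitly in the machine config, it looks roughly like this (this mirrors the default in recent Talos releases; 7445 is the documented default port):

machine:
  features:
    kubePrism:
      enabled: true
      port: 7445

Cilium can then use localhost:7445 as its Kubernetes API endpoint, which is what the Talos Cilium guide sets via k8sServiceHost/k8sServicePort.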

For talosctl, you don't need to worry about load balancing. Your talosconfig file will contain all your control plane node IPs, and to quote the documentation: The talosctl tool provides built-in client-side load-balancing across control plane nodes, so usually you do not need to configure a load balancer for the Talos API.
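For example (the second and third IPs below are placeholders for your other control plane nodes), you can record all control plane endpoints in your talosconfig and let talosctl balance across them on the client side:

export TALOSCONFIG=./talosconfig
talosctl config endpoint 192.168.35.60 192.168.35.61 192.168.35.62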

Steps:

  1. Generate config.
  2. Apply config to each HA node (you'll probably have a 3-node cluster).
  3. Create any worker nodes and apply config.
  4. Run the following:
    export TALOSCONFIG="_out/talosconfig"
    talosctl config endpoint $CONTROL_PLANE_IP
    talosctl config node $CONTROL_PLANE_IP
  5. Run talosctl bootstrap and wait a little for the node to boot.
  6. Retrieve the kubeconfig with talosctl kubeconfig . and then cp kubeconfig ~/.kube/config
  7. Install Cilium as per https://www.talos.dev/v1.7/kubernetes-guides/network/deploying-cilium/#machine-config-preparation

I recommend Method 1, because Cilium has moved from supporting both a CLI-based installer and Helm to Helm-only installs, with the Helm install being performed by the CLI if you do use the CLI. Also, I note that when you use methods 2, 3, or 4, Cilium doesn't show up as an installed Helm package, which would make upgrades annoying to handle. Method 5 is new to me and might work well if you're on a Cilium version where the CLI installer uses Helm underneath.
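For reference, the Helm install from that guide looks roughly like the sketch below. The chart version is just an example, and k8sServiceHost/k8sServicePort assume KubePrism on its default port 7445; check the linked guide for the values that match your Talos and Cilium releases.

helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium \
    --version 1.15.6 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=true \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set k8sServiceHost=localhost \
    --set k8sServicePort=7445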

Syntax3rror404 commented 1 month ago

@ccben87

Wow, OK, this is extremely cool... Got it working now with the VIP.

I use Rancher on top of Talos. Rancher is not that fast at updating to the newest Kubernetes version.

Do you have any recommendations about updating the OS and Kubernetes?

ccben87 commented 1 month ago

Do you have any recommendations about updating the OS and Kubernetes?

I would recommend you update Talos when a new stable version is out. Maybe don't move to a new major version right away; wait for a few minor releases on it before updating.

With regards to Kubernetes, that's a more complicated question and will depend on your workloads and on whether there are features or fixes you want from the newer versions. You should test your workloads in non-prod (which should be as identical to prod as you can make it), and if they work there, feel free to upgrade as new releases come out. Otherwise, you'll need to pay attention to whether your workloads will work with the new Kubernetes version. A good example of why you should be cautious with your workloads is Postgres, as described in https://www.linkedin.com/pulse/kubernetes-silent-pod-killer-invisible-oom-kill-containers-secondary-gccle, where you may run into problems if your Postgres workload triggers an OOM kill because of the limits you set on it in K8s.

Usually it's a good idea to adopt an N-1 update posture for most software.