siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Cilium not get installed with TF #8494

Open Issen007 opened 5 months ago

Issen007 commented 5 months ago

Bug Report

I am trying to deploy a Talos cluster in AWS via Terraform. As soon as I disable kube-proxy and enable Cilium as the default CNI, the deployment gets stuck. The only way I have found to install Cilium is to create the Talos cluster with kube-proxy and then install Cilium as a post-installation step.

Description

I'm using the following Terraform template, with a few modifications where we create a new VPC: https://github.com/isovalent/terraform-aws-talos

But as soon as we deploy the environment, it gets stuck because Cilium does not get installed. As soon as we comment out

cni = {
  name = "none"
},

The installation continues and we get a Talos Cluster up and running.

Logs

kubectl describe nodes ip-10-0-4-40 
Name:               ip-10-0-4-40
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-4-40
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    talos.dev/owned-labels: ["node-role.kubernetes.io/control-plane"]
                    talos.dev/owned-taints: ["node-role.kubernetes.io/control-plane"]
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 25 Mar 2024 13:46:37 +0100
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
                    node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-4-40
  AcquireTime:     <unset>
  RenewTime:       Tue, 26 Mar 2024 09:52:56 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 26 Mar 2024 09:51:14 +0100   Mon, 25 Mar 2024 13:46:37 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 26 Mar 2024 09:51:14 +0100   Mon, 25 Mar 2024 13:46:37 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 26 Mar 2024 09:51:14 +0100   Mon, 25 Mar 2024 13:46:37 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 26 Mar 2024 09:51:14 +0100   Mon, 25 Mar 2024 13:46:37 +0100   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Addresses:
  InternalIP:  10.0.4.40
  Hostname:    ip-10-0-4-40
Capacity:
  cpu:                2
  ephemeral-storage:  49932Mi
  hugepages-2Mi:      0
  memory:             1959156Ki
  pods:               110
Allocatable:
  cpu:                1950m
  ephemeral-storage:  46853311615
  hugepages-2Mi:      0
  memory:             1660148Ki
  pods:               110
System Info:
  Machine ID:                 6da8f3b88cc8a2fd83019be63f98e76c
  System UUID:                ec251856-1324-369c-330a-ef457cdcd067
  Boot ID:                    60826cf0-98de-44ba-8b34-bfff15e66521
  Kernel Version:             6.1.80-talos
  OS Image:                   Talos (v1.6.6)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.13
  Kubelet Version:            v1.29.2
  Kube-Proxy Version:         v1.29.2
PodCIDR:                      100.64.3.0/24
PodCIDRs:                     100.64.3.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 kube-apiserver-ip-10-0-4-40             200m (10%)    0 (0%)      512Mi (31%)      0 (0%)         20h
  kube-system                 kube-controller-manager-ip-10-0-4-40    50m (2%)      0 (0%)      256Mi (15%)      0 (0%)         20h
  kube-system                 kube-scheduler-ip-10-0-4-40             10m (0%)      0 (0%)      64Mi (3%)        0 (0%)         20h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                260m (13%)   0 (0%)
  memory             832Mi (51%)  0 (0%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:              <none>

Environment

frezbo commented 5 months ago

It looks like CSRs are enabled, so you need to approve the CSRs manually.
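For reference, a minimal sketch of how pending kubelet-serving CSRs could be picked out and approved in bulk. The helper name `pending_csrs` is made up for illustration, and it assumes the default column layout of `kubectl get csr` (NAME, AGE, SIGNERNAME, REQUESTOR, REQUESTEDDURATION, CONDITION). Whether approving them is enough to unblock this particular cluster is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the table printed by `kubectl get csr` on stdin
# and print the names of CSRs that are Pending for the given signer
# (defaults to the kubelet-serving signer).
pending_csrs() {
  local signer="${1:-kubernetes.io/kubelet-serving}"
  # Skip the header row, match the SIGNERNAME column, and require the
  # last column (CONDITION) to be exactly "Pending".
  awk -v signer="$signer" 'NR > 1 && $3 == signer && $NF == "Pending" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get csr | pending_csrs | while read -r name; do
#     kubectl certificate approve "$name"
#   done
```

Filtering on the signer name keeps the loop from approving unrelated requests; review the list before approving anything in a real environment.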

Issen007 commented 5 months ago

@frezbo You may be right, but when I try to approve the pending CSRs I get "No resources found":

$ kubectl get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                   REQUESTEDDURATION   CONDITION
csr-2zqft   70s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-40    <none>              Pending
csr-497p4   31m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-194   <none>              Pending
csr-4ftvr   31m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-124   <none>              Pending
csr-6q6sd   46m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:0dfxfl     <none>              Approved,Issued
csr-82kgd   70s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-65    <none>              Pending
csr-8hdns   46m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:0dfxfl     <none>              Approved,Issued
csr-b7x77   46m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-6-78    <none>              Pending
csr-c9j4n   46m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-194   <none>              Pending
csr-fwfp8   16m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-6-78    <none>              Pending
csr-gbvfs   16m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-40    <none>              Pending
csr-jsn69   46m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-65    <none>              Pending
csr-k6xss   16m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-194   <none>              Pending
csr-l99bw   31m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-6-78    <none>              Pending
csr-mfbmf   31m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-40    <none>              Pending
csr-mwr26   46m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:0dfxfl     <none>              Approved,Issued
csr-s8wd5   46m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:0dfxfl     <none>              Approved,Issued
csr-sfpgn   46m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:0dfxfl     <none>              Approved,Issued
csr-shhdn   16m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-65    <none>              Pending
csr-t6zst   72s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-194   <none>              Pending
csr-t7nqd   46m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-124   <none>              Pending
csr-twcg6   71s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-6-78    <none>              Pending
csr-vtnvl   46m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-40    <none>              Pending
csr-x52sh   73s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-124   <none>              Pending
csr-xs6xd   31m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-4-65    <none>              Pending
csr-xxhwq   16m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-5-124   <none>              Pending

$ kubectl certificate approve csr-2zqft
No resources found
error: the server doesn't have a resource type "certificatesigningrequests"

frezbo commented 5 months ago

I'm not sure why the CSR is not being found; the error seems super weird.

JimKlapwijk commented 5 months ago

As far as I'm aware this is a "known" issue when using an alternative CNI: https://www.talos.dev/v1.6/kubernetes-guides/network/deploying-cilium/#method-1-helm-install

After applying the machine config and bootstrapping Talos will appear to hang on phase 18/19 with the message: retrying error: node not ready. This happens because nodes in Kubernetes are only marked as ready once the CNI is up. As there is no CNI defined, the boot process is pending and will reboot the node to retry after 10 minutes, this is expected behavior.

So you have to manually set up a CNI of your choice.

EDIT: Or host the templated manifest file and use it in a machine config patch.
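A rough sketch of what such a patch could look like, assuming the Cilium manifest has already been rendered (for example with `helm template`) and uploaded somewhere the nodes can reach. The function name `cilium_cni_patch`, the URL, and the file names are placeholders; the `cluster.network.cni` and `cluster.proxy` fields are the machine config fields as I understand them from the Talos docs linked above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Talos machine-config patch that disables
# kube-proxy and tells Talos to apply a CNI manifest from the given URL.
cilium_cni_patch() {
  local manifest_url="$1"
  cat <<EOF
cluster:
  network:
    cni:
      name: custom
      urls:
        - ${manifest_url}
  proxy:
    disabled: true
EOF
}

# Usage (placeholder URL, cluster name, and endpoint; not executed here):
#   cilium_cni_patch "https://example.com/cilium.yaml" > cilium-patch.yaml
#   talosctl gen config my-cluster https://mycluster.local:6443 \
#     --config-patch @cilium-patch.yaml
```

With `name: custom` Talos fetches and applies the manifest during bootstrap, so the nodes should not be left waiting for a CNI the way they are with `name = "none"`.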