Calico Node Pod Failing - cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized #8368

projectcalico / calico

Closed shashank-omre-cldcvr closed 3 weeks ago

shashank-omre-cldcvr commented 6 months ago

Fresh Calico operator (v3.27.0) installation on an EKS 1.24 cluster.

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml

NAME                                       READY   STATUS                  RESTARTS        AGE
calico-kube-controllers-6bf98f54bb-bqtmq   1/1     Running                 0               15h
calico-node-47bvv                          0/1     Init:CrashLoopBackOff   167 (75s ago)   13h
calico-node-7wll8                          0/1     Init:CrashLoopBackOff   2 (23s ago)     70s
calico-node-br5qp                          0/1     Init:CrashLoopBackOff   6 (2m26s ago)   8m16s
calico-node-gln78                          0/1     Init:CrashLoopBackOff   8 (5m13s ago)   22m
calico-typha-5d4689887f-s9qkr              1/1     Running                 0               15h
calico-typha-5d4689887f-wdgdg              1/1     Running                 0               15h
csi-node-driver-26z8d                      2/2     Running                 0               15h
csi-node-driver-pzlhk                      2/2     Running                 0               15h
csi-node-driver-r4jtl                      2/2     Running                 0               71s
csi-node-driver-x8qn4                      2/2     Running                 0               15h

Expected Behavior

All pods should be in the Running state.

Current Behavior

The calico-node pods fail with this error:

2023-12-21 08:32:31.622 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized
2023-12-21 08:32:31.622 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized

Possible Solution

Steps to Reproduce (for bugs)

It's a fresh installation.

2023-12-21 08:32:31.622 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized
2023-12-21 08:32:31.622 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized

Context

Your Environment

Calico version - v3.27.0
Orchestrator version (e.g. kubernetes, mesos, rkt): EKS 1.24

caseydavenport commented 6 months ago

Looks like a potential RBAC issue - the cni-installer needs permissions to provision new tokens for the CNI plugin to use.

What does this show:

kubectl get clusterrole calico-node -o yaml

shashank-omre-cldcvr commented 6 months ago

@caseydavenport

 ➜  cc-helm-charts kubectl get clusterrole calico-node -o yaml    
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2023-12-20T16:37:09Z"
  finalizers:
  - tigera.io/cni-protector
  name: calico-node
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: 83eca255-2014-438c-af6d-1d22c1f0a986
  resourceVersion: "173435260"
  uid: 4f399acf-e2e0-4efc-ab07-6b524be4b60a
rules:
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - ""
  resourceNames:
  - calico-cni-plugin
  resources:
  - serviceaccounts/token
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpfilters
  - bgpconfigurations
  - bgppeers
  - bgpfilters
  - blockaffinities
  - clusterinformations
  - felixconfigurations
  - globalnetworkpolicies
  - stagedglobalnetworkpolicies
  - globalnetworksets
  - hostendpoints
  - ipamblocks
  - ippools
  - ipreservations
  - networkpolicies
  - stagedkubernetesnetworkpolicies
  - stagednetworkpolicies
  - networksets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - caliconodestatuses
  verbs:
  - get
  - list
  - watch
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalbgpconfigs
  - globalfelixconfigs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - clusterinformations
  - felixconfigurations
  - ippools
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  - ipamconfigs
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - policy
  resourceNames:
  - calico-node
  resources:
  - podsecuritypolicies
  verbs:
  - use

jon-nfc commented 5 months ago

Is there a fix for this? I have the same problem on a clean install using the operator (kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml). I even tried the master branch manifest, still to no avail. The only version that works is a non-operator install with the 3.25.0 manifest.

root@7dd5f0bf32b32b93:/home/deploy# kubectl logs -f -n calico-system calico-node-wmdhj -c install-cni
2024-01-27 09:15:55.401 [INFO][1] cni-installer/<nil> <nil>: Running as a Kubernetes pod
2024-01-27 09:15:56.826 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bandwidth"
2024-01-27 09:15:56.826 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
2024-01-27 09:15:56.955 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico"
2024-01-27 09:15:56.955 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
2024-01-27 09:15:57.058 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico-ipam"
2024-01-27 09:15:57.058 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
2024-01-27 09:15:57.063 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/flannel"
2024-01-27 09:15:57.063 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
2024-01-27 09:15:57.072 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/host-local"
2024-01-27 09:15:57.072 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
2024-01-27 09:15:57.202 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/install"
2024-01-27 09:15:57.202 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/install
2024-01-27 09:15:57.208 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/loopback"
2024-01-27 09:15:57.208 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
2024-01-27 09:15:57.215 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/portmap"
2024-01-27 09:15:57.215 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
2024-01-27 09:15:57.221 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/tuning"
2024-01-27 09:15:57.221 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
2024-01-27 09:15:57.221 [INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

2024-01-27 09:15:57.260 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.27.0

2024-01-27 09:15:57.260 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
2024-01-27 09:15:57.260 [WARNING][1] cni-installer/<nil> <nil>: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-01-27 09:15:57.284 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized
2024-01-27 09:15:57.284 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized
root@7dd5f0bf32b32b93:/home/deploy#

caseydavenport commented 5 months ago

For anyone encountering this issue, please run the following command to determine if your cluster's RBAC is correct:

kubectl auth can-i create serviceaccounts/calico-cni-plugin -n calico-system --subresource token --as "system:serviceaccount:calico-system:calico-node"

This should return yes, indicating that calico-node is allowed to create tokens for the CNI plugin. If this returns yes, then Calico's RBAC is configured correctly and the Unauthorized likely means something else - for example, bad certificates being provided to the containers.

If this returns no, that indicates that something is wrong with the RBAC in the cluster.
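For anyone who wants to see what this check evaluates, here is a minimal Python sketch of how one RBAC rule is matched against a request. It is a simplified model (only the plain `*` wildcard, no aggregation, no bindings), not the API server's implementation, and the function name `rule_allows` is hypothetical. The rule is the serviceaccounts/token rule from the calico-node ClusterRole posted above.

```python
def rule_allows(rule, verb, api_group, resource, name):
    """Return True if a single RBAC rule permits the request (simplified model)."""
    def matches(allowed, value):
        return "*" in allowed or value in allowed

    if not matches(rule.get("verbs", []), verb):
        return False
    if not matches(rule.get("apiGroups", []), api_group):
        return False
    if not matches(rule.get("resources", []), resource):
        return False
    # An empty or missing resourceNames list means "any name".
    names = rule.get("resourceNames", [])
    return not names or name in names


# Rule copied from the calico-node ClusterRole output in this thread.
token_rule = {
    "apiGroups": [""],
    "resourceNames": ["calico-cni-plugin"],
    "resources": ["serviceaccounts/token"],
    "verbs": ["create"],
}

# Token creation for calico-cni-plugin is covered by the rule...
print(rule_allows(token_rule, "create", "", "serviceaccounts/token", "calico-cni-plugin"))
# ...but token creation for any other service account is not.
print(rule_allows(token_rule, "create", "", "serviceaccounts/token", "calico-node"))
```

A matching rule is necessary but not sufficient: it also has to be bound to the calico-node service account. Note too that the API server authenticates a request before it authorizes it, so an Unauthorized (401) response means the caller's credentials were rejected before RBAC was consulted, whereas an RBAC denial shows up as Forbidden (403).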

Brice187 commented 4 months ago

15:38:15 lars@d04  ~ » devkubectl -n calico-system logs calico-node-7wvn8 -c install-cni
2024-02-16 14:38:11.600 [INFO][1] cni-installer/<nil> <nil>: Running as a Kubernetes pod
2024-02-16 14:38:11.620 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bandwidth"
2024-02-16 14:38:11.620 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
2024-02-16 14:38:11.807 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico"
2024-02-16 14:38:11.807 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
2024-02-16 14:38:11.977 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico-ipam"
2024-02-16 14:38:11.977 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
2024-02-16 14:38:11.989 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/flannel"
2024-02-16 14:38:11.989 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
2024-02-16 14:38:12.005 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/host-local"
2024-02-16 14:38:12.005 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
2024-02-16 14:38:12.016 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/loopback"
2024-02-16 14:38:12.016 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
2024-02-16 14:38:12.030 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/portmap"
2024-02-16 14:38:12.031 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
2024-02-16 14:38:12.042 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/tuning"
2024-02-16 14:38:12.043 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
2024-02-16 14:38:12.043 [INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

2024-02-16 14:38:12.098 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.28.0-0.dev-386-g3a8d575515b1

2024-02-16 14:38:12.098 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
2024-02-16 14:38:12.098 [WARNING][1] cni-installer/<nil> <nil>: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-02-16 14:38:12.112 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized
2024-02-16 14:38:12.112 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized

15:38:23 lars@d04  ~ » devkubectl auth can-i create serviceaccounts/calico-cni-plugin -n calico-system --subresource token --as "system:serviceaccount:calico-system:calico-node"
yes

Enquier commented 4 months ago

@caseydavenport I am getting a similar result to @Brice187, using an operator install of 3.27.0.

2024-03-04 21:17:05.825 [ERROR][58] cni-config-monitor/token_watch.go 114: Unable to create token for CNI kubeconfig error=Unauthorized
2024-03-04 21:17:05.825 [ERROR][58] cni-config-monitor/token_watch.go 138: Failed to update CNI token, retrying... error=Unauthorized
2024-03-04 21:17:15.265 [ERROR][58] cni-config-monitor/token_watch.go 114: Unable to create token for CNI kubeconfig error=Unauthorized
2024-03-04 21:17:15.265 [ERROR][58] cni-config-monitor/token_watch.go 138: Failed to update CNI token, retrying... error=Unauthorized
2024-03-04 21:17:22.037 [ERROR][58] cni-config-monitor/token_watch.go 114: Unable to create token for CNI kubeconfig error=Unauthorized
2024-03-04 21:17:22.037 [ERROR][58] cni-config-monitor/token_watch.go 138: Failed to update CNI token, retrying... error=Unauthorized
2024-03-04 21:17:30.585 [ERROR][58] cni-config-monitor/token_watch.go 114: Unable to create token for CNI kubeconfig error=Unauthorized
2024-03-04 21:17:30.585 [ERROR][58] cni-config-monitor/token_watch.go 138: Failed to update CNI token, retrying... error=Unauthorized

[user@host ~]# kubectl auth can-i create serviceaccounts/calico-cni-plugin -n calico-system --subresource token --as "system:serviceaccount:calico-system:calico-node"
yes

Enquier commented 4 months ago

It seems to happen consistently across all of my nodes. So far the only thing that temporarily (sometimes?) fixes the issue is to do:

rm -rf /etc/cni/net.d/* && systemctl restart kubelet

caseydavenport commented 4 months ago

The fact that authorization believes that the CNI plugin is authorized to create tokens suggests this is likely a problem with the certificates and/or tokens being used by the CNI plugin in order to make the request, rather than RBAC configuration itself.

I'd first check that your node clocks are properly set and synchronized - this can cause issues where a valid token or certificate can be treated as invalid if the authorizing node's clock is not set properly.
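As a rough illustration of the clock point, the Python sketch below checks a token's `nbf` and `exp` claims against a verifier clock that may be wrong. The function name `token_valid_at` is hypothetical and the claim values are sample numbers for a 24-hour token; this only models the time-window check, not signature or audience validation.

```python
from datetime import datetime, timedelta, timezone


def token_valid_at(claims, now):
    """Return True if `now` falls inside the token's [nbf, exp) window."""
    ts = now.timestamp()
    return claims["nbf"] <= ts < claims["exp"]


# Sample claims (assumption): a token that is valid for 24 hours.
claims = {"nbf": 1709594663, "exp": 1709681063}
issued = datetime.fromtimestamp(claims["nbf"], tz=timezone.utc)

# A verifier with a correct clock accepts the token one hour after issue.
print(token_valid_at(claims, issued + timedelta(hours=1)))
# A verifier whose clock runs two days fast treats the same token as expired.
print(token_valid_at(claims, issued + timedelta(days=2, hours=1)))
# A verifier whose clock runs ten minutes slow treats it as not yet valid.
print(token_valid_at(claims, issued - timedelta(minutes=10)))
```

Comparing `date -u` on each node and on the control plane against a trusted time source is a quick way to rule this out.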

Another case where I have seen this happen is when nodes or other state have been restored from a backup in some capacity, leaving cached secrets, etc. that are no longer valid.

caseydavenport commented 4 months ago

Other things that would be interesting to see:

Parse the JWT in the /etc/cni/net.d/10-calico.conflist file to see if it looks valid, and what its expiry time is.
See if you can find the JWT being used by calico-node and do the same. This might be trickier as I am not sure it's ever written to disk, but it might be possible.
Enable verbose logging in the API server and see if it provides any more context as to why it's rejecting the requests.
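For the JWT parsing step, a minimal Python sketch along these lines should be enough. It decodes the payload without verifying the signature, so it is only suitable for inspecting claims such as `iat` and `exp`. The function names and the demo token are hypothetical; the demo token is built locally from sample values and carries a fake signature.

```python
import base64
import json
from datetime import datetime, timezone


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # base64url segments omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def describe_lifetime(claims):
    """Return the iat and exp claims as ISO-8601 UTC strings."""
    return {
        key: datetime.fromtimestamp(claims[key], tz=timezone.utc).isoformat()
        for key in ("iat", "exp")
        if key in claims
    }


def _b64url(data):
    """Encode a dict as an unpadded base64url JSON segment (demo helper)."""
    return base64.urlsafe_b64encode(json.dumps(data).encode()).rstrip(b"=").decode()


# Demo token (assumption): sample claims and a placeholder signature segment.
demo = ".".join([
    _b64url({"alg": "RS256"}),
    _b64url({"iat": 1709594663, "exp": 1709681063}),
    "fake-signature",
])
print(describe_lifetime(decode_jwt_payload(demo)))
```

To use it on a real node, pass in the bearer token string from the kubeconfig that the CNI plugin uses instead of the demo token.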

caseydavenport commented 4 months ago

Is everyone encountering this issue running on EKS?

Enquier commented 4 months ago

@caseydavenport I am running on AWS (but not EKS)

Enquier commented 4 months ago

There is no JWT in my file:

{
                          "name": "k8s-pod-network",
                          "cniVersion": "0.3.1",
                          "plugins": [{"container_settings":{"allow_ip_forwarding":false},"datastore_type":"kubernetes","ipam":{"assign_ipv4":"true","assign_ipv6":"false","type":"calico-ipam"},"kubernetes":{"k8s_api_root":"https://20.96.0.1:443","kubeconfig":"/etc/cni/net.d/calico-kubeconfig"},"log_file_max_age":30,"log_file_max_count":10,"log_file_max_size":100,"log_file_path":"/var/log/calico/cni/cni.log","log_level":"Info","mtu":0,"nodename_file_optional":false,"policy":{"type":"k8s"},"type":"calico"},{"capabilities":{"bandwidth":true},"type":"bandwidth"},{"capabilities":{"portMappings":true},"snat":true,"type":"portmap"}]
                        }

There is in the kubeconfig:

{"header": {"alg":"RS256","kid":"<key_id>"}, "payload": {"aud":["https://kubernetes.default.svc.k8s-k8s"],"exp":1709681063,"iat":1709594663,"iss":"https://kubernetes.default.svc.k8s-k8s","kubernetes.io":{"namespace":"calico-system","serviceaccount":{"name":"calico-cni-plugin","uid":"<uid>"}},"nbf":1709594663,"sub":"system:serviceaccount:calico-system:calico-cni-plugin"}}

caseydavenport commented 4 months ago

There is no JWT in my file:

D'oh, yes it should be in the kubeconfig.

That token has:

Issued at: Mon Mar 04 2024 23:24:23 GMT+0000
Expires:   Tue Mar 05 2024 23:24:23 GMT+0000

Which checks out, but unfortunately it doesn't tell us much about the calico-node token, other than that it was valid at the "Issued at" time. The calico-node token is ultimately the one in question here. Not sure yet how to get access to that one.

The other avenue to explore here is whether Kubernetes is having trouble refreshing the token it assigns to calico/node.

Kubernetes gets a short-lived, automatically rotating token using the TokenRequest API and mounts the token as a projected volume.

If there is an issue with the TokenRequest or with the projected volume it might manifest in this way. Kubernetes / kubelet / apiserver logs might show more.

tomastigera commented 3 months ago

Any update on this?

clbx commented 3 months ago

I also ran into this after a power outage; RBAC is valid. Running on Debian 12, Kubernetes 1.28.2, Calico 3.26.4 using Tigera Operator 1.30.9.

The JWT in the kubeconfig is not expired yet, though; it's still valid for another 19 hours.

I have this error in my API server:

authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, service account calico-system/calico-node has been deleted]"

but that Service Account does still exist.

Deleting the Service Account and allowing the operator to re-create it appears to have solved the issue for me.

I also tried @Enquier's fix of deleting the cni dir and restarting the kubelet, but this didn't fix it for me immediately.

Not sure why this is happening, but I think it has something to do with the operator keeping the service account alive while the k8s API thinks it's removed. As soon as I removed the finalizer from the service account, it was deleted and re-created.

caseydavenport commented 3 months ago

Not sure why this is happening, but I think it has something to do with the operator keeping the service account alive while the k8s API thinks it's removed. As soon as I removed the finalizer from the service account, it was deleted and re-created.

This is an awesome piece of the puzzle, thank you!

Do you have any idea why the service account may have been terminating in the first place? I wouldn't expect a power outage to cause that.

There is a known issue preventing graceful termination of that service account when the Installation resource is deleted; it will be fixed in v3.28. I wonder if that is related, or if the use of a finalizer on that serviceaccount more generally is related to this issue in other clusters.

For anyone still encountering this issue, could you please show the output of this command as well? @Enquier @Brice187 @jon-nfc

kubectl get serviceaccount -n calico-system calico-node -o yaml

clbx commented 3 months ago

Do you have any idea why the service account may have been terminating in the first place?

No idea, unfortunately. I don't have any logs from before the outage; I didn't get to it for a few hours, and the API server logs aren't kept for very long.

DJanyavula commented 2 months ago

I am also facing the same issue

2024-04-17 08:57:17.484 [INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

2024-04-17 08:57:17.596 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.27.0

2024-04-17 08:57:17.596 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
2024-04-17 08:57:17.597 [WARNING][1] cni-installer/<nil> <nil>: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-04-17 08:57:17.712 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=serviceaccounts "calico-cni-plugin" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"
2024-04-17 08:57:17.712 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=serviceaccounts "calico-cni-plugin" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"

caseydavenport commented 2 months ago

@DJanyavula please read the earlier comments - there are a number of diagnostics requested for anyone hitting this issue, and without those we can't help.

txbxxx commented 1 month ago

root@Tc-Server:/opt/calio# kubectl auth can-i create serviceaccounts/calico-cni-plugin -n calico-system --subresource token --as "system:serviceaccount:calico-system:calico-node"
no

I built the cluster with kind and use Calico as the CNI. This returns no. How do I solve it?

caseydavenport commented 1 month ago

kubectl auth can-i create serviceaccounts/calico-cni-plugin -n calico-system --subresource token --as "system:serviceaccount:calico-system:calico-node"

This command checks whether calico-node has the permissions it needs to create tokens for the CNI plugin. If it does not, it means the ClusterRole, ClusterRoleBinding, or ServiceAccount are not correctly configured or installed on the cluster.

For anyone experiencing this, please first verify that the Calico CNI plugin RBAC resources exist on the cluster:

kubectl get clusterrole calico-cni-plugin
kubectl get clusterrolebinding calico-cni-plugin
kubectl get serviceaccount -n calico-system calico-cni-plugin

If any of those do not exist, it means the tigera/operator is struggling to create them for some reason. Please look at the tigera-operator logs, and the output from kubectl describe tigerastatus calico to see what the issue might be.

If those resources do exist, the next thing to check is whether they are stuck in a Terminating state or not. You can do this by checking the deletionTimestamp field on each of them.

kubectl get clusterrole calico-cni-plugin -o yaml | grep deletionTimestamp
kubectl get clusterrolebinding calico-cni-plugin -o yaml | grep deletionTimestamp
kubectl get serviceaccount -n calico-system calico-cni-plugin -o yaml | grep deletionTimestamp

If you see a deletionTimestamp field on any of them, it means that the resource is being deleted but is stuck. In this case, you need to edit each resource to remove the cni-protector entry from the finalizers field to allow the resources to terminate and the tigera-operator to re-create them. Once that is done, you can restart calico/node and things should start to function.

This is a result of a bug that is often triggered when deleting / re-creating the Installation resource, or when deleting the RBAC resources by hand. The former has been fixed in v3.28, and the latter should simply be avoided. If you do need to delete these by hand without deleting the Installation, know that you will need to manually remove the finalizer and that they will be recreated by the operator.
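The stuck-in-Terminating check can also be expressed over the JSON form of a resource (for example, the output of `kubectl get ... -o json`). The Python sketch below is only an illustration under that assumption: the function names are hypothetical, the sample object is made up, and the finalizer string is the `tigera.io/cni-protector` value visible in the ClusterRole output earlier in this thread.

```python
def stuck_finalizers(resource):
    """Return the finalizers blocking deletion, or [] if the resource is not terminating."""
    meta = resource.get("metadata", {})
    if "deletionTimestamp" not in meta:
        return []
    return list(meta.get("finalizers", []))


def without_cni_protector(resource):
    """Return the resource's finalizers with the cni-protector entry removed."""
    finalizers = resource.get("metadata", {}).get("finalizers", [])
    return [f for f in finalizers if not f.endswith("cni-protector")]


# Made-up sample object (assumption) shaped like `kubectl get serviceaccount -o json` output.
sample = {
    "kind": "ServiceAccount",
    "metadata": {
        "name": "calico-cni-plugin",
        "namespace": "calico-system",
        "deletionTimestamp": "2024-01-01T00:00:00Z",
        "finalizers": ["tigera.io/cni-protector"],
    },
}

print(stuck_finalizers(sample))       # finalizers that keep the object alive
print(without_cni_protector(sample))  # the list to write back so deletion can finish
```

A resource with a deletionTimestamp and a non-empty finalizers list will stay in Terminating until every finalizer is removed.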

ajaypraj commented 4 weeks ago

Hi all, I am facing the same issue. I am trying to set up a cluster using Kubernetes 1.26 and kubeadm, but the Calico pods are not coming up due to a token creation failure. Here is the log:

{"log":"2024-06-05 12:38:15.352 [ERROR]. [1] cni-installer/\u003cnil\u003e \u003cnil\u003e: Unable to create token for CNI kubeconfig error=Post \"https://172.16.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-plugin/token\": dial tcp 172.16.0.1:443: i/o timeout\n","stream":"stderr","time":"2024-06-05T12:38:15.352838536Z"} 

{"log":"2024-06-05 12:38:15.352 [FATAL][1] cni-installer/\u003cnil\u003e \u003cnil\u003e: Unable to create token for CNI kubeconfig error=Post \"https://172.16.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-plugin/token\": dial tcp 172.16.0.1:443: i/o timeout\n","stream":"stderr","time":"2024-06-05T12:38:15.352956389Z"}

Expected behaviour

All pods should be up and running.

Current behaviour

After running kubeadm init, all pods are running except the CoreDNS pods. Calico CNI was installed using the manifest.

Checked the service account and found that the secret token is missing:

root@debian:~# kubectl get serviceaccount -n kube-system calico-node -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2024-06-04T11:30:25Z"
  name: calico-node
  namespace: kube-system
  resourceVersion: "139365"
  uid: 7c3bdfb9-1073-4006-ac8c-61d619ed0a91

Checked Calico CNI plugin RBAC resources exist on the cluster:

root@debian:~# kubectl get clusterrole calico-cni-plugin
NAME                CREATED AT
calico-cni-plugin   2024-06-04T11:30:25Z
root@debian:~# kubectl get clusterrolebinding calico-cni-plugin
NAME                ROLE                            AGE
calico-cni-plugin   ClusterRole/calico-cni-plugin   25h
root@debian:~# kubectl get serviceaccount -n kube-system calico-cni-plugin
NAME                SECRETS   AGE
calico-cni-plugin   0         25h

Checked that the resources are not stuck in a Terminating state:

root@debian:~# kubectl get serviceaccount -n kube-system calico-cni-plugin
NAME                SECRETS   AGE
calico-cni-plugin   0         25h
root@debian:~# kubectl get clusterrole calico-cni-plugin | grep deletionTimestamp
root@debian:~# kubectl get clusterrolebinding calico-cni-plugin | grep deletionTimestamp
root@debian:~# kubectl get serviceaccount -n kube-system calico-cni-plugin | grep deletionTimestamp

However, the can-i check for creating the service account token returns no:

kubectl auth can-i create serviceaccounts/calico-cni-plugin -n calico-system --subresource token --as "system:serviceaccount:calico-system:calico-node"
no

How to reproduce it:

kubeadm init cluster_config.yaml
kubectl apply -f calico.yaml

[here is link of file ]

Here is cluster_config.yaml file:

apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  criSocket: "/var/run/cri-dockerd.sock"

---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: "v1.26.15"
networking:
  podSubnet: "172.16.0.0/24,fde1::/64"
  serviceSubnet: "172.16.1.0/16,fde1::/112"
controllerManager:
  extraArgs:
    feature-gates: "IPv6DualStack=true"
apiServer:
  extraArgs:
    advertise-address: "172.16.2.1"
    tls-cipher-suites: "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"

---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
tlsCipherSuites:
  - "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
  - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
  - "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"

Environment details

I am running the cluster on a Debian 11 VM, which runs on VMware vCenter. Here are the details:

OS: Debian 11
Kubernetes Version: 1.26.15
kubeadm Version: 1.26.15
Docker Version: 26.1.3
Container Runtime: cri-dockerd 0.3.14

Steps tried: rm -rf /etc/cni/net.d/* && systemctl restart kubelet, but this did not resolve the issue.

Any troubleshooting or help is highly appreciated.

jon-nfc commented 4 weeks ago

@ajaypraj

{"log":"2024-06-05 12:38:15.352 [ERROR]. [1] cni-installer/\u003cnil\u003e \u003cnil\u003e: Unable to create token for CNI kubeconfig error=Post \"https://172.16.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-plugin/token\": dial tcp 172.16.0.1:443: i/o timeout\n","stream":"stderr","time":"2024-06-05T12:38:15.352838536Z"}

What address is 172.16.0.1:443 attached to? The error dial tcp 172.16.0.1:443: i/o timeout may be the issue, and separate from Calico.

Other commands that may assist in fault finding:

kubectl get endpoints -A
iptables -nvL (pay particular attention to the chains prefixed with KUBE-, e.g. KUBE-FORWARD)
curl -k https://<service ip for kubernetes.default >:443/api (should return JSON stating unauthorized access; this is good, as it shows API connectivity is OK)
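If curl is not available on the node, the same reachability question can be answered with a plain TCP connect. This is a minimal Python sketch using only the standard library; the function name `can_connect` and the 3-second default timeout are arbitrary choices for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, running `can_connect("172.16.0.1", 443)` on the affected node tests the address from the log above. A False result there would suggest the problem sits in routing, firewalling or service IP programming on the node rather than in token or RBAC handling.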

caseydavenport commented 4 weeks ago

What address is 172.16.0.1:443 attached to? The error dial tcp 172.16.0.1:443: i/o timeout may be the issue, and separate from Calico.

Yeah, I agree - @ajaypraj it sounds to me like you are encountering a different issue from this one. Could you raise a separate GitHub issue to track that?

ajaypraj commented 4 weeks ago

@caseydavenport @jon-nfc Thanks, and sure, I will open an issue if it does not get resolved.

acrjene commented 3 weeks ago

I fixed this problem by disabling the VPN.