Closed: sysnet4admin closed this issue 1 year ago.
@sysnet4admin could you please fill in the template with more information about your issue?
@coutinhop Oh..? I am so sorry, I didn't mean to upload it without any comment. (Did my cat push the button? Something like that? Anyhow, OMG.) I have now updated the issue with everything I know so far. The trigger and reproduction procedure are not clear yet, so I will clarify the reproduction steps as soon as possible.
Thank you for letting me know about the empty issue that I uploaded.
[ Pods in all namespaces ]
[root@m-k8s ~]# kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default new-nginx-d8b84d87b-jpzr9 1/1 Running 0 21h
default new-nginx-d8b84d87b-r245z 1/1 Running 0 21h
default new-nginx-d8b84d87b-xjc8k 1/1 Running 0 21h
default nfs-client-provisioner-7596fb9c9c-jvmnm 1/1 Running 1 (28h ago) 2d22h
default synthetic-load-generator-554f846686-fxgms 1/1 Running 0 3h45m
example-hotrod example-hotrod-6c5d878866-bbt7l 1/1 Running 0 4h47m
ingress-nginx ingress-nginx-admission-create-bqvnp 0/1 Completed 0 5h3m
ingress-nginx ingress-nginx-admission-patch-sdjbr 0/1 Completed 1 5h3m
ingress-nginx ingress-nginx-controller-64f79ddbcc-7wltw 1/1 Running 0 5h1m
kube-system calico-kube-controllers-57b57c56f-96j5s 1/1 Running 0 3d3h
kube-system calico-node-79rvm 1/1 Running 0 25h
kube-system calico-node-bc54v 1/1 Running 0 25h
kube-system calico-node-xx5c4 1/1 Running 0 25h
kube-system calico-node-zlk6h 1/1 Running 0 25h
kube-system coredns-787d4945fb-n5z6g 1/1 Running 0 3d3h
kube-system coredns-787d4945fb-q6zj8 1/1 Running 0 3d3h
kube-system etcd-m-k8s 1/1 Running 0 3d3h
kube-system kube-apiserver-m-k8s 1/1 Running 0 3d3h
kube-system kube-controller-manager-m-k8s 1/1 Running 0 3d3h
kube-system kube-proxy-6wrc9 1/1 Running 0 3d3h
kube-system kube-proxy-drtcr 1/1 Running 1 (27h ago) 3d1h
kube-system kube-proxy-hmp89 1/1 Running 0 3d3h
kube-system kube-proxy-hnxrh 1/1 Running 0 3d3h
kube-system kube-scheduler-m-k8s 1/1 Running 0 3d3h
kube-system metrics-server-7948965fbb-56tct 1/1 Running 0 27h
metallb-system controller-577b5bdfcc-tj6nq 1/1 Running 0 27h
metallb-system speaker-8szsl 1/1 Running 0 3d3h
metallb-system speaker-j4hsp 1/1 Running 0 3d3h
metallb-system speaker-pm9jj 1/1 Running 0 3d3h
metallb-system speaker-rg9wk 1/1 Running 2 (27h ago) 3d1h
monitoring grafana-5d9c96fc4c-x4sm8 0/1 Terminating 0 3d1h
monitoring jaeger-5dc997d86c-trhnb 1/1 Running 0 4h25m
monitoring prometheus-kube-state-metrics-5f69cf9d49-tr24p 0/1 Terminating 0 3d1h
monitoring tempo-0 2/2 Running 0 3h45m
[ Describe output for the terminating pod ]
[root@m-k8s ~]# k describe po -n monitoring prometheus-kube-state-metrics-5f69cf9d49-tr24p
Name: prometheus-kube-state-metrics-5f69cf9d49-tr24p
Namespace: monitoring
Priority: 0
Service Account: prometheus-kube-state-metrics
Node: w2-k8s/192.168.1.102
Start Time: Sat, 21 Jan 2023 13:35:29 +0900
Labels: app.kubernetes.io/component=metrics
app.kubernetes.io/instance=prometheus
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kube-state-metrics
app.kubernetes.io/part-of=kube-state-metrics
app.kubernetes.io/version=2.4.1
helm.sh/chart=kube-state-metrics-4.7.0
pod-template-hash=5f69cf9d49
Annotations: cni.projectcalico.org/containerID: 00a3ada685842372182116b86e1dd5de8ceb0eca50ba212dd8e4c046d95fa193
cni.projectcalico.org/podIP: 172.16.103.129/32
cni.projectcalico.org/podIPs: 172.16.103.129/32
Status: Terminating (lasts 10m)
Termination Grace Period: 30s
IP: 172.16.103.129
IPs:
IP: 172.16.103.129
Controlled By: ReplicaSet/prometheus-kube-state-metrics-5f69cf9d49
Containers:
kube-state-metrics:
Container ID: containerd://73384c61d39eadaa7a20794631b1b7c31ff268b46cb12e360917939000a781c4
Image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.1
Image ID: k8s.gcr.io/kube-state-metrics/kube-state-metrics@sha256:69a18fa1e0d0c9f972a64e69ca13b65451b8c5e79ae8dccf3a77968be4a301df
Port: 8080/TCP
Host Port: 0/TCP
Args:
--port=8080
--resources=certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
--telemetry-port=8081
State: Terminated
Reason: Error
Exit Code: 2
Started: Sat, 21 Jan 2023 13:35:39 +0900
Finished: Tue, 24 Jan 2023 14:42:56 +0900
Ready: False
Restart Count: 0
Liveness: http-get http://:8080/healthz delay=5s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:8080/ delay=5s timeout=5s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5d4nv (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-5d4nv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 11m kubelet Stopping container kube-state-metrics
Warning FailedKillPod 85s (x50 over 11m) kubelet error killing pod: failed to "KillPodSandbox" for "8f47d16a-70e9-49cc-be1c-4bf5911cdc57" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"00a3ada685842372182116b86e1dd5de8ceb0eca50ba212dd8e4c046d95fa193\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized"
Attachment: calico-7220-cluster-info.dump.zip
Also, the workaround is effective:
[root@m-k8s ~]# kubectl rollout restart ds -n kube-system calico-node
daemonset.apps/calico-node restarted
[root@m-k8s ~]# k get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default new-nginx-d8b84d87b-jpzr9 1/1 Running 0 21h
default new-nginx-d8b84d87b-r245z 1/1 Running 0 21h
default new-nginx-d8b84d87b-xjc8k 1/1 Running 0 21h
default nfs-client-provisioner-7596fb9c9c-jvmnm 1/1 Running 1 (29h ago) 2d22h
default synthetic-load-generator-554f846686-fxgms 1/1 Running 0 4h
example-hotrod example-hotrod-6c5d878866-bbt7l 1/1 Running 0 5h2m
ingress-nginx ingress-nginx-admission-create-bqvnp 0/1 Completed 0 5h18m
ingress-nginx ingress-nginx-admission-patch-sdjbr 0/1 Completed 1 5h18m
ingress-nginx ingress-nginx-controller-64f79ddbcc-7wltw 1/1 Running 0 5h16m
kube-system calico-kube-controllers-57b57c56f-96j5s 1/1 Running 0 3d3h
kube-system calico-node-fpmtb 1/1 Running 0 77s
kube-system calico-node-gmksz 1/1 Running 0 66s
kube-system calico-node-hzk7k 1/1 Running 0 45s
kube-system calico-node-zqd24 1/1 Running 0 56s
kube-system coredns-787d4945fb-n5z6g 1/1 Running 0 3d3h
kube-system coredns-787d4945fb-q6zj8 1/1 Running 0 3d3h
kube-system etcd-m-k8s 1/1 Running 0 3d3h
kube-system kube-apiserver-m-k8s 1/1 Running 0 3d3h
kube-system kube-controller-manager-m-k8s 1/1 Running 0 3d3h
kube-system kube-proxy-6wrc9 1/1 Running 0 3d3h
kube-system kube-proxy-drtcr 1/1 Running 1 (27h ago) 3d2h
kube-system kube-proxy-hmp89 1/1 Running 0 3d3h
kube-system kube-proxy-hnxrh 1/1 Running 0 3d3h
kube-system kube-scheduler-m-k8s 1/1 Running 0 3d3h
kube-system metrics-server-7948965fbb-56tct 1/1 Running 0 28h
metallb-system controller-577b5bdfcc-tj6nq 1/1 Running 0 28h
metallb-system speaker-8szsl 1/1 Running 0 3d3h
metallb-system speaker-j4hsp 1/1 Running 0 3d3h
metallb-system speaker-pm9jj 1/1 Running 0 3d3h
metallb-system speaker-rg9wk 1/1 Running 2 (27h ago) 3d2h
monitoring jaeger-5dc997d86c-trhnb 1/1 Running 0 4h40m
monitoring tempo-0 2/2 Running 0 4h
Same behaviour:
8m32s Warning FailedCreatePodSandBox pod/hello-27927411-gk5nf (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c9cf89858e821ef4eb9502deb09725cf8e88be7675d9861fa1a2d25cc03a596f": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
Cluster info:
k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8sc1 Ready control-plane,etcd,master 88d v1.24.6+rke2r1 192.168.88.87 <none> Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.8-k3s1
k8sc2 Ready <none> 88d v1.24.6+rke2r1 192.168.88.88 <none> Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.8-k3s1
k8sc3 Ready <none> 88d v1.24.6+rke2r1 192.168.88.89 <none> Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.8-k3s1
k8sc4 Ready <none> 82d v1.24.6+rke2r1 192.168.88.90 <none> Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.8-k3s1
k8sc5 Ready <none> 71d v1.24.6+rke2r1 192.168.88.91 <none> Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.8-k3s1
k8sc6 Ready <none> 88d v1.24.6+rke2r1 192.168.88.92 <none> Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.8-k3s1
k8sc7 Ready <none> 88d v1.24.6+rke2r1 192.168.88.93 <none> Ubuntu 22.04.1 LTS 5.15.0-58-generic containerd://1.6.8-k3s1
FYI
k8s v1.26.1 + calico_v3.24.5 = reproduced
k8s v1.26.1 + calico_v3.25.0 = reproduced
k8s v1.25.6 + calico_v3.24.5 = reproduced
k8s v1.25.6 + calico_v3.25.0 = reproduced
k8s v1.26.1 + calico_v3.17.1 = NOT reproduced (i.e. the issue does not occur with this version)
service account token has been invalidated
Could there be something else in your cluster invalidating the tokens somehow?
Facing the same issue.
We refrain from using the workaround, so are there any updates on how to get rid of this issue? How can we tackle the service account policy changes in Kubernetes v1.26 mentioned in this issue's description?
I'm using - k8s v1.26.1 + calico_v3.25.0 + containerd 1.6.18
I have the same problem. Even if you think it's fine on the master, the problem may still occur on another node or worker.
@caseydavenport My lab is not running anymore due to limited resources, so I will set it up again and recheck within 1-2 weeks, and let you know if there are any invalidated tokens or evidence of it.
I am seeing the same issue, after storage was extended on the device. I am on Kubespray, k8s v1.24.6, calico v3.25.0, containerd v1.7.0. I re-executed the Ansible playbook; it did not help, even after restarting NetworkManager, containerd, and kubelet.
Also facing this issue with Canal. It is causing a lot of headaches in my production cluster. Any ideas on how to fix this?
EDIT: As suggested before, kubectl rollout restart ds -n kube-system canal
seemed to have fixed this for me. However, when I rebooted I had RBAC issues; the calico cluster role didn't have:
- apiGroups: [""]
  resources:
  - serviceaccounts/token
  verbs:
  - create
I will see after a few days whether the issue persists.
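If that rule is missing, it can be appended with a JSON patch along these lines (a sketch, not an official fix; the cluster role name is assumed to be calico-node and may differ in RKE/Canal installs):

kubectl patch clusterrole calico-node --type=json \
  -p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["serviceaccounts/token"],"verbs":["create"]}}]'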
seemed to have fixed this for me, however when I rebooted I had RBAC issues; the calico cluster role didn't have:
Was this missing from the manifest in our docs? Or just the manifest in your cluster? Make sure when upgrading that you are pulling the latest manifest from our release.
Sorry for not being more clear, I use Rancher RKE to bootstrap my cluster and it seems they didn't have the latest manifests.
Facing the same issue using k8s v1.27.0 & calico v3.25.1.
Installed Calico using the Calico manifest. kubectl rollout restart ds -n calico-system calico-node temporarily fixes the issue. I also verified that the calico-node cluster role has create perms for serviceaccounts/token.
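For what it's worth, that permission can be checked end-to-end by impersonating the service account (a sketch; the calico-system namespace is assumed from the manifest install above, and note that can-i in this form does not evaluate resourceNames restrictions):

kubectl auth can-i create serviceaccounts --subresource=token \
  --as=system:serviceaccount:calico-system:calico-node -n calico-system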
I found one ambiguity that can make this confusing. TokenWatcher ignores the KUBECONFIG variable and assumes that the location of the kubeconfig is /host/etc/cni/net.d/calico-kubeconfig (hardcoded).
I rebuilt calico 3.25.1 with a very low token TTL to understand what the flow is, and I think the way this works at the moment when KUBECONFIG is set is quite confusing.
The way calico-kubeconfig is handled in this case is the following:
- install-cni starts up and builds a clientset with the in-cluster service account token to lay down a calico-kubeconfig (with a token with a 24h TTL); if a calico-kubeconfig already exists, it tries to use that, but if the JWT is expired, it uses the in-cluster config and replaces it (which means it doesn't fail); if the existing token is valid for a long time, I think it won't rotate it (for example if you do a rollout restart in quick succession)
- calico-node starts after install-cni finishes; if the KUBECONFIG env variable is set, it will use that for Kubernetes access
- after less than 24h, calico-node attempts to rotate the token using the KUBECONFIG clientset; it succeeds, but the problem is that the clientset in use by Calico in token_watch will never reload it after it was rotated by itself; an easy fix would be to rebuild the clientset in the token watch loop here, or change the clientset build here to ignore the kubeconfig variable
- calico-node eventually fails with unauthorized errors: first in token renewal (Calico will still work), and after calico-kubeconfig expires, everything stops working; in my opinion this should be a hard failure, since calico-node is broken at this point and restarting is the only "fix"
Also noticed a "hard" failure in install-cni if KUBECONFIG is set and the file doesn't exist (such as when creating a new cluster): install-cni will rotate it if it's an empty file or if the JWT is expired, but it will fail hard if the file doesn't exist. I think it should just create the file.
I think there are some strange interactions between manually setting this variable and how #7106 would work in a future release.
Can someone with more context comment on what's the expected flow here?
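For anyone who wants to check this on their own nodes, the expiry of the token currently baked into the node-local kubeconfig can be inspected like this (a sketch; run on the node itself, where the host-side path is /etc/cni/net.d/calico-kubeconfig, and jq is assumed to be installed):

# extract the JWT from the CNI kubeconfig and decode its payload
TOKEN=$(awk '/token:/ {print $2}' /etc/cni/net.d/calico-kubeconfig | tr -d '"')
PAYLOAD=$(echo "$TOKEN" | cut -d. -f2)
# restore base64url padding so base64 -d accepts the payload
while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
echo "$PAYLOAD" | tr '_-' '/+' | base64 -d | jq '{iat: (.iat | todate), exp: (.exp | todate)}'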
Facing the same issue using k8s v1.27.0 & calico v3.25.1. Installed Calico using the Calico manifest. kubectl rollout restart ds -n calico-system calico-node temporarily fixes the issue. I also verified that the calico-node cluster role has create perms for serviceaccounts/token.
This workaround didn't help me, but deleting the files from the folder /etc/cni/net.d/* worked for me.
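For reference, that recovery amounts to something like the following (a sketch; deleting the CNI config is disruptive to the node, and it assumes install-cni regenerates the files when the calico-node pods restart):

# on each affected node
rm /etc/cni/net.d/*
# then restart calico-node so install-cni lays the config down again
kubectl rollout restart ds -n calico-system calico-node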
FYI k8s v1.27.2 + calico_v3.26.0 = Looking good after AGE 42H
[root@cp-k8s ~]# k get node
NAME STATUS ROLES AGE VERSION
cp-k8s Ready control-plane 42h v1.27.2
w1-k8s Ready <none> 42h v1.27.2
w2-k8s Ready <none> 42h v1.27.2
w3-k8s Ready <none> 42h v1.27.2
[root@cp-k8s ~]# k get po,svc
NAME READY STATUS RESTARTS AGE
pod/deploy-nginx-66df7dc8d9-8r545 1/1 Running 0 42h
pod/deploy-nginx-66df7dc8d9-bc9f6 1/1 Running 0 42h
pod/deploy-nginx-66df7dc8d9-cqfj6 1/1 Running 0 42h
pod/deploy-nginx-66df7dc8d9-fkf99 1/1 Running 0 42h
pod/deploy-nginx-66df7dc8d9-mrrl6 1/1 Running 0 42h
pod/deploy-nginx-66df7dc8d9-q6zgn 1/1 Running 0 42h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/deploy-nginx LoadBalancer 10.101.73.62 192.168.1.11 80:31560/TCP 42h
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 42h
Last check
k8s v1.27.2 + calico_v3.26.0 = Looking good after AGE 5D
[root@cp-k8s ~]# k get po,svc
NAME READY STATUS RESTARTS AGE
pod/deploy-nginx-66df7dc8d9-6cd7l 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-77fnk 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-95mck 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-9fkzn 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-hnbh2 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-kh66b 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-q989q 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-qtvkq 1/1 Running 0 5d17h
pod/deploy-nginx-66df7dc8d9-xnvd8 1/1 Running 0 5d17h
pod/nfs-client-provisioner-597dbc5f74-7hw67 1/1 Running 0 5d18h
[root@cp-k8s ~]# k get ds -n kube-system -o yaml | grep -i image:
image: docker.io/calico/node:v3.26.0
image: docker.io/calico/cni:v3.26.0
image: docker.io/calico/cni:v3.26.0
image: docker.io/calico/node:v3.26.0
image: registry.k8s.io/kube-proxy:v1.27.2
For all who are still struggling with this issue: take a look at the logs of your calico-node pod. I had the same problem and found out that the ServiceAccount "calico-node" was not permitted to create a "serviceaccounts/token" resource because it was restricted to the resource name "calico-cni-plugin". I removed the restriction to "calico-cni-plugin" and it works now.
I removed the restriction to "calico-cni-plugin" and it works now.
Would you care to explain this and the steps, please?
"calico-node" was not permitted to create a "serviceaccounts/token" resource because it was restricted to the resource name "calico-cni-plugin". I removed the restriction to "calico-cni-plugin" and it works now.
As of Calico v3.26, the calico-node service account should not have permission to create any service account tokens except for the calico-cni-plugin token. This is done intentionally, so I'm curious if you could share the logs that clued you in to this change.
As of Calico v3.26, the calico-node service account should not have permission to create any service account tokens except for the calico-cni-plugin token. This is done intentionally, so I'm curious if you could share the logs that clued you in to this change.
A little bit of background information: at my company we are using IBM Cloud and their Kubernetes cluster, which we updated 13 days ago. Yesterday we noticed that CRUD operations on any pod are failing. Regarding cluster roles, only one - in fact calico-cni-plugin - was added. I don't know if IBM Cloud created this cluster role automatically or one of my admins did; our admin has not been able to answer this question yet.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-cni-plugin
  uid: bf8ebe3c-f491-4e63-bdb9-1801826917e5
  resourceVersion: '109270438'
  creationTimestamp: '2023-06-28T22:52:52Z'
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"calico-cni-plugin"},"rules":[{"apiGroups":[""],"resources":["pods","nodes","namespaces"],"verbs":["get"]},{"apiGroups":[""],"resources":["pods/status"],"verbs":["patch"]},{"apiGroups":["crd.projectcalico.org"],"resources":["blockaffinities","ipamblocks","ipamhandles","clusterinformations","ippools","ipreservations","ipamconfigs"],"verbs":["get","list","create","update","delete"]}]}
  managedFields:
    - manager: kubectl-client-side-apply
      operation: Update
      apiVersion: rbac.authorization.k8s.io/v1
      time: '2023-06-28T22:52:52Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
    - manager: dashboard
      operation: Update
      apiVersion: rbac.authorization.k8s.io/v1
      time: '2023-07-12T17:12:23Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:rules: {}
rules:
  - verbs:
      - get
    apiGroups:
      - ''
    resources:
      - pods
      - nodes
      - namespaces
  - verbs:
      - patch
    apiGroups:
      - ''
    resources:
      - pods/status
  - verbs:
      - get
      - list
      - create
      - update
      - delete
    apiGroups:
      - crd.projectcalico.org
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
      - clusterinformations
      - ippools
      - ipreservations
      - ipamconfigs
Version:
Client Version: v3.26.1
Git commit: b1d192c95
Cluster Version: v3.25.1
Cluster Type: typha,kdd,k8s,bgp
Is the role calico-cni-plugin supposed to be allowed to create serviceaccount tokens?
What specific log do you want to take a look at?
Best regards and thank you for your help!
EDIT: unfortunately the logs of calico-node have been overwritten. But I can remember that it showed something like "service account 'calico-node:kube-system' has no permission to obtain a token".
EDIT2: just for a test I added the resourceName 'calico-cni-plugin' again to the service account token creation rule for the 'calico-node' cluster role, and it seems not to work. The calico-node pod's log:
2023-07-12T17:47:10.338Z | 2023-07-12 17:47:10.338 [ERROR][56] cni-config-monitor/token_watch.go 106: Unable to create token for CNI kubeconfig error=serviceaccounts "calico-node" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"
2023-07-12T17:47:10.338Z | 2023-07-12 17:47:10.338 [ERROR][56] cni-config-monitor/token_watch.go 130: Failed to update CNI token, retrying... error=serviceaccounts "calico-node" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"
2023-07-12T17:47:11.456Z | 2023-07-12 17:47:11.456 [INFO][30387] felix/summary.go 100: Summarising 26 dataplane reconciliation loops over 1m9s: avg=16ms longest=107ms ()
2023-07-12T17:47:16.986Z | 2023-07-12 17:47:16.986 [ERROR][56] cni-config-monitor/token_watch.go 106: Unable to create token for CNI kubeconfig error=serviceaccounts "calico-node" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"
2023-07-12T17:47:16.986Z | 2023-07-12 17:47:16.986 [ERROR][56] cni-config-monitor/token_watch.go 130: Failed to update CNI token, retrying... error=serviceaccounts "calico-node" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"
2023-07-12T17:47:22.143Z | 2023-07-12 17:47:22.142 [ERROR][56] cni-config-monitor/token_watch.go 106: Unable to create token for CNI kubeconfig error=serviceaccounts "calico-node" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"
2023-07-12T17:47:22.143Z | 2023-07-12 17:47:22.142 [ERROR][56] cni-config-monitor/token_watch.go 130: Failed to update CNI token, retrying... error=serviceaccounts "calico-node" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system"
Is the role calico-cni-plugin supposed to be allowed to create serviceaccount tokens?
Nope, the calico-cni-plugin service account should not be able to make tokens. However, calico-node should be allowed to create tokens for calico-cni-plugin.
Cluster Version: v3.25.1
This is interesting - it sounds like you're running with the code from Calico v3.25, but the RBAC from Calico v3.26, which would result in the problems you're seeing. The v3.25 code expects to have this RBAC:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # Used for creating service account tokens to be used by the CNI plugin
  - apiGroups: [""]
    resources:
      - serviceaccounts/token
    resourceNames:
      - calico-node
    verbs:
      - create
Whereas v3.26 expects this:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # Used for creating service account tokens to be used by the CNI plugin
  - apiGroups: [""]
    resources:
      - serviceaccounts/token
    resourceNames:
      - calico-cni-plugin
    verbs:
      - create
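A quick way to see which of the two variants a cluster actually has (a sketch; assumes a manifest-based install where the role is named calico-node):

kubectl get clusterrole calico-node -o yaml | grep -B2 -A6 'serviceaccounts/token'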
We were able to fix this problem now. Our master nodes had an incorrect version, which ruined everything. An update of our master nodes was fortunately the solution, without any hacky workarounds. But thanks for the help - I appreciate it!
After some period, pods cannot be created or deleted, failing with this message.
It seems to be related to the service account policy change in Kubernetes v1.26.0:
https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#manual-secret-management-for-serviceaccounts
Here is the workaround: re-read the calico-node information by restarting or deleting it.
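Concretely, the workaround looks like this (a sketch; assumes the manifest-based install in kube-system, where the daemonset pods carry the k8s-app=calico-node label):

kubectl rollout restart ds -n kube-system calico-node
# or delete the pods and let the daemonset recreate them
kubectl delete po -n kube-system -l k8s-app=calico-node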
Expected Behavior
kubectl create or delete is working fine.
Current Behavior
It won't work properly.
Possible Solution
Workaround: restart the daemonset or delete the pod.
OR
Possible solution: create a long-lived secret token for the service account instead, and use this secret with the service account for calico-node (related to #5712 and #6421).
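A minimal sketch of that long-lived token idea, following the Kubernetes docs linked above (the secret name is illustrative, and this uses the legacy service-account-token mechanism rather than anything Calico ships):

kubectl apply -n kube-system -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: calico-node-long-lived-token  # illustrative name
  annotations:
    kubernetes.io/service-account.name: calico-node
type: kubernetes.io/service-account-token
EOF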
Steps to Reproduce (for bugs)
Context
The code from #6218 (node/pkg/cni/token_watch.go) is already applied.
So I decoded the JWT applied on the calico-node, and it confirmed the 1 year (365d) lifetime properly.
JWT
Decoded JWT's Payload
Thus this issue involves slightly different logic for verifying authorization from Kubernetes.
/var/log/message from all nodes, like below, when it happened.
[control-plane node]
[worker node]
Your Environment