@S1lverhead I'm not sure I completely understand what the guide at https://www.server-world.info/en/note?os=Fedora_36&p=kubernetes&f=1 says, so I can't be certain that there isn't an issue there with the cluster. But please, don't follow their steps for installing Calico, switch to the quick start guide once you have the cluster up.
As for installing Calico per https://projectcalico.docs.tigera.io/getting-started/kubernetes/quickstart, could you try that again in a fresh cluster using the latest version (v3.24)? It should have the fix to https://github.com/projectcalico/calico/issues/6087 so you don't need to edit the custom resources (though, if you need to, remember to add both master and control-plane tolerations, not just one of them; see the sketch after the commands below):
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.0/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.0/manifests/custom-resources.yaml
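For reference, a minimal sketch of what carrying both tolerations side by side could look like; where exactly they go depends on which manifest you are editing (e.g. the tigera-operator Deployment spec), so treat the placement as an assumption:

# Sketch: tolerate both control-plane taints, per the note above.
tolerations:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule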
Kubernetes v1.25.0
Calico v3.24.1
Ubuntu 22.04 LTS
What I did:
root@node1:~# kubectl -n kubevirt describe pods virt-operator-6fc7f6fdb9-4km55
Name: virt-operator-6fc7f6fdb9-4km55
Namespace: kubevirt
Priority: 1000000000
Priority Class Name: kubevirt-cluster-critical
Node: node2/192.168.72.51
Start Time: Wed, 14 Sep 2022 22:06:41 +0800
Labels: kubevirt.io=virt-operator
name=virt-operator
pod-template-hash=6fc7f6fdb9
prometheus.kubevirt.io=true
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/virt-operator-6fc7f6fdb9
Containers:
virt-operator:
Container ID:
Image: quay.io/kubevirt/virt-operator:v0.57.0
Image ID:
Ports: 8443/TCP, 8444/TCP
Host Ports: 0/TCP, 0/TCP
Command:
virt-operator
Args:
--port
8443
-v
2
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 10m
memory: 250Mi
Readiness: http-get https://:8443/metrics delay=5s timeout=10s period=10s #success=1 #failure=3
Environment:
OPERATOR_IMAGE: quay.io/kubevirt/virt-operator:v0.57.0
WATCH_NAMESPACE: (v1:metadata.annotations['olm.targetNamespaces'])
Mounts:
/etc/virt-operator/certificates from kubevirt-operator-certs (ro)
/profile-data from profile-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mnssq (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubevirt-operator-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubevirt-operator-certs
Optional: true
profile-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-mnssq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 76s (x517 over 114m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d8414711de17e75587d90d1dfbd470f5575830da43056576ac98e2ba37e93880": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
I created a snapshot of the VM nodes (the VMs run in a VMware vSphere environment), recovered the nodes from the snapshot, and then got this error from the pods.
I wonder if the snapshot is somehow still using old credentials when it's restored. That would result in an unauthorized message.
Maybe it is some timestamp issue? I have exactly the same issue with my nodes running in KVM after reverting a snapshot. When I terminate all the calico pods, everything works fine again once they have been recreated.
We are having a similar problem with OKE (Oracle Kubernetes Engine) clusters using flannel and installing Calico on top to have network policies.
After a few days (2-3), pods in the cluster can't be deleted or created due to this error: error getting ClusterInformation: connection is unauthorized: Unauthorized. If we recreate all the calico-node pods (see the sketch after the Kustomization below), everything starts working again.
This is our current installation (K8s v1.23.4 and Calico v3.24.3):
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Required to set network policies on top of Flannel.
  - https://raw.githubusercontent.com/projectcalico/calico/v3.24.3/manifests/calico-policy-only.yaml
patchesJson6902:
  # Oracle 8 needs these for network policies.
  - target:
      group: apps
      version: v1
      kind: DaemonSet
      name: calico-node
      namespace: kube-system
    patch: |-
      - op: add
        path: "/spec/template/spec/containers/0/env/-"
        value:
          name: FELIX_IPTABLESBACKEND
          value: NFT
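As for the restart workaround mentioned above, a minimal sketch of it (the kube-system namespace is an assumption based on a manifest install; the operator deploys calico-node into calico-system instead):

# Roll the calico-node DaemonSet so every pod is recreated.
kubectl -n kube-system rollout restart daemonset calico-node
kubectl -n kube-system rollout status daemonset calico-node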
I think we just found out how to fix the issue (at least in our use case)! It's been working for 4 days and it continues to be correct (before that, the errors described in this issue started appearing after 2-3 days).
TL;DR: Check that CALICO_MANAGE_CNI is not set to false (it is true by default).
After digging into the code, I realized that we need calico-node's --monitor-token to be running so the CNI plugin configuration (including the token) is kept up to date. If it is not running, the plugin will eventually load an expired/revoked token from the stale config files when it is executed (pod creation/deletion):
Some of the installations that Calico offers have this turned off (not sure if, after the addition of #5910, it should be on by default in all of them):
So, to enable it by overriding the env var value, I added this to the above Kustomization file:
- op: add
  path: "/spec/template/spec/containers/0/env/-"
  value:
    name: CALICO_MANAGE_CNI
    value: "true"
Oh GREAT! I'll check that out and report my results, too.
https://projectcalico.docs.tigera.io/reference/node/configuration shows the aforementioned variable under the "Manifest" tab. How would I do this in the operator? I am not using a Kustomization file (yet).
How would I do this in the operator?
@Jeansen sorry for the late response, but newer operator releases automatically enable this so there shouldn't be any extra configuration required.
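For anyone checking their setup: a quick way to confirm which operator release is running is to read the image tag off the Deployment (a sketch; the names match the default manifest install):

# Print the tigera-operator image, whose tag is the operator version.
kubectl -n tigera-operator get deployment tigera-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'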
In our cluster, CALICO_MANAGE_CNI is set to "true" and we are running the latest Calico version, v3.24.0. We are still facing the "connection is unauthorized: Unauthorized" issue.
We suspect the issue is that the k8s certificates were changed and Calico is not referring to the latest certificates. Can you please point us to any known fixes for this?
Another potential issue here is time synchronization across nodes; I've seen that elsewhere recently as well.
We suspect the issue is that the k8s certificates were changed and Calico is not referring to the latest certificates
Calico just uses the certs that Kubernetes gives us; there's no logic within Calico to do this. It's likely a Kubernetes misconfiguration.
Thanks for the response. As part of our test we are time-shifting the nodes (to simulate expiry of the certificates/token), after which we started observing that a few pods get the error below.
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "xxxxxxxxxxxxxxxxxxxxx": plugin type="multus" name="multus-cni-network" failed (add): [ingress-nginx/nginx-ingress-controller-5fdfc65d94-zwsbc:k8s-pod-network]: error adding container to network "k8s-pod-network": Unauthorized
After restarting the calico-node pod, the issue is resolved (the nginx pod is in Running state).
This behavior was not seen with the older Calico version, v3.23.0; we have only seen this issue after upgrading Calico to v3.24.0.
after which we started observing that a few pods get the error below.
Is this after shifting the time on the control plane node or on the worker node?
In v3.24.0, the way that Calico manages the CNI token used to authorize with the API server changed pretty drastically in order to resolve several problems we were seeing in newer versions of Kubernetes that handle token expiry and rotation differently. This was the main PR for that: https://github.com/projectcalico/calico/pull/5910
We are shifting the time on both the control plane (master nodes) and the worker nodes to make sure the certificates/token expire and get rotated.
Calico node requests the token using the TokenRequest API and writes it to "/host/etc/cni/net.d/calico-kubeconfig", which the CNI plugin uses to authenticate to the API server.
What if the API server token gets rotated? Will the calico pod request a new token proactively (since the token expiry time has not been reached)? And if the calico pod does request a new token, will it get one (since the API server tokens were rotated, the current calico pod token is not valid to request a new one)? Is there logic in calico-node to handle this? Please suggest how to resolve this issue.
If the calico pod requests a new token, will it get one (since the API server tokens were rotated, the current calico pod token is not valid to request a new one)?
There's no logic to re-query a token if, say, a request fails. Calico will periodically refresh the token, but if you change the server cert you will likely need to trigger a rolling update of Calico in order to get valid credentials.
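One way to check whether that periodic refresh is actually happening is to look at when calico-node last rewrote the CNI kubeconfig on an affected node (a sketch using the path mentioned earlier in this thread; a stale modification time would suggest the token refresher is not running):

# Print the modification time of the CNI kubeconfig on the node itself.
sudo stat -c '%y %n' /etc/cni/net.d/calico-kubeconfig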
In my case Calico was not able to identify the Ethernet interface properly: it was configured to detect eth interfaces, but on the machine they were named ens, so setting an interface regex helped it identify the interface and the associated IP correctly. https://www.unixcloudfusion.in/2022/02/solved-caliconode-is-not-ready-bird-is.html
Well, the link above seems to be a bit outdated (it references k8s 1.14) and the manifest changes are obsolete, too. Anyway, the hint about the interface regex seems to have done the trick so far. Here's how to do it with the Calico Installation CRD: https://docs.tigera.io/calico/latest/networking/ipam/ip-autodetection
For a quick reference, here is what I added to my Installation settings:
kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      interface: eth.*
The important part is the interface regex. My nodes run in KVM and have two interfaces: one is a bridge, the other a plain host network. And then there are some tunl and veth interfaces among the other cni and calico ones.
So, my current Installation looks like this:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      interface: eth.*
    ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
All I did was add:
nodeAddressAutodetectionV4:
  interface: eth.*
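To double-check which address Calico actually autodetected on each node, one option is to look at the node annotation calico-node writes (a sketch; <node-name> is a placeholder):

# The projectcalico.org/IPv4Address annotation records the detected address.
kubectl describe node <node-name> | grep projectcalico.org/IPv4Address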
Thanks for the updates, guys. I think we can mark this one as done now.
@caseydavenport Sorry to say, but it seems the issue is still present. I again got "network": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized" after reverting to an older snapshot, even though my time is synced and I applied the settings I posted yesterday. I was so confident ...
It happened again a few days later, after kubectl apply -f xxxx.yaml:
Type     Reason                  Age               From               Message
----     ------                  ----              ----               -------
Normal   Scheduled               79s               default-scheduler  Successfully assigned default/nginx-deploy-96c55d769-sdkvt to k8s-node2
Warning  FailedCreatePodSandBox  80s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0824863093e1d44bc9bb162a82e4300dcfe8283cd489070dea095be3817f290f": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
Normal   SandboxChanged          3s (x7 over 79s)  kubelet            Pod sandbox changed, it will be killed and re-created.
Hello. Same issue.
K8s version: v1.23.5
Calico node image: quay.io/calico/node:v3.20.3
Hi guys, I've also encountered this.
Kubernetes version: 1.23.17
Calico node image: docker.io/calico/node:v3.21.5
Temporary resolution: kubectl delete po -l k8s-app=calico-node -n kube-system; the pods then restart and the problem is resolved.
Maybe you guys should see this post: https://kb.vmware.com/s/article/95457
Restarting the calico pods resolved the issue. As suggested by @caseydavenport, it was using old creds, I guess.
Pods can't join/create the network because of an authorization problem. This is a follow-up to Calico issue #5712.
Expected Behavior
Pods start successfully.
Current Behavior
Pods report this error: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_coredns-6d4b75cb6d-vc6lr_kube-system_19f66e13-7c95-43cb-b1d6-9a0e1d7deb29_0(d62b74d28a2d5a333912418806a7b287f4fda4c67a9a2a1ccc9fe76c7bcfe889): error adding pod kube-system_coredns-6d4b75cb6d-vc6lr to CNI network "k8s-pod-network": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
Possible Solution
Steps to Reproduce (for bugs)
1) Install a one-node cluster using: https://www.server-world.info/en/note?os=Fedora_36&p=kubernetes&f=1
2) Their method of Calico installation triggers Calico bug #6087.
2a) Node untainted + custom-resources.yaml edited.
3) Install Calico using https://projectcalico.docs.tigera.io/getting-started/kubernetes/quickstart with the custom-resources.yaml edit from bug #6087.
4) Deploy a test pod: kubectl create deployment test-nginx --image=nginx
5) kubectl describe shows the error above.
6) Turn off the MacBook with the testing Fedora VM. Enjoy a good night's sleep. Turn on the laptop, start the Fedora VM.
7) The issue expanded from the test pod to the coredns pods.
8) I noticed the note in the quick start guide about namespaces and tried to redeploy Calico to the kube-system namespace using the command: kubectl create --namespace=kube-system -f tigera-operator.yaml
but it failed with errors similar to: "the namespace from the provided object "tigera-operator" does not match the namespace "kube-system". You must pass '--namespace=tigera-operator' to perform this operation." I'm not sure how to edit tigera-operator.yaml; the namespace seems to be hardcoded (see the note after this list).
9) I tried to create a container in the same namespace as Calico (tigera-operator); it failed with the same error.
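A note on step 8: the tigera-operator manifest ships its own Namespace object and hardcodes namespace: tigera-operator on its namespaced resources, which is why the --namespace override is rejected. Assuming a stock manifest, applying it without the override should work as-is:

# Apply without --namespace; the manifest defines its own namespace.
kubectl create -f tigera-operator.yaml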
Context
Your Environment
Fully updated Fedora 36 install on a MacBook using VMware Fusion.
Kubernetes:
[root@malanjan Kubernetes]# dnf info kubernetes
Name    : kubernetes
Version : 1.24.1
Release : 7.fc36
Calico: latest, installed using:
kubectl create -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml
kubectl create -f https://projectcalico.docs.tigera.io/manifests/custom-resources.yaml
[root@malanjan Kubernetes]# kubectl get nodes
NAME       STATUS   ROLES           AGE   VERSION
malanjan   Ready    control-plane   24h   v1.24.1
[root@malanjan Kubernetes]# kubectl get pods -A
NAMESPACE         NAME                               READY   STATUS              RESTARTS      AGE
default           test-nginx-7f4c594655-qzgmv        0/1     ContainerCreating   0             19h
kube-system       coredns-6d4b75cb6d-8w7f5           0/1     ContainerCreating   1             24h
kube-system       coredns-6d4b75cb6d-vc6lr           0/1     ContainerCreating   1             24h
kube-system       etcd-malanjan                      1/1     Running             7 (77m ago)   24h
kube-system       kube-apiserver-malanjan            1/1     Running             5 (77m ago)   24h
kube-system       kube-controller-manager-malanjan   1/1     Running             2             24h
kube-system       kube-proxy-7djp4                   1/1     Running             2             24h
kube-system       kube-scheduler-malanjan            1/1     Running             4             24h
tigera-operator   tigera-operator-6995cc5df5-gcmxf   1/1     Running             7 (76m ago)   19h
[root@malanjan Kubernetes]# kubectl describe -n kube-system pods/coredns-6d4b75cb6d-vc6lr
Name:                 coredns-6d4b75cb6d-vc6lr
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 malanjan/192.168.42.131
Start Time:           Thu, 11 Aug 2022 11:46:00 +0200
Labels:               k8s-app=kube-dns
                      pod-template-hash=6d4b75cb6d
Annotations:
Status: Running
IP:
IPs:
Controlled By: ReplicaSet/coredns-6d4b75cb6d
Containers:
coredns:
Container ID:
Image: k8s.gcr.io/coredns/coredns:v1.8.6
Image ID:
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: ContainerCreating
Last State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was deleted. The container used to be Running
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 1
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8pc5v (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
kube-api-access-8pc5v:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 4m13s (x264 over 63m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_coredns-6d4b75cb6d-vc6lr_kube-system_19f66e13-7c95-43cb-b1d6-9a0e1d7deb29_0(d62b74d28a2d5a333912418806a7b287f4fda4c67a9a2a1ccc9fe76c7bcfe889): error adding pod kube-system_coredns-6d4b75cb6d-vc6lr to CNI network "k8s-pod-network": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized