@S1lverhead I'm not sure I completely understand what the guide at https://www.server-world.info/en/note?os=Fedora_36&p=kubernetes&f=1 says, so I can't be certain that there isn't an issue there with the cluster. But please, don't follow their steps for installing Calico, switch to the quick start guide once you have the cluster up.
As for installing Calico per https://projectcalico.docs.tigera.io/getting-started/kubernetes/quickstart, could you try that again in a fresh cluster using the latest version (v3.24)? It should have the fix to https://github.com/projectcalico/calico/issues/6087 so you don't need to edit the custom resources (though, if you need to, remember to add both master and control-plane tolerations, not just one of them; see the sketch after the commands below):
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.0/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.0/manifests/custom-resources.yaml
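For reference, a minimal sketch of what carrying both tolerations side by side could look like; where exactly they go depends on which manifest you are editing (e.g. the tigera-operator Deployment spec), so treat the placement as an assumption:

# Sketch: tolerate both control-plane taints, per the note above.
tolerations:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule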
Kubernetes v1.25.0
Calico v3.24.1
Ubuntu 22.04 LTS
What I did:
root@node1:~# kubectl -n kubevirt describe pods virt-operator-6fc7f6fdb9-4km55
Name: virt-operator-6fc7f6fdb9-4km55
Namespace: kubevirt
Priority: 1000000000
Priority Class Name: kubevirt-cluster-critical
Node: node2/192.168.72.51
Start Time: Wed, 14 Sep 2022 22:06:41 +0800
Labels: kubevirt.io=virt-operator
name=virt-operator
pod-template-hash=6fc7f6fdb9
prometheus.kubevirt.io=true
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/virt-operator-6fc7f6fdb9
Containers:
virt-operator:
Container ID:
Image: quay.io/kubevirt/virt-operator:v0.57.0
Image ID:
Ports: 8443/TCP, 8444/TCP
Host Ports: 0/TCP, 0/TCP
Command:
virt-operator
Args:
--port
8443
-v
2
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 10m
memory: 250Mi
Readiness: http-get https://:8443/metrics delay=5s timeout=10s period=10s #success=1 #failure=3
Environment:
OPERATOR_IMAGE: quay.io/kubevirt/virt-operator:v0.57.0
WATCH_NAMESPACE: (v1:metadata.annotations['olm.targetNamespaces'])
Mounts:
/etc/virt-operator/certificates from kubevirt-operator-certs (ro)
/profile-data from profile-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mnssq (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubevirt-operator-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubevirt-operator-certs
Optional: true
profile-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-mnssq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 76s (x517 over 114m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d8414711de17e75587d90d1dfbd470f5575830da43056576ac98e2ba37e93880": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
I created a snapshot of the VM nodes (the VMs run in a VMware vSphere environment), recovered the nodes from the snapshot, and then got this error from the pods.
I wonder if the snapshot is somehow still using old credentials when it's restored. That would result in an unauthorized message.
Maybe it is some timestamp issue? I have exactly the same issue with my nodes running in KVM after reverting a snapshot. When I terminate all the calico pods, everything works fine again once they have been recreated.
We are having a similar problem with OKE (Oracle Kubernetes Engine) clusters using flannel and installing Calico on top to have network policies.
After a few days (2-3), pods in the cluster can't be deleted or created due to this error: error getting ClusterInformation: connection is unauthorized: Unauthorized. If we recreate all the calico-node pods (see the sketch after the Kustomization below), everything starts working again.
This is our current installation (K8s v1.23.4 and Calico v3.24.3):
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Required to set network policies on top of Flannel.
  - https://raw.githubusercontent.com/projectcalico/calico/v3.24.3/manifests/calico-policy-only.yaml
patchesJson6902:
  # Oracle 8 needs these for network policies.
  - target:
      group: apps
      version: v1
      kind: DaemonSet
      name: calico-node
      namespace: kube-system
    patch: |-
      - op: add
        path: "/spec/template/spec/containers/0/env/-"
        value:
          name: FELIX_IPTABLESBACKEND
          value: NFT
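As for the restart workaround mentioned above, a minimal sketch of it (the kube-system namespace is an assumption based on a manifest install; the operator deploys calico-node into calico-system instead):

# Roll the calico-node DaemonSet so every pod is recreated.
kubectl -n kube-system rollout restart daemonset calico-node
kubectl -n kube-system rollout status daemonset calico-node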
I think we just found out how to fix the issue (at least in our use case)! It's been working for 4 days and it continues to be correct (before that, the errors described in this issue started appearing after 2-3 days).
TL;DR: Check that CALICO_MANAGE_CNI is not set to false (it is true by default).
After digging into the code, I realized that we need calico-node's --monitor-token to be running so the CNI plugin configuration (including the token) is kept up to date. If it is not running, the plugin will eventually load an expired/revoked token from the stale config files when it is executed (pod creation/deletion):
Some of the installations that Calico offers have this turned off (not sure if, after the addition of #5910, it should be on by default in all of them):
So, to enable it by overriding the env var value, I added this to the above Kustomization file:
- op: add
  path: "/spec/template/spec/containers/0/env/-"
  value:
    name: CALICO_MANAGE_CNI
    value: "true"
Oh GREAT! I'll check that out and report my results, too.
https://projectcalico.docs.tigera.io/reference/node/configuration shows the aforementioned variable under the "Manifest" tab. How would I do this in the operator? I am not using a Kustomization file (yet).
How would I do this in the operator?
@Jeansen sorry for the late response, but newer operator releases automatically enable this so there shouldn't be any extra configuration required.
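For anyone checking their setup: a quick way to confirm which operator release is running is to read the image tag off the Deployment (a sketch; the names match the default manifest install):

# Print the tigera-operator image, whose tag is the operator version.
kubectl -n tigera-operator get deployment tigera-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'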
In our cluster, CALICO_MANAGE_CNI is set to "true" and we are running the latest Calico version, v3.24.0. We are still facing the "connection is unauthorized: Unauthorized" issue.
We suspect the issue is that the k8s certificates were changed and Calico is not referring to the latest certificates. Can you please point us to any known fixes for this?
Another potential issue here is time synchronization across nodes; I've seen that elsewhere recently as well.
We suspect the issue is that the k8s certificates were changed and Calico is not referring to the latest certificates
Calico just uses the certs that Kubernetes gives us; there's no logic within Calico to do this. It's likely a Kubernetes misconfiguration.
Thanks for the response. As part of our test we are time-shifting the nodes (to simulate expiry of the certificates/token), after which we started observing that a few pods get the error below.
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "xxxxxxxxxxxxxxxxxxxxx": plugin type="multus" name="multus-cni-network" failed (add): [ingress-nginx/nginx-ingress-controller-5fdfc65d94-zwsbc:k8s-pod-network]: error adding container to network "k8s-pod-network": Unauthorized
After restarting the calico-node pod, the issue is resolved (the nginx pod is in Running state).
This behavior was not seen with the older Calico version, v3.23.0; we have only seen this issue after upgrading Calico to v3.24.0.
after which we started observing that a few pods get the error below.
Is this after shifting the time on the control plane node or on the worker node?
In v3.24.0, the way that Calico manages the CNI token used to authorize with the API server changed pretty drastically in order to resolve several problems we were seeing in newer versions of Kubernetes that handle token expiry and rotation differently. This was the main PR for that: https://github.com/projectcalico/calico/pull/5910
We are shifting the time on both the control plane (master nodes) and the worker nodes to make sure the certificates/token expire and get rotated.
Calico node requests the token using the TokenRequest API and writes it to "/host/etc/cni/net.d/calico-kubeconfig", which the CNI plugin uses to authenticate to the API server.
What if the API server token gets rotated? Will the calico pod request a new token proactively (since the token expiry time has not been reached)? And if the calico pod does request a new token, will it get one (since the API server tokens were rotated, the current calico pod token is not valid to request a new one)? Is there logic in calico-node to handle this? Please suggest how to resolve this issue.
If the calico pod requests a new token, will it get one (since the API server tokens were rotated, the current calico pod token is not valid to request a new one)?
There's no logic to re-query a token if, say, a request fails. Calico will periodically refresh the token, but if you change the server cert you will likely need to trigger a rolling update of Calico in order to get valid credentials.
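One way to check whether that periodic refresh is actually happening is to look at when calico-node last rewrote the CNI kubeconfig on an affected node (a sketch using the path mentioned earlier in this thread; a stale modification time would suggest the token refresher is not running):

# Print the modification time of the CNI kubeconfig on the node itself.
sudo stat -c '%y %n' /etc/cni/net.d/calico-kubeconfig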
In my case Calico was not able to identify the Ethernet interface properly: it was configured to detect eth interfaces, but on the machine they were named ens, so setting an interface regex helped it identify the interface and the associated IP correctly. https://www.unixcloudfusion.in/2022/02/solved-caliconode-is-not-ready-bird-is.html
Well, the link above seems to be a bit outdated (it references k8s 1.14) and the manifest changes are obsolete, too. Anyway, the hint about the interface regex seems to have done the trick so far. Here's how to do it with the Calico Installation CRD: https://docs.tigera.io/calico/latest/networking/ipam/ip-autodetection
For a quick reference, here is what I added to my Installation settings:
kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      interface: eth.*
The important part is the interface regex. My nodes run in KVM and have two interfaces: one is a bridge, the other a plain host network. And then there are some tunl and veth interfaces among the other cni and calico ones.
So, my current Installation looks like this:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      interface: eth.*
    ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
All I did was add:
nodeAddressAutodetectionV4:
  interface: eth.*
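To double-check which address Calico actually autodetected on each node, one option is to look at the node annotation calico-node writes (a sketch; <node-name> is a placeholder):

# The projectcalico.org/IPv4Address annotation records the detected address.
kubectl describe node <node-name> | grep projectcalico.org/IPv4Address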
Thanks for the updates, guys. I think we can mark this one as done now.
@caseydavenport Sorry to say, but it seems the issue is still present. I again got "network": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized" after reverting to an older snapshot, even though my time is synced and I applied the settings I posted yesterday. I was so confident ...
It happened again a few days later, after kubectl apply -f xxxx.yaml:
Type     Reason                  Age               From               Message
----     ------                  ----              ----               -------
Normal   Scheduled               79s               default-scheduler  Successfully assigned default/nginx-deploy-96c55d769-sdkvt to k8s-node2
Warning  FailedCreatePodSandBox  80s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0824863093e1d44bc9bb162a82e4300dcfe8283cd489070dea095be3817f290f": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
Normal   SandboxChanged          3s (x7 over 79s)  kubelet            Pod sandbox changed, it will be killed and re-created.
Hello. Same issue.
K8s version: v1.23.5
Calico node image: quay.io/calico/node:v3.20.3
Hi guys, I've also encountered this.
Kubernetes version: 1.23.17
Calico node image: docker.io/calico/node:v3.21.5
Temporary resolution: kubectl delete po -l k8s-app=calico-node -n kube-system; the pods then restart and the problem is resolved.
Maybe you guys should see this post: https://kb.vmware.com/s/article/95457
Restarting the calico pods resolved the issue. As suggested by @caseydavenport, it was using old creds, I guess.
Pods can't join/create the network because of an authorization problem. This is a follow-up to Calico issue #5712.
Expected Behavior
Pods start successfully.
Current Behavior
Pods report this error: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_coredns-6d4b75cb6d-vc6lr_kube-system_19f66e13-7c95-43cb-b1d6-9a0e1d7deb29_0(d62b74d28a2d5a333912418806a7b287f4fda4c67a9a2a1ccc9fe76c7bcfe889): error adding pod kube-system_coredns-6d4b75cb6d-vc6lr to CNI network "k8s-pod-network": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
Possible Solution
Steps to Reproduce (for bugs)
1) Install a one-node cluster using: https://www.server-world.info/en/note?os=Fedora_36&p=kubernetes&f=1
2) Their method of Calico installation triggers Calico bug #6087.
2a) Node untainted + custom-resources.yaml edited.
3) Install Calico using https://projectcalico.docs.tigera.io/getting-started/kubernetes/quickstart with the custom-resources.yaml edit from bug #6087.
4) Deploy a test pod: kubectl create deployment test-nginx --image=nginx
5) kubectl describe shows the error above.
6) Turn off the MacBook with the testing Fedora VM. Enjoy a good night's sleep. Turn on the laptop, start the Fedora VM.
7) The issue expanded from the test pod to the coredns pods.
8) I noticed the note in the quick start guide about namespaces and tried to redeploy Calico to the kube-system namespace using the command: kubectl create --namespace=kube-system -f tigera-operator.yaml
but it failed with errors similar to: "the namespace from the provided object "tigera-operator" does not match the namespace "kube-system". You must pass '--namespace=tigera-operator' to perform this operation." I'm not sure how to edit tigera-operator.yaml; the namespace seems to be hardcoded (see the note after this list).
9) I tried to create a container in the same namespace as Calico (tigera-operator); it failed with the same error.
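A note on step 8: the tigera-operator manifest ships its own Namespace object and hardcodes namespace: tigera-operator on its namespaced resources, which is why the --namespace override is rejected. Assuming a stock manifest, applying it without the override should work as-is:

# Apply without --namespace; the manifest defines its own namespace.
kubectl create -f tigera-operator.yaml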
Context
Your Environment
Fully updated Fedora 36 install on a MacBook using VMware Fusion.
Kubernetes:
[root@malanjan Kubernetes]# dnf info kubernetes
Name    : kubernetes
Version : 1.24.1
Release : 7.fc36
Calico: latest, installed using:
kubectl create -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml
kubectl create -f https://projectcalico.docs.tigera.io/manifests/custom-resources.yaml
[root@malanjan Kubernetes]# kubectl get nodes
NAME       STATUS   ROLES           AGE   VERSION
malanjan   Ready    control-plane   24h   v1.24.1
[root@malanjan Kubernetes]# kubectl get pods -A
NAMESPACE         NAME                               READY   STATUS              RESTARTS      AGE
default           test-nginx-7f4c594655-qzgmv        0/1     ContainerCreating   0             19h
kube-system       coredns-6d4b75cb6d-8w7f5           0/1     ContainerCreating   1             24h
kube-system       coredns-6d4b75cb6d-vc6lr           0/1     ContainerCreating   1             24h
kube-system       etcd-malanjan                      1/1     Running             7 (77m ago)   24h
kube-system       kube-apiserver-malanjan            1/1     Running             5 (77m ago)   24h
kube-system       kube-controller-manager-malanjan   1/1     Running             2             24h
kube-system       kube-proxy-7djp4                   1/1     Running             2             24h
kube-system       kube-scheduler-malanjan            1/1     Running             4             24h
tigera-operator   tigera-operator-6995cc5df5-gcmxf   1/1     Running             7 (76m ago)   19h
[root@malanjan Kubernetes]# kubectl describe -n kube-system pods/coredns-6d4b75cb6d-vc6lr
Name:                 coredns-6d4b75cb6d-vc6lr
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 malanjan/192.168.42.131
Start Time:           Thu, 11 Aug 2022 11:46:00 +0200
Labels:               k8s-app=kube-dns
                      pod-template-hash=6d4b75cb6d
Annotations:
Status: Running
IP:
IPs:
Controlled By: ReplicaSet/coredns-6d4b75cb6d
Containers:
coredns:
Container ID:
Image: k8s.gcr.io/coredns/coredns:v1.8.6
Image ID:
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: ContainerCreating
Last State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was deleted. The container used to be Running
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 1
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8pc5v (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
kube-api-access-8pc5v:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 4m13s (x264 over 63m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_coredns-6d4b75cb6d-vc6lr_kube-system_19f66e13-7c95-43cb-b1d6-9a0e1d7deb29_0(d62b74d28a2d5a333912418806a7b287f4fda4c67a9a2a1ccc9fe76c7bcfe889): error adding pod kube-system_coredns-6d4b75cb6d-vc6lr to CNI network "k8s-pod-network": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized