projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Pod add/delete operations result in "error getting ClusterInformation: connection is unauthorized: Unauthorized" #9428

Open steven-webster opened 3 weeks ago

steven-webster commented 3 weeks ago

OS: Linux kernel v6.6.0-1
Kubernetes version: v1.29.2
Calico version: v3.28.0

Hello, I understand this issue has been raised and closed before, with a potential solution in https://github.com/projectcalico/calico/pull/5910. That commit should be included in our Calico 3.28.0, but we have recently seen the "connection is unauthorized: Unauthorized" error on a couple of systems. The workaround, as others have mentioned in related bug reports, seems to be simply restarting the calico-node pod. These systems have since been re-installed with no recurrence of the message, but we have retained logs from the affected systems, which may help.

Looking at the calico-node logs, there doesn't seem to be any issue reported by the 'token watch' facility; updates/writes of the kubeconfig file always appear to succeed, e.g.:

2024-10-19T20:50:00.035673627Z stdout F 2024-10-19 20:50:00.035 [INFO][81] cni-config-monitor/token_watch.go 226: Update of CNI kubeconfig triggered based on elapsed time.
2024-10-19T20:50:00.035909968Z stdout F 2024-10-19 20:50:00.035 [INFO][81] cni-config-monitor/token_watch.go 281: Wrote updated CNI kubeconfig file. path="/host/etc/cni/net.d/calico-kubeconfig"
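The token watcher periodically rewrites the CNI kubeconfig with a fresh, time-bounded service-account token (a JWT). When debugging this class of failure, one offline sanity check is to pull the token out of the kubeconfig and decode its `exp` claim, since a JWT payload is just base64url-encoded JSON, then compare it against the node clock. A minimal sketch, not from the Calico codebase; the kubeconfig path is the one from the log line above, and the skew allowance is an arbitrary illustration value:

```python
import base64
import json
import time

# Hypothetical path; taken from the token_watch log line above.
CNI_KUBECONFIG = "/host/etc/cni/net.d/calico-kubeconfig"

def jwt_expiry(token: str) -> int:
    """Return the `exp` claim (Unix seconds) from a JWT, without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"]

def token_is_current(token: str, skew_seconds: int = 60) -> bool:
    """True if the token is unexpired, allowing for modest clock skew."""
    return jwt_expiry(token) - time.time() > skew_seconds
```

The token itself can be extracted from the `users[0].user.token` field of the kubeconfig; if it is unexpired by the node's clock yet still rejected by the apiserver, that points away from the refresher and toward the apiserver side.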

Expected Behavior

Ability to start/stop pods.

Current Behavior

Here's an example of what one of the error messages looks like when trying to start a pod:

2024-10-22T22:30:01.648 controller-0 kubelet[13782]: info I1022 22:30:01.648705 13782 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="kube-system/ceph-pools-audit-28827270-65l9h"
2024-10-22T22:30:02.145 controller-0 kubelet[13782]: info E1022 22:30:02.145259 13782 remote_runtime.go:193] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"50242a9d9cc28594d0ef9190046f0fef6ab17a977ec922c25d28fc59f2a352b1\": plugin type=\"multus\" name=\"multus-cni-network\" failed (add): [kube-system/ceph-pools-audit-28827270-65l9h/21fbd5d4-046f-46fe-8d40-4b7b52f5daab:chain]: error adding container to network \"chain\": error getting ClusterInformation: connection is unauthorized: Unauthorized"

Possible Solution

Killing/restarting the calico-node pod resolves things.

Steps to Reproduce (for bugs)

Unfortunately I don't have a reliable way to reproduce this.

caseydavenport commented 2 weeks ago

> error adding container to network "chain": error getting ClusterInformation: connection is unauthorized: Unauthorized

I've seen other instances of this caused by clock skew on the node, i.e. the node's time drifting far enough that the apiserver rejects otherwise-valid tokens.

I think the generic "connection is unauthorized: Unauthorized" error message typically means something more fundamental is wrong with the system, rather than, say, an expired token or inadequate RBAC, both of which would result in a more specific error message.

steven-webster commented 2 weeks ago

Hey @caseydavenport, thanks for the response. This system uses NTP. Looking around the time the system started reporting "connection is unauthorized: Unauthorized", I can see the following ntpq info:

     remote           refid      st t  when poll reach   delay   offset  jitter
==============================================================================
+2606:82c0:21::e 216.239.35.12    2 u   108  128   377  185.897  -0.994   0.243
+64:ff9b::2d3d:b 194.0.5.123      3 u    78  128   377   64.570  -4.262   0.988
*2620:10a:a001:a 206.108.0.132    2 u    57  128   377    0.257  +2.627   0.087
+2620:10a:a001:d 90.139.163.33    3 s   101  128   376   50.174  +0.640   0.157
+2606:82c0:21::e 216.239.35.12    2 u    11  128   377  185.704  -1.257   0.127
+64:ff9b::2d3d:b 194.0.5.123      3 u   111  128   377   64.570  -4.262   0.921

Sorry if the formatting is bad, but we do see a spike of around -4.262 ms offset near where the problem happens. Looking at the rest of the logs, though, this doesn't seem to be an aberration from the whole. Also, this particular peer is an external NTP server reached via 6to4.
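The offsets in the `ntpq -p` billboard above are in milliseconds, so even the worst peer is only a few ms off. A small sketch of pulling the offsets out programmatically to find the largest drift; it assumes the standard ten-column peer row layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter) and skips the header and separator rows:

```python
def parse_ntpq_peers(lines):
    """Parse `ntpq -p` peer rows into (remote, offset_ms) pairs."""
    peers = []
    for line in lines:
        fields = line.split()
        if len(fields) != 10:
            continue  # separator rows and short lines
        try:
            offset_ms = float(fields[8])  # offset column, in milliseconds
        except ValueError:
            continue  # header row ("offset" is not a number)
        peers.append((fields[0], offset_ms))
    return peers

def max_abs_offset_ms(lines):
    """Largest absolute peer offset, in milliseconds (0.0 if no peers)."""
    return max((abs(off) for _, off in parse_ntpq_peers(lines)), default=0.0)
```

For context, the apiserver tolerates clock error far larger than a few milliseconds when validating token lifetimes, so an offset on this scale would not by itself explain an Unauthorized rejection.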

I guess it's just a mystery to me why restarting the calico-node pod (as in the other workarounds for this issue) resolves things.

caseydavenport commented 1 week ago

Sounds unlikely to be a time mismatch, then.

Hard to say what else it might be. I'd recommend that the next time this happens, you look at the Kubernetes apiserver logs to see whether they explain why the requests are being rejected as Unauthorized - sometimes there is additional context in those logs.
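When correlating kubelet errors against apiserver logs, it helps to extract the timestamps of the Unauthorized failures first and then search the apiserver logs around those instants. A small sketch that scans kubelet-style log lines (like the ones quoted earlier in this issue) for the error and collects the second-resolution timestamps; the regex is an illustration tied to that log format, not part of any Calico tooling:

```python
import re

# Matches lines that carry the "connection is unauthorized: Unauthorized"
# error and captures a leading ISO-8601 timestamp at second resolution.
UNAUTHORIZED_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}).*"
    r"connection is unauthorized: Unauthorized"
)

def unauthorized_events(log_lines):
    """Return the timestamps of lines reporting the Unauthorized error."""
    events = []
    for line in log_lines:
        m = UNAUTHORIZED_RE.search(line)
        if m:
            events.append(m.group("ts"))
    return events
```

With those timestamps in hand, the apiserver logs (or audit logs, if enabled) can be narrowed to the same windows to see which credential was presented and why it was rejected.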