projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

calico/node token is invalidated by Kubernetes when the pod is evicted, leading to CNI failures #4857

Closed dghubble closed 2 years ago

dghubble commented 3 years ago

Expected Behavior

Calico CNI plugin tears down Pod in a timely manner.

Current Behavior

Calico CNI plugin shows errors terminating Pods, and therefore eviction takes too long. Especially relevant in Kubernetes conformance testing.

Aug 18 18:19:04.521: INFO: At 2021-08-18 18:18:01 +0000 UTC - event for taint-eviction-a1: {kubelet ip-10-0-8-52} FailedKillPod: error killing pod: failed to "KillPodSandbox" for "0701ef9b-e17d-43b5-a48f-89fa3ac00999" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"taint-eviction-a1_taint-multiple-pods-4011\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"

The natural thing to check is the RBAC permissions, which match the recommendations:

- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  verbs:
  - get
  - list
  - watch
...

To be certain, we can use the actual kubeconfig Calico writes to the host's /etc/cni/net.d. It does indeed seem to have permission to get clusterinformations. The error above is unusual.

./kubectl --kubeconfig /etc/cni/net.d/calico-kubeconfig auth can-i get clusterinformations --all-namespaces
yes

Steps to Reproduce (for bugs)

sonobuoy run --e2e-focus="NoExecuteTaintManager Multiple Pods" --e2e-skip="" \
--plugin-env=e2e.E2E_EXTRA_ARGS="--non-blocking-taints=node-role.kubernetes.io/controller"

Context

This issue affects Kubernetes Conformance tests:

Summarizing 1 Failure:

[Fail] [sig-node] NoExecuteTaintManager Multiple Pods [Serial] [It] evicts pods with minTolerationSeconds [Disruptive] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

The test in question creates two Pods that don't tolerate a taint, and expects them to be terminated within certain times. In Kubelet logs, the Calico CNI plugin is complaining with the logs above and termination takes too long.

Your Environment

caseydavenport commented 3 years ago

Well that's certainly bizarre. Not sure what has changed in that area that might cause this issue, especially given the RBAC seems to have the correct permissions present! Will see what I can find..

caseydavenport commented 3 years ago

@dghubble could you confirm that the calico-kubeconfig file is actually the one referenced in Calico CNI configuration json? Just to make sure the plugin is actually using the file.

One thing that might be relevant here is that Kubernetes recently enabled service account projection by default: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume

You may need manifest changes so that Calico can maintain the credentials on disk as they are rotated. I see that you have the correct volume mount here: https://github.com/poseidon/terraform-render-bootstrap/blob/master/resources/calico/daemonset.yaml#L174-L177

You may also need to set CALICO_MANAGE_CNI=true in the calico/node env to enable the right logic, though.

dghubble commented 3 years ago

When 10-calico.conflist is written out, __KUBECONFIG_FILEPATH__ is replaced with /etc/cni/net.d/calico-kubeconfig. Within the calico-node container, the location of the mounted file is /host/cni/net.d/calico-kubeconfig. Maybe that's not what calico-node wants, but it's been this way a while.

That seems to match what Calico's release calico.yaml shows here and Calico reports no troubles finding a kubeconfig either.

Setting calico-node's env didn't seem to alter the result.

- name: CALICO_MANAGE_CNI
  value: "true"

ilyesAj commented 3 years ago

We're facing the same issue here. Are there any suggested solutions?

Our Environment:
- Calico version: v3.19.3
- Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.21.5
- Operating System and version: ubuntu 20.04
- Provisioning tool: kops v1.21.1

caseydavenport commented 3 years ago

getting ClusterInformation: connection is unauthorized: Unauthorized"

The error is certainly unusual. I haven't been able to reproduce this at all in my own rigs. Generally, an RBAC issue would show something more precise - e.g., "serviceaccount X is not allowed to Y resource Z".

Issues with bad TLS credentials would also show up more clearly. I'm not really sure what the root cause of this is, and I think to figure it out we probably need to dig into the API server logs to see why it is rejecting the request as unauthorized.
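
The distinction drawn above is between authentication and authorization. As a rough sketch (the function name and the wording of the messages are illustrative and not taken from Calico or Kubernetes), the HTTP status returned by the API server indicates which layer rejected the request. A bare "Unauthorized" corresponds to 401, which points at the credential itself rather than at RBAC:

```python
from http import HTTPStatus


def diagnose(status_code: int) -> str:
    """Map an API server HTTP status to the layer that rejected the request."""
    if status_code == HTTPStatus.UNAUTHORIZED:  # 401
        # The credential was not accepted at all (expired, revoked, malformed).
        return "authentication: the token or certificate itself was rejected"
    if status_code == HTTPStatus.FORBIDDEN:  # 403
        # The caller is identified, but RBAC does not allow the action.
        return "authorization: the identity is known but RBAC denies the action"
    return "other: not an authn/authz rejection"


print(diagnose(401))
print(diagnose(403))
```

Under this reading, correct RBAC rules and a 401 response are not contradictory: the request never reaches the RBAC check.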

caseydavenport commented 2 years ago

@dghubble @ilyesAj it's been a long time on this one... did you ever make any headway on it? I haven't seen this anywhere else.

dghubble commented 2 years ago

To pass Kubernetes v1.22 conformance testing, Typhoon used flannel instead of Calico. I'll re-run with Calico during the v1.23 cycle.

ilyesAj commented 2 years ago

@caseydavenport we have rolled back to k8s 1.20.11. It seems to be related to the k8s version; we haven't had any problems since.

lwr20 commented 2 years ago

I had a go at reproing this.

That hit the same issue, I think:

Mar 18 10:28:43.496: FAIL: Failed to evict all Pods. 2 pod(s) is not evicted.

Full Stack Trace
k8s.io/kubernetes/test/e2e.RunE2ETests(0x23f7fb7)
    _output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:133 +0x697
k8s.io/kubernetes/test/e2e.TestE2E(0x2371919)
    _output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e_test.go:136 +0x19
testing.tRunner(0xc000a1c680, 0x71553e0)
    /usr/local/go/src/testing/testing.go:1259 +0x102
created by testing.(*T).Run
    /usr/local/go/src/testing/testing.go:1306 +0x35a
STEP: verifying the node doesn't have the taint kubernetes.io/e2e-evict-taint-key=evictTaintVal:NoExecute
[AfterEach] [sig-node] NoExecuteTaintManager Multiple Pods [Serial]
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:186
STEP: Collecting events from namespace "taint-multiple-pods-5037".
STEP: Found 17 events.
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:50 +0000 UTC - event for taint-eviction-b1: {default-scheduler } Scheduled: Successfully assigned taint-multiple-pods-5037/taint-eviction-b1 to ip-10-0-58-124
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:50 +0000 UTC - event for taint-eviction-b2: {default-scheduler } Scheduled: Successfully assigned taint-multiple-pods-5037/taint-eviction-b2 to ip-10-0-58-124
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:51 +0000 UTC - event for taint-eviction-b1: {kubelet ip-10-0-58-124} Pulled: Container image "k8s.gcr.io/pause:3.6" already present on machine
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:51 +0000 UTC - event for taint-eviction-b1: {kubelet ip-10-0-58-124} Created: Created container pause
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:51 +0000 UTC - event for taint-eviction-b1: {kubelet ip-10-0-58-124} Started: Started container pause
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:51 +0000 UTC - event for taint-eviction-b2: {kubelet ip-10-0-58-124} Started: Started container pause
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:51 +0000 UTC - event for taint-eviction-b2: {kubelet ip-10-0-58-124} Pulled: Container image "k8s.gcr.io/pause:3.6" already present on machine
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:51 +0000 UTC - event for taint-eviction-b2: {kubelet ip-10-0-58-124} Created: Created container pause
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:58 +0000 UTC - event for taint-eviction-b1: {taint-controller } TaintManagerEviction: Marking for deletion Pod taint-multiple-pods-5037/taint-eviction-b1
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:26:58 +0000 UTC - event for taint-eviction-b1: {kubelet ip-10-0-58-124} Killing: Stopping container pause
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:27:03 +0000 UTC - event for taint-eviction-b1: {kubelet ip-10-0-58-124} FailedKillPod: error killing pod: failed to "KillPodSandbox" for "a7fa4a6f-f477-43f8-af43-821734900ee8" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"9676827c763d10f18114d0666c597de32f3a9c1d1efd1741cf61a901e3a74f2b\": connection is unauthorized: Unauthorized"
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:27:04 +0000 UTC - event for taint-eviction-b1: {kubelet ip-10-0-58-124} FailedKillPod: error killing pod: failed to "KillPodSandbox" for "a7fa4a6f-f477-43f8-af43-821734900ee8" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"9676827c763d10f18114d0666c597de32f3a9c1d1efd1741cf61a901e3a74f2b\": error getting ClusterInformation: connection is unauthorized: Unauthorized"
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:27:18 +0000 UTC - event for taint-eviction-b2: {kubelet ip-10-0-58-124} Killing: Stopping container pause
Mar 18 10:28:45.166: INFO: At 2022-03-18 10:27:18 +0000 UTC - event for taint-eviction-b2: {taint-controller } TaintManagerEviction: Marking for deletion Pod taint-multiple-pods-5037/taint-eviction-b2
Mar 18 10:28:45.167: INFO: At 2022-03-18 10:27:19 +0000 UTC - event for taint-eviction-b2: {kubelet ip-10-0-58-124} FailedKillPod: error killing pod: failed to "KillPodSandbox" for "e9a4ef09-2940-43b0-bf18-368d4d9bd77e" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"06daa4725d4984da00dbc56b4ebc9695897545b15fabf31569267b5fd22d5a8d\": error getting ClusterInformation: connection is unauthorized: Unauthorized"
Mar 18 10:28:45.167: INFO: At 2022-03-18 10:28:41 +0000 UTC - event for taint-eviction-b1: {taint-controller } TaintManagerEviction: Marking for deletion Pod taint-multiple-pods-5037/taint-eviction-b1
Mar 18 10:28:45.167: INFO: At 2022-03-18 10:28:44 +0000 UTC - event for taint-eviction-b2: {taint-controller } TaintManagerEviction: Cancelling deletion of Pod taint-multiple-pods-5037/taint-eviction-b2
Mar 18 10:28:45.340: INFO: POD                NODE            PHASE    GRACE  CONDITIONS
Mar 18 10:28:45.341: INFO: taint-eviction-b1  ip-10-0-58-124  Running  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:26:50 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:27:04 +0000 UTC ContainersNotReady containers with unready status: [pause]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:27:04 +0000 UTC ContainersNotReady containers with unready status: [pause]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:26:50 +0000 UTC  }]
Mar 18 10:28:45.342: INFO: taint-eviction-b2  ip-10-0-58-124  Running  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:26:50 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:27:20 +0000 UTC ContainersNotReady containers with unready status: [pause]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:27:20 +0000 UTC ContainersNotReady containers with unready status: [pause]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-03-18 10:26:50 +0000 UTC  }]

Looking at calico-node logs from that node, I see that the calico-node pod has a start time that is after the end of the test. I wonder if the problem is that calico-node doesn't tolerate the taint that's being added and is killed?

lwr20 commented 2 years ago

If I manually add the same taint that the test uses (kubernetes.io/e2e-evict-taint-key=evictTaintVal:NoExecute), I do indeed see the calico-node pod on that node disappear.

lwr20 commented 2 years ago

If I add a "tolerate everything" to the calico-node daemonset:

tolerations:
- operator: "Exists"

and re-run the e2e test, I see the test pass:

------------------------------
[sig-node] NoExecuteTaintManager Multiple Pods [Serial] 
  evicts pods with minTolerationSeconds [Disruptive] [Conformance]
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:630
[BeforeEach] [sig-node] NoExecuteTaintManager Multiple Pods [Serial]
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:185
STEP: Creating a kubernetes client
Mar 18 11:33:14.984: INFO: >>> kubeConfig: /tmp/kubeconfig-845565767
STEP: Building a namespace api object, basename taint-multiple-pods
W0318 11:33:15.041870      19 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
Mar 18 11:33:15.042: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-node] NoExecuteTaintManager Multiple Pods [Serial]
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/taints.go:345
Mar 18 11:33:15.064: INFO: Waiting up to 1m0s for all nodes to be ready
Mar 18 11:34:15.121: INFO: Waiting for terminating namespaces to be deleted...
[It] evicts pods with minTolerationSeconds [Disruptive] [Conformance]
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:630
Mar 18 11:34:15.148: INFO: Starting informer...
STEP: Starting pods...
Mar 18 11:34:15.454: INFO: Pod1 is running on ip-10-0-58-124. Tainting Node
Mar 18 11:34:19.703: INFO: Pod2 is running on ip-10-0-58-124. Tainting Node
STEP: Trying to apply a taint on the Node
STEP: verifying the node has the taint kubernetes.io/e2e-evict-taint-key=evictTaintVal:NoExecute
STEP: Waiting for Pod1 and Pod2 to be deleted
Mar 18 11:34:26.063: INFO: Noticed Pod "taint-eviction-b1" gets evicted.
Mar 18 11:34:46.172: INFO: Noticed Pod "taint-eviction-b2" gets evicted.
STEP: verifying the node doesn't have the taint kubernetes.io/e2e-evict-taint-key=evictTaintVal:NoExecute
[AfterEach] [sig-node] NoExecuteTaintManager Multiple Pods [Serial]
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:186
Mar 18 11:34:46.206: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "taint-multiple-pods-7930" for this suite.

• [SLOW TEST:91.247 seconds]
[sig-node] NoExecuteTaintManager Multiple Pods [Serial]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/framework.go:23
  evicts pods with minTolerationSeconds [Disruptive] [Conformance]
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:630
------------------------------
{"msg":"PASSED [sig-node] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds [Disruptive] [Conformance]","total":1,"completed":1,"skipped":31,"failed":0}

lwr20 commented 2 years ago

@dghubble So I think the problem lies with the calico manifest - not tolerating the taint that that test uses.

Where does Typhoon get the calico manifest from? Does it vendor the manifest or get it from the calico docs?

Oh - unless there's a way to ensure that the CNIs kubeconfig doesn't get deleted when calico-node gets killed? @caseydavenport ?

lwr20 commented 2 years ago

Just realised that my sonobuoy run doesn't use the same settings as the OP provided.

sonobuoy run --e2e-focus="NoExecuteTaintManager Multiple Pods" --e2e-skip="" --plugin-env=e2e.E2E_EXTRA_ARGS="--non-blocking-taints=node-role.kubernetes.io/controller"

Running that with the "tolerate everything" setting - both tests pass.

caseydavenport commented 2 years ago

that the CNIs kubeconfig doesn't get deleted when calico-node gets killed?

Hm, I didn't think that deleting the calico/node pod would remove the CNI kubeconfig. At least I don't think anything in Calico does that.

It might be that the token is invalidated and since calico/node isn't running it can't update the config on the host?

lwr20 commented 2 years ago

It might be that the token is invalidated and since calico/node isn't running it can't update the config on the host?

How often does the token get cycled? Because we hit this every time when I run the test (without the toleration). For that behaviour, the token cycling time would have to be ~1 min

lwr20 commented 2 years ago

From https://kubernetes.slack.com/archives/C0EN96KUY/p1647856984220609, it appears that kubernetes actively revokes service account credentials when pods are deleted.
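
One way to check whether a token is tied to a specific pod is to decode its JWT payload and look at the `kubernetes.io` claim, which bound service account tokens carry. The sketch below does this without verifying the signature. The helper names are hypothetical, and the token is a fake, unsigned one built only for demonstration; the pod name and UIDs are made up.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def bound_pod(claims: dict):
    """Return the name of the pod a bound token is tied to, or None."""
    return claims.get("kubernetes.io", {}).get("pod", {}).get("name")


def _b64(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Made-up claims in the shape of a bound service account token.
example_claims = {
    "sub": "system:serviceaccount:kube-system:calico-node",
    "exp": 1700003600,
    "kubernetes.io": {
        "namespace": "kube-system",
        "pod": {"name": "calico-node-abcde", "uid": "00000000-0000-0000-0000-000000000001"},
        "serviceaccount": {"name": "calico-node", "uid": "00000000-0000-0000-0000-000000000002"},
    },
}
fake_token = ".".join([_b64({"alg": "none"}), _b64(example_claims), ""])

print(bound_pod(jwt_claims(fake_token)))
```

If the token stored in the CNI kubeconfig names a pod that no longer exists, that would be consistent with the API server rejecting it regardless of its expiry time.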

lwr20 commented 2 years ago

Ah interesting. I found the vendored calico manifests in https://github.com/poseidon/terraform-render-bootstrap/blob/master/resources/calico/daemonset.yaml

They have:

      tolerations:
      - key: node-role.kubernetes.io/controller
        operator: Exists
      - key: node.kubernetes.io/not-ready
        operator: Exists
      %{~ for key in daemonset_tolerations ~}
      - key: ${key}
        operator: Exists
      %{~ endfor ~}

Whereas https://docs.projectcalico.org/manifests/calico.yaml has:

      tolerations:
        # Make sure calico-node gets scheduled on all nodes.
        - effect: NoSchedule
          operator: Exists
        # Mark the pod as a critical add-on for rescheduling.
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoExecute
          operator: Exists

@dghubble Perhaps the Typhoon manifest needs to add

        - effect: NoSchedule
          operator: Exists
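
To compare the two toleration lists against the taint the test applies, here is a minimal sketch of the toleration matching rules as documented by Kubernetes (an empty key or empty effect acts as a wildcard, and the operator defaults to Equal). The function names are hypothetical, and the Typhoon list assumes `daemonset_tolerations` is left empty.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key on the toleration matches every taint key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def tolerated(tolerations: list, taint: dict) -> bool:
    """Return True if any toleration in the list matches the taint."""
    return any(tolerates(t, taint) for t in tolerations)


# The taint applied by the e2e test.
E2E_TAINT = {
    "key": "kubernetes.io/e2e-evict-taint-key",
    "value": "evictTaintVal",
    "effect": "NoExecute",
}

# Typhoon's vendored manifest, assuming no extra daemonset_tolerations.
TYPHOON = [
    {"key": "node-role.kubernetes.io/controller", "operator": "Exists"},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists"},
]

# Tolerations from the upstream calico.yaml.
UPSTREAM = [
    {"effect": "NoSchedule", "operator": "Exists"},
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
    {"effect": "NoExecute", "operator": "Exists"},
]

print(tolerated(TYPHOON, E2E_TAINT))   # False
print(tolerated(UPSTREAM, E2E_TAINT))  # True
```

Under these rules, the keyless NoExecute entry is the one that matches the test taint; a keyless NoSchedule entry on its own would not, because the effects differ.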

dghubble commented 2 years ago

Thanks for looking into this folks!

I know CNI's docs often show an "allow everywhere" toleration (i.e. operator: Exists without a key). However, we can't ship that. Clusters support many platforms (clouds, bare-metal) and support heterogeneous nodes with different properties (e.g. worker pools with different OSes, Arch, resources, hardware, etc).

Choosing on behalf of users that a Calico DaemonSet should be on ALL nodes would limit use cases. For example,

      tolerations:
        # Make sure calico-node gets scheduled on all nodes.   <- Good for simple clusters (90% use case)
        - effect: NoSchedule
          operator: Exists
        # Mark the pod as a critical add-on for rescheduling.  <- Deprecated
        - key: CriticalAddonsOnly
          operator: Exists
        # <- Good for simple clusters
        - effect: NoExecute
          operator: Exists

Instead, Typhoon allows kube-system DaemonSet tolerations to be configured, to support those more advanced cases. Here's one example (though Typhoon doesn't support ARM64 if Calico is chosen).

      tolerations:
      - key: node-role.kubernetes.io/controller
        operator: Exists
      - key: node.kubernetes.io/not-ready
        operator: Exists
      %{~ for key in daemonset_tolerations ~}
      - key: ${key}
        operator: Exists
      %{~ endfor ~}

From your investigation, it sounds like having this conformance test pass will require listing what those expected taints are, and provisioning the cluster so that Calico tolerates them. I suppose the reason Cilium and flannel don't hit this is because they're not relying on credentials in the same way.

lwr20 commented 2 years ago

Clusters with x86 and arm64 nodes, Calico does not ship a typical multi-arch container image (it ships an image per architecture, which is different and requires DaemonSets matching a subset of nodes)

I don't think that's true any more? @caseydavenport could probably confirm.

caseydavenport commented 2 years ago

Calico does not ship a typical multi-arch container image (it ships an image per architecture, which is different and requires DaemonSets matching a subset of nodes)

Yep, this one at least is no longer the case (manifests are multi-arch now).

I suppose the reason Cilium and flannel don't hit this is because they're not relying on credentials in the same way.

Yeah, this would only be hit if the CNI plugin on the host needs to make API calls and is doing so using the serviceaccount token of the daemonset pod that installed it.

One option here might be to use the TokenRequest API directly to provision a separate token not bound to the life of the calico/node pod.

In general I agree that we can't expect "Tolerate all" to be acceptable for every single cluster in existence, but I do think it is the correct default for what we ship because it will be right for the vast majority of clusters.

Certain cases where you don't want a CNI provider on a set of nodes at all

I believe for use cases like this we have switched to using node affinities rather than taints/tolerations. For example, this node affinity prevents us from running on fargate nodes: https://github.com/tigera/operator/blob/master/pkg/render/node.go#L711

dghubble commented 2 years ago

Awesome to see the multi-arch manifest images. I'll check those out, that'll help remove one case.

I agree, for the vast majority of clusters your example seems great. I wouldn't advocate changing it either.

I may look at node affinities, but having those be conditional is a lot more tricky just with the Terraform logic available to us. And taints do seem to express the situation fairly clearly. I'm not sure affinities would have a different end result (e.g. Kubernetes E2E might have an equivalent test there that gets thrown off in the same way).

Thanks to your investigation @lwr20, this looks like a detail of how CNCF conformance tests run, you'd agree? That wouldn't affect real clusters as far as I can tell. We could just say, if we need to pass conformance, you need to tolerate kubernetes.io/e2e-evict-taint-key https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/taints.go#L48 in our docs.

module "typhoon-cluster" {
  ...
  networking            = "calico"
  daemonset_tolerations = ["kubernetes.io/e2e-evict-taint-key"]
}

lwr20 commented 2 years ago

this looks like a detail of how CNCF conformance tests run, you'd agree?

I agree. I certainly don't think that particular conformance test is intended to mandate that "CNIs must work without talking to the kubernetes apiserver" for example.

One option here might be to use the TokenRequest API directly to provision a separate token not bound to the life of the calico/node pod.

Of course, if we could do this, that would be ideal, but it will need some prototyping and testing.

dghubble commented 2 years ago

Adding the DaemonSet toleration for kubernetes.io/e2e-evict-taint-key gets this conformance test passing for me as well. I can update conformance testing nodes and go ahead and close this issue if that's alright.

One option here might be to use the TokenRequest API directly to provision a separate token not bound to the life of the calico/node pod.

It would be nice to not hit this if that's a reasonable thing to do. Presumably that would be a separate issue if it's desired.

ScheererJ commented 2 years ago

@caseydavenport We have run into this issue in one of our clusters in a slightly different scenario. For bin packing reasons, we scale the resource requests of calico-node vertically in a cluster-proportional manner (pretty similar to https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/calico-policy-controller). As the cluster grew in size, calico-node was supposed to be recreated with bigger memory requests, i.e. the existing pod was deleted and a new one with higher memory requests was to be created. However, as the node was nearly fully loaded, there was not enough space for the new pod. Thanks to the priority class of calico-node, pre-emption occurred and the kube-scheduler tried to get rid of a lower priority pod on the node. However, we then ran into the problem that the lower priority pod could not be deleted, as network sandbox deletion via CNI fails with this error ("error getting ClusterInformation: connection is unauthorized: Unauthorized") because the token in calico's kubeconfig belongs to a deleted pod. The node cannot automatically recover from this: no pod can be completely removed due to the CNI error, and calico-node cannot be scheduled because its memory requirements are not fulfilled.

Is there a plan to resolve this issue, for example by using the token API directly or otherwise decoupling the validity of the token used for CNI from the calico-node pod lifecycle?

ScheererJ commented 2 years ago

As we would like to have this issue fixed properly, we would like to contribute a solution to this. It seems like the logic is scattered across two places:

Instead of using the token of the calico-node pod, we would propose creating a separate token via https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/ either bound to no object or bound to the node object. The validity period can be rather small, e.g. 1h. The token would then be replaced with a simple timer based approach.

@caseydavenport Would you be open to such a contribution?
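
A rough sketch of what this proposal could look like, under stated assumptions: it only builds the TokenRequest body and the API path and decides when a timer should fetch a replacement; it does not call the API server. The function names, the `kube-system`/`calico-node` service account, and the 80% refresh threshold are assumptions for illustration and not what Calico implements. The 600-second floor reflects my understanding that the API server rejects shorter durations.

```python
def token_request_body(expiration_seconds: int = 3600) -> dict:
    """Build a TokenRequest body that is not bound to any pod."""
    # Assumption: the API server rejects durations under 10 minutes.
    if expiration_seconds < 600:
        raise ValueError("expirationSeconds must be at least 600")
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenRequest",
        # No boundObjectRef, so the token does not die with a pod.
        "spec": {"expirationSeconds": expiration_seconds},
    }


def token_request_path(namespace: str, service_account: str) -> str:
    """Path of the service account token subresource to POST the body to."""
    return f"/api/v1/namespaces/{namespace}/serviceaccounts/{service_account}/token"


def needs_refresh(issued_at: float, expires_at: float, now: float,
                  fraction: float = 0.8) -> bool:
    """Timer check: refresh once `fraction` of the token lifetime has elapsed."""
    return now >= issued_at + fraction * (expires_at - issued_at)


# Hypothetical values: a 1h token for the calico-node service account.
body = token_request_body(3600)
path = token_request_path("kube-system", "calico-node")
print(path)
print(needs_refresh(issued_at=0, expires_at=3600, now=3000))
```

Refreshing well before expiry leaves a margin in which the CNI plugin on the host can keep using the old token if a refresh attempt fails.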

musigma-admin commented 2 years ago

@caseydavenport We have run into this issue in one of our clusters in a slightly different scenario. For bin packing reasons, we scale the resource requests of calico-node vertically in a cluster-proportional manner (pretty similar to https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/calico-policy-controller). As the cluster grew in size, calico-node was supposed to be recreated with bigger memory requests, i.e. the existing pod was deleted and a new one with higher memory requests was to be created. However, as the node was nearly fully loaded, there was not enough space for the new pod. Thanks to the priority class of calico-node, pre-emption occurred and the kube-scheduler tried to get rid of a lower priority pod on the node. However, we then ran into the problem that the lower priority pod could not be deleted, as network sandbox deletion via CNI fails with this error ("error getting ClusterInformation: connection is unauthorized: Unauthorized") because the token in calico's kubeconfig belongs to a deleted pod. The node cannot automatically recover from this: no pod can be completely removed due to the CNI error, and calico-node cannot be scheduled because its memory requirements are not fulfilled.

Is there a plan to resolve this issue, for example by using the token API directly or otherwise decoupling the validity of the token used for CNI from the calico-node pod lifecycle?

Hi, in our case the setting is slightly different. Even new nodes with comparatively few or no pods have this issue. Even after I formatted the machine and added it back, we still get this CNI issue.

You can find more about this issue here

caseydavenport commented 2 years ago

Instead of using the token of the calico-node pod, we would propose creating a separate token via https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/ either bound to no object or bound to the node object. The validity period can be rather small, e.g. 1h. The token would then be replaced with a simple timer based approach.

Yes, this is the approach that I was musing on as well. I think it is worth exploring this to see what it would look like and what limitations it might have (hopefully none)

ScheererJ commented 2 years ago

Yes, this is the approach that I was musing on as well. I think it is worth exploring this to see what it would look like and what limitations it might have (hopefully none)

@caseydavenport Should I create a corresponding pull request or do you plan to explore it yourself?

caseydavenport commented 2 years ago

@ScheererJ I'd be happy to review a PR for this if you have one. Otherwise I will take a look at it myself once v3.23 is out the door (so in a couple of weeks).

ScheererJ commented 2 years ago

@caseydavenport Feel free to take a look at #5910 if you have some time to spare. I will be on vacation next week, though. Hence, there is no hurry from my side.