projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

calico-node cannot refresh expired serviceaccount token due to apiserver throttling #7694

Open · wedaly opened this issue 1 year ago

wedaly commented 1 year ago

Expected Behavior

calico-node is able to refresh the serviceaccount token used by calico CNI

Current Behavior

AKS had a customer report the following errors from Calico CNI: "error getting ClusterInformation: connection is unauthorized: Unauthorized". The errors occurred consistently across multiple nodes for about 30 hours, then resolved without intervention. The cluster had 8 nodes at the time the incident occurred.

Upon investigation, we discovered that:

  1. requests from calico-node to apiserver to create the serviceaccount token were failing due to a 429 response from apiserver. The errors in the calico-node logs looked like:

    2023-05-06 05:25:29.092 [ERROR][87] cni-config-monitor/token_watch.go 131: Failed to update CNI token, retrying... error=the server was unable to return a response in the time allotted, but may still be processing the request (post serviceaccounts calico-node)
  2. we observed that all of the 429 responses were coming from one pod of apiserver (out of six pods running)

  3. apiserver was using API Priority and Fairness and classifying the requests from calico-node as workload-low
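
For context, on clusters with API Priority and Fairness enabled, requests from service accounts that don't match a more specific FlowSchema typically fall into the built-in service-accounts FlowSchema, which maps them to the workload-low priority level. The sketch below shows the rough, abridged shape of that object (exact fields vary by Kubernetes version); the FlowSchema and priority level that actually matched a given request can be confirmed from the X-Kubernetes-PF-FlowSchema-UID and X-Kubernetes-PF-PriorityLevel-UID response headers.

    # Rough, abridged shape of the built-in FlowSchema that catches generic
    # service-account traffic. Inspect the real object with:
    #   kubectl get flowschema service-accounts -o yaml
    apiVersion: flowcontrol.apiserver.k8s.io/v1beta3   # v1 on newer clusters
    kind: FlowSchema
    metadata:
      name: service-accounts
    spec:
      matchingPrecedence: 9000
      priorityLevelConfiguration:
        name: workload-low          # requests matched here share the workload-low budget
      distinguisherMethod:
        type: ByUser
      rules:
      - subjects:
        - kind: Group
          group:
            name: system:serviceaccounts
        resourceRules:
        - apiGroups: ["*"]
          resources: ["*"]
          verbs: ["*"]
          clusterScope: true
          namespaces: ["*"]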

Possible Solution

I suspect this might be a new failure mode introduced by https://github.com/projectcalico/calico/pull/5910. In particular, when calico-node instances happen to connect to an overloaded apiserver replica and the CNI token expires, apiserver may throttle the requests to create a new serviceaccount token. This prevents the calico-node from refreshing the token, causing CNI failures.

Steps to Reproduce (for bugs)

I'm not sure how to reproduce this issue. We had a customer report the problem, and it resolved after ~30 hours without intervention.

Context

We had one customer report this issue running AKS-managed Calico.

Your Environment

caseydavenport commented 1 year ago

Oooh, fun.

Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.

I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.
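
For illustration only, something along these lines could route calico-node's API traffic to a higher built-in priority level. The service account name and namespace here are assumptions (calico-node in calico-system for operator installs, kube-system for manifest installs), and this is the kind of thing a cluster admin would tune rather than something we'd ship as-is:

    # Sketch of a cluster-specific FlowSchema for calico-node; the names,
    # namespace, and apiVersion depend on the install method and Kubernetes version.
    apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
    kind: FlowSchema
    metadata:
      name: calico-node
    spec:
      matchingPrecedence: 500       # lower than the built-in service-accounts schema (9000) so it matches first
      priorityLevelConfiguration:
        name: workload-high         # an existing built-in priority level
      distinguisherMethod:
        type: ByUser
      rules:
      - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: calico-node
            namespace: calico-system
        resourceRules:
        - apiGroups: ["*"]
          resources: ["*"]
          verbs: ["*"]
          clusterScope: true
          namespaces: ["*"]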

As for code changes we might make, those are a bit less obvious. Maybe one of these?

  • Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
  • Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer?

wedaly commented 1 year ago

> Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.
>
> I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.

That sounds like a reasonable solution. Agree that apiserver shouldn't be classifying requests from Calico as workload-low.

> As for code changes we might make, those are a bit less obvious. Maybe one of these?
>
> • Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
> • Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer?

There are some env vars in client-go that enable exponential backoff: https://github.com/kubernetes/kubernetes/blob/3d27dee047a87527735bf74cfcc6b8ff8875f66c/staging/src/k8s.io/client-go/rest/client.go#L36-L37. I'm not completely sure they would have helped in this case, but they might be worth exploring.
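
For reference, wiring those up would look something like this on the calico-node DaemonSet. The values are just the examples from the comments in the client-go source, and I haven't verified that the client used by token_watch.go actually reads them:

    # Sketch: client-go exponential-backoff env vars on the calico-node container.
    # Values are illustrative; whether the CNI token-refresh client honors them is unverified.
    spec:
      template:
        spec:
          containers:
          - name: calico-node
            env:
            - name: KUBE_CLIENT_BACKOFF_BASE        # backoff base, in seconds
              value: "1"
            - name: KUBE_CLIENT_BACKOFF_DURATION    # maximum backoff duration, in seconds
              value: "120"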