Open plytro opened 3 months ago
We should consider adopting the InClusterConfigLoader from the official Kubernetes Python client: https://github.com/kubernetes-client/python/blob/master/kubernetes/base/config/incluster_config.py
Or at least something similar that automatically refreshes the token: https://github.com/kubernetes-client/python/blob/392a8c1d0767ce534b121b3b0553e5b1297e430e/kubernetes/base/config/incluster_config.py#L95-L109
Search before asking
KubeRay Component
Others
What happened + What you expected to happen
What happened:
After upgrading to AKS version 1.30, we noticed that our head node pods work for approximately 1 hour, after which the autoscaler container in the pod starts getting HTTP 401 responses when querying the Kubernetes API. This causes the pod's readiness probe to fail, resulting in the loss of access to the head node via the LoadBalancer.
Through troubleshooting, we found that the pod definition includes a projected volume for the service account token used for API access, and that it indicates a token lifetime of 3607 seconds.
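For anyone who wants to confirm the lifetime from inside the pod, here is a minimal sketch. It relies on service account tokens being JWTs that carry `iat` and `exp` claims. The helper name `token_lifetime_seconds` is hypothetical, and the payload is decoded without verifying the signature, so this is only suitable for inspection.

```python
import base64
import json


def token_lifetime_seconds(token: str) -> int:
    """Return exp - iat from the payload of a JWT (signature NOT verified)."""
    # A JWT has three dot-separated segments: header.payload.signature.
    payload_b64 = token.split(".")[1]
    # Segments are base64url without padding; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"]) - int(claims["iat"])
```

In a pod, the token is normally mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token`, so calling the helper on the stripped contents of that file should report a value close to the `expirationSeconds` of the projected volume.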
As noted in the AKS 1.30 release notes, service account tokens are no longer given an extended lifetime. By default
I'm not 100% positive I'm reading this code correctly, but it seems the HTTP client is instantiated once, reads the token at instantiation, and never re-reads it to account for token expiration. We found that restarting the head node pod restores API communication, and HTTP calls to the Kubernetes API then succeed for another hour.
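To illustrate the kind of fix I have in mind, here is a rough sketch of re-reading the token from disk instead of caching it forever. This is not the autoscaler's actual code and not the official client's implementation. The class name `RefreshingTokenReader` and the 60-second default are assumptions chosen for illustration.

```python
import time


class RefreshingTokenReader:
    """Hypothetical helper: re-reads a token file once it is older than max_age_seconds."""

    def __init__(self, path, max_age_seconds=60.0, clock=time.monotonic):
        self._path = path
        self._max_age = max_age_seconds  # assumed refresh period, tune as needed
        self._clock = clock              # injectable so the logic can be tested
        self._token = None
        self._loaded_at = None

    def token(self):
        now = self._clock()
        # Reload on first use, or when the cached copy is older than max_age.
        if self._token is None or now - self._loaded_at >= self._max_age:
            with open(self._path, encoding="utf-8") as f:
                self._token = f.read().strip()
            self._loaded_at = now
        return self._token

    def auth_header(self):
        # Build the header per request so a rotated token is picked up.
        return {"Authorization": f"Bearer {self.token()}"}
```

The idea would be to call `auth_header()` on every request rather than building the headers once at client creation. Since the kubelet rotates the projected token file before it expires, a short re-read interval should keep the client from ever presenting an expired token.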
Logs
What you expected to happen
The cluster autoscaler doesn't lose the ability to communicate with the Kubernetes API when the token in the projected volume expires and is replaced with a valid token.
Tagging @andrewsykim @kevin85421 per a discussion in the Ray Slack.
Reproduction script
I'm working with our dev team to get a code sample that we use to create the RayCluster object that gets sent into the cluster. As this is injected into the definition, I'm not sure how useful it may be for this issue.
Anything else
Notes on token lifetime: https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md
These lines of code and the stack trace led me to the Python code referenced above:
https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L120
https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L396
https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L454
Are you willing to submit a PR?