
[Bug] Head node autoscaler container fails to communicate with the Kubernetes API with a 401 in Azure Kubernetes 1.30 #2324


plytro commented 3 weeks ago

KubeRay Component

Others

What happened + What you expected to happen

What happened:

After upgrading to AKS 1.30, we noted that our head node pods work for approximately one hour, after which the autoscaler container in the pod starts getting HTTP 401 responses when querying the Kubernetes API. This causes the pod's readiness probe to fail, resulting in the loss of access to the head node via the LoadBalancer.

Through troubleshooting, we found the pod definition includes the following projected volume for the service account token used for API access, indicating the token has a lifetime of 3607 seconds:

- name: kube-api-access-shgh8
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token

As noted in the AKS 1.30 release notes, service account tokens are no longer given an extended lifetime; by default they now expire after the requested expirationSeconds (3607 seconds here, i.e. roughly one hour).

I'm not 100% positive I'm reading this code correctly, but it looks like the HTTP client is instantiated once, reads the token at instantiation, and never re-reads it to account for token expiration. We found that if we restart the head node pod, API communication starts working again and HTTP calls to the Kubernetes API succeed for another hour.
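
To illustrate the pattern I suspect, here is a minimal sketch of a client that re-reads the projected token on each request. The names here (K8sApiClient, the paths) are illustrative only, not Ray's actual implementation; a client that instead cached the token once in __init__ would show exactly the failure mode above.

import requests

# Standard in-cluster paths for the projected service account credentials.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_CERT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

class K8sApiClient:  # hypothetical name, for illustration only
    def __init__(self, base_url="https://kubernetes.default:443"):
        self._base_url = base_url
        self._session = requests.Session()
        self._session.verify = CA_CERT_PATH

    def get(self, path):
        # Re-read the token on every request: the kubelet rewrites the
        # projected file before the old token expires, so a fresh read
        # always returns a valid credential.
        with open(TOKEN_PATH) as f:
            token = f.read().strip()
        result = self._session.get(
            f"{self._base_url}/{path}",
            headers={"Authorization": f"Bearer {token}"},
        )
        result.raise_for_status()
        return result.json()

Re-reading the file per request is cheap; a middle ground would be to cache the token and re-read it only on a 401 or when the file's mtime changes.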

Logs

The Ray head is ready. Starting the autoscaler.
  File "/opt/app-root/.conda/envs/env/bin/ray", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2615, in main
    return cli()
           ^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2338, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 86, in run_kuberay_autoscaler
    ).run()
      ^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 584, in run
    self._run()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 389, in _run
    self.autoscaler.update()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 384, in update
    raise e
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 377, in update
    self._update()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 400, in _update
    self.non_terminated_nodes = NonTerminatedNodes(self.provider)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 124, in __init__
    self.all_node_ids = provider.non_terminated_nodes({})
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/batching_node_provider.py", line 162, in non_terminated_nodes
    self.node_data_dict = self.get_node_data()
                          ^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 333, in get_node_data
    self._raycluster = self._get(f"rayclusters/{self.cluster_name}")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 519, in _get
    return self.k8s_api_client.get(path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 273, in get
    result.raise_for_status()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://kubernetes.default:443/apis/ray.io/v1/namespaces/sandbox-plytro/rayclusters/1v1faq9sg7fsf2pcg4sqxs4er-0-raycluster-h662v

What you expected to happen

The cluster autoscaler should not lose the ability to communicate with the Kubernetes API when the token in the projected volume expires and is replaced with a valid one.

Tagging @andrewsykim @kevin85421 per a discussion in the Ray Slack.

Reproduction script

I'm working with our dev team to get a sample of the code we use to create the RayCluster object that gets sent into the cluster. Since the service account token volume is injected into the definition, I'm not sure how useful the sample will be for this issue.

Anything else

Notes on token lifetime: https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md
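
Also, for anyone debugging the same symptom: the projected token is a JWT, so its bound lifetime can be confirmed by decoding the exp claim from inside the head pod. A quick sketch:

import base64
import json
import time

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

with open(TOKEN_PATH) as f:
    token = f.read().strip()

# A JWT is header.payload.signature, each segment base64url-encoded.
payload = token.split(".")[1]
payload += "=" * (-len(payload) % 4)  # restore the stripped padding
claims = json.loads(base64.urlsafe_b64decode(payload))

print("token expires at", claims["exp"],
      "(in", claims["exp"] - int(time.time()), "seconds)")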

These lines of code, together with the stack trace, led me to the Python code referenced above:

https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L120
https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L396
https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L454

Are you willing to submit a PR?

andrewsykim commented 3 weeks ago

We should consider adopting the InClusterConfigLoader in the official Kubernetes Python client: https://github.com/kubernetes-client/python/blob/master/kubernetes/base/config/incluster_config.py

Or at least something similar that automatically refreshes the token: https://github.com/kubernetes-client/python/blob/392a8c1d0767ce534b121b3b0553e5b1297e430e/kubernetes/base/config/incluster_config.py#L95-L109
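
For reference, a minimal sketch of what the query from the traceback above could look like through the official client. The cluster name below is hypothetical, and automatic refresh depends on running a client version that includes the linked refresh logic:

from kubernetes import client, config

# load_incluster_config() uses InClusterConfigLoader, which reads the token
# and CA from the projected volume; client versions with the refresh logic
# linked above re-read the token file as it nears expiry, so a long-lived
# process keeps a valid credential after the kubelet rotates the token.
config.load_incluster_config()

api = client.CustomObjectsApi()
raycluster = api.get_namespaced_custom_object(
    group="ray.io",
    version="v1",
    namespace="sandbox-plytro",  # namespace from the traceback above
    plural="rayclusters",
    name="example-raycluster",   # hypothetical name
)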