ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feature] Display reconcile failures as events on ray clusters #2189

Open kwohlfahrt opened 2 weeks ago

kwohlfahrt commented 2 weeks ago

Description

I would like errors relating to a particular Ray cluster to be exposed as events on that cluster, not just as log lines on the operator pod. For example, we recently had a cluster fail to be created with the following log line:

2024-06-12T16:04:27.127Z    ERROR   controllers.RayCluster  If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. However, ServiceAccount foo is not found. Please create one. See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.    {"ServiceAccount": "data-processing/foo", "error": "ServiceAccount \"foo\" not found"}

However, running kubectl describe raycluster does not show any description of this issue:

<snip>
Status:
  Head:
  State:  failed
Events:   <none>

It would be useful for debugging if the Events field had some diagnostic about why the cluster entered the failed state.

Use case

Sometimes, a ray cluster can fail to be created due to a misconfiguration. It would be useful to see the reason for this failure when describing the ray cluster, rather than reading the operator logs.

This way, users can more directly find errors relevant to their cluster, rather than having to comb through all of the operator's logs. Also, users don't need to be given access to logs in the ray-system namespace.

Related issues

No response

Are you willing to submit a PR?