Search before asking
[X] I searched the issues and found no similar feature request.
Description
I would like errors relating to a particular Ray cluster to be exposed as events on that cluster, not just as log lines on the operator pod. For example, we recently had a cluster fail to be created with the following log line:
2024-06-12T16:04:27.127Z ERROR controllers.RayCluster If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. However, ServiceAccount foo is not found. Please create one. See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details. {"ServiceAccount": "data-processing/foo", "error": "ServiceAccount \"foo\" not found"}
However, running kubectl describe raycluster does not show any description of this issue:
<snip>
Status:
Head:
State: failed
Events: <none>
It would be useful for debugging if the Events field had some diagnostic about why the cluster entered the failed state.
Use case
Sometimes a Ray cluster fails to be created because of a misconfiguration. It would be useful to see the reason for the failure when describing the cluster, rather than having to read the operator logs.
This way, users can more directly find errors relevant to their cluster, rather than having to comb through all of the operator's logs. Also, users don't need to be given access to logs in the ray-system namespace.
Related issues
No response
Are you willing to submit a PR?