Why are these changes needed?
In our online environment, we run over 200 RayJobs daily, and some of them occasionally run into issues. Users typically report a problem to us in one of two ways:
1. They tell us that a specific job, for example `xgboost-rayjob-batch-xxx`, is failing. In that case, we can run `kubectl describe rayjob xgboost-rayjob-batch-xxx` to identify the head node of the RayJob, then log into the head pod, examine the relevant logs, and resolve the problem.
2. Occasionally, they report a problem with a particular Ray cluster. In that case, I would like to quickly locate the RayJob running on that cluster. Currently, the only way to do so is to filter the operator logs.
I believe it would be helpful to print the cluster information associated with a RayJob when users execute `kubectl get rayjob`. This would speed up problem identification.
I plan to add some additional information to RayJobs so that the cluster information is displayed when we execute `kubectl get rayjob -o wide`.
Related issue number
Checks