ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[RayJob] Add Cluster Name For Rayjob. #2046

Closed: slfan1989 closed this 3 months ago

slfan1989 commented 3 months ago

Why are these changes needed?

In our online environment, we run more than 200 RayJobs per day, and some of them occasionally run into problems. When that happens, users typically give us one of two kinds of information:

Most often, they tell us that a specific job, for example xgboost-rayjob-batch-xxx, is failing. In that case, we can run kubectl describe rayjob xgboost-rayjob-batch-xxx to find the head node of the RayJob, then log into the head pod to examine the relevant logs and resolve the problem.

Occasionally, users instead report a problem with a particular RayCluster. In that case, I would like to quickly find the RayJob running on that cluster, but currently the only way to do so is to filter the operator logs.
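The reverse lookup described in the second case (from a cluster name back to the RayJobs that use it) can be sketched as below. This is only an illustration, not KubeRay code: `RayJobInfo` and `jobsOnCluster` are hypothetical names, and the sketch assumes each RayJob's cluster name has already been collected, for example from the job's status.

```go
package main

import "fmt"

// RayJobInfo is a hypothetical, trimmed-down view of a RayJob that keeps only
// the two fields needed for the lookup. It is not a KubeRay type.
type RayJobInfo struct {
	Name           string
	RayClusterName string
}

// jobsOnCluster returns the names of all jobs whose RayClusterName equals
// the given cluster name, in the order they appear in the input.
func jobsOnCluster(jobs []RayJobInfo, cluster string) []string {
	var names []string
	for _, j := range jobs {
		if j.RayClusterName == cluster {
			names = append(names, j.Name)
		}
	}
	return names
}

func main() {
	// Made-up sample data for illustration only.
	jobs := []RayJobInfo{
		{Name: "job-a", RayClusterName: "raycluster-1"},
		{Name: "job-b", RayClusterName: "raycluster-2"},
	}
	fmt.Println(jobsOnCluster(jobs, "raycluster-1"))
}
```

Without the cluster name exposed on the RayJob, this mapping has to be rebuilt by hand from the operator logs.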

I believe it would help if kubectl get rayjob printed the cluster information associated with each RayJob, since that would make problems faster to track down.

I plan to add the cluster name to the information shown for RayJobs, so that kubectl get rayjob -o wide displays it.
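One common way to add such a column to a CRD is a controller-gen `+kubebuilder:printcolumn` marker on the API type. The following is a minimal sketch under stated assumptions, not the actual diff of this PR: the types are trimmed stand-ins for the real KubeRay ones, and the status field name `rayClusterName` and its JSONPath are assumptions. A column with `priority=1` is shown only with `-o wide`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RayJobStatus is a trimmed stand-in for the real status type.
// The field name and JSON tag are assumptions made for this sketch.
type RayJobStatus struct {
	RayClusterName string `json:"rayClusterName,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Ray Cluster Name",type=string,JSONPath=".status.rayClusterName",priority=1

// RayJob is a trimmed stand-in for the real RayJob type. The printcolumn
// marker above asks controller-gen to add an extra printer column to the
// generated CRD; priority=1 restricts it to the wide output.
type RayJob struct {
	Status RayJobStatus `json:"status,omitempty"`
}

// statusJSON serializes the job so the path targeted by the JSONPath
// expression (.status.rayClusterName) is visible in the output.
func statusJSON(job RayJob) (string, error) {
	b, err := json.Marshal(job)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// "example-raycluster" is a made-up value for illustration.
	out, err := statusJSON(RayJob{Status: RayJobStatus{RayClusterName: "example-raycluster"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With a marker like this, the CRD manifests need to be regenerated and reapplied before the new column appears in kubectl output.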

slfan1989 commented 3 months ago

@kevin85421 Can you help review this PR? Thank you very much!

slfan1989 commented 3 months ago

@kevin85421 Thank you very much for reviewing the code!