Open han-steve opened 3 weeks ago
Honestly, I don't think KubeRay should handle and expose K8s Pod errors. You can think of RayCluster as equivalent to multiple ReplicaSets. ReplicaSetStatus doesn't include "Pod failure" in its status. Maybe we can introduce a new conditions field to handle Pod-level observability. Currently, the RayCluster state includes Failed, which is quite undefined and makes the state machine rather messy. I am planning to refactor the RayCluster status soon. If you are interested, we can work on it together, or you can provide feedback on my design document.
I'd be happy to help out here in case the help is needed @han-steve!
@MadhavJivrajani Great! I will let you know when I have a doc.
Thanks for the response. I agree that the status state machine can get messy with pod failure statuses. An alternative would be to use the Conditions field to reflect the errors in the underlying cluster. For example, ReplicaSet and Deployment use a Condition to inform a user that the pods fail to scale up due to a resource quota error. They also produce events that can be easily seen with a kubectl describe.
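As a rough illustration of the Conditions idea, here is a minimal Python sketch. It is not KubeRay code (KubeRay itself is written in Go). The `Condition` class and the `replica_failure_condition` helper are hypothetical, and the lowercase `failed` state string is an assumption about how the RayCluster reports its state. The `ReplicaFailure` type and `FailedCreate` reason mirror what a Deployment typically reports when Pod creation is rejected, for example by a resource quota.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    """Simplified Kubernetes-style status condition (hypothetical model)."""
    type: str
    status: str  # "True", "False", or "Unknown"
    reason: str
    message: str


def replica_failure_condition(cluster_state: str, cluster_reason: str) -> Optional[Condition]:
    """Derive a condition for the RayJob from the RayCluster's state.

    Assumption: the cluster reports the string "failed" together with a
    human-readable reason when its Pods cannot be created.
    """
    if cluster_state != "failed":
        # Nothing to surface; the cluster has not reported a failure.
        return None
    return Condition(
        type="ReplicaFailure",
        status="True",
        reason="FailedCreate",
        message=cluster_reason,
    )
```

With something like this, the job status could stay as it is while the condition carries the quota error, so the state machine would not need an extra failure state.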
Our goal is to surface the underlying error to the user so they know if a job is pending or stuck due to resource quota errors. If there's no plan to surface these conditions, we'll query the associated ray cluster for this info to show to the user. Thanks again for taking a look!
I have already worked on a document. I will let you know when it is ready for review.
cc @MortalHappiness
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When you create a RayJob that exceeds the resource quota, the error is not surfaced; instead, the RayJob's status stays at Initializing. However, the RayCluster status correctly reflects that the cluster has failed due to a resource quota issue.
It would be great if the RayJob status could also accurately reflect the failed state of the underlying RayCluster, such as by setting jobDeploymentStatus to Failed and message to the error message.
Reproduction script
I can contribute an integration test for this, but it'll be too long for here.
Anything else
No response
Are you willing to submit a PR?