ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] RayJob should surface errors with underlying RayCluster #2182

Open han-steve opened 3 weeks ago

han-steve commented 3 weeks ago

KubeRay Component

ray-operator

What happened + What you expected to happen

When you create a RayJob that exceeds the resource quota, the error is not surfaced. Instead, the RayJob's status stays at Initializing:

status:
  jobDeploymentStatus: Initializing
  jobId: rayjob-test2-krlp9
  rayClusterName: rayjob-test2-raycluster-bsh5z
  rayClusterStatus:
    desiredCPU: "0"
    desiredGPU: "0"
    desiredMemory: "0"
    desiredTPU: "0"
    head: {}

However, the RayCluster status correctly reflects that the cluster has failed due to a resource quota issue:

status:
  desiredCPU: "0"
  desiredGPU: "0"
  desiredMemory: "0"
  desiredTPU: "0"
  head: {}
  reason: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
    exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi,
    used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m'
  state: failed

It would be great if the RayJob status could also accurately reflect the failed state of the underlying RayCluster, for example by setting jobDeploymentStatus to Failed and message to the RayCluster's error message.
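To make the request concrete, here is a minimal sketch of the propagation described above, written against plain dictionaries that mirror the two status blocks. The helper name `propagate_cluster_failure` is hypothetical, and this is not how the KubeRay operator is implemented; it only illustrates the desired mapping from the RayCluster's `state` and `reason` to the RayJob's `jobDeploymentStatus` and `message`.

```python
from copy import deepcopy


def propagate_cluster_failure(rayjob_status: dict, raycluster_status: dict) -> dict:
    """Return a copy of the RayJob status that mirrors a failed RayCluster.

    Hypothetical helper for illustration; not KubeRay's reconciliation logic.
    """
    updated = deepcopy(rayjob_status)
    # Only act when the cluster reports the "failed" state shown above.
    if raycluster_status.get("state") == "failed":
        updated["jobDeploymentStatus"] = "Failed"
        # Reuse the cluster's reason as the job-level message.
        updated["message"] = raycluster_status.get("reason", "RayCluster failed")
    return updated


# Status values copied from the YAML in this issue.
rayjob_status = {
    "jobDeploymentStatus": "Initializing",
    "jobId": "rayjob-test2-krlp9",
    "rayClusterName": "rayjob-test2-raycluster-bsh5z",
}
raycluster_status = {
    "state": "failed",
    "reason": (
        'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden: '
        "exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi, "
        "used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m"
    ),
}

result = propagate_cluster_failure(rayjob_status, raycluster_status)
print(result["jobDeploymentStatus"])
print(result["message"])
```

The input status is left untouched, so the same function could be used on the client side to decide what to show a user while the RayJob itself still reports Initializing.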

Reproduction script

I can contribute an integration test for this, but it would be too long to include here.

kevin85421 commented 3 weeks ago

Honestly, I don't think KubeRay should handle and expose K8s Pod errors. You can think of a RayCluster as equivalent to multiple ReplicaSets, and ReplicaSetStatus doesn't include "Pod failure" in its status. Maybe we can introduce a new conditions field to handle Pod-level observability. Currently, the RayCluster state includes Failed, which is poorly defined and makes the state machine rather messy. I am planning to refactor the RayCluster status soon. If you are interested, we can work on it together, or you can provide feedback on my design document.

MadhavJivrajani commented 3 weeks ago

I'd be happy to help out here in case the help is needed @han-steve!

kevin85421 commented 3 weeks ago

@MadhavJivrajani Great! I will let you know when I have a doc.

han-steve commented 2 weeks ago

Thanks for the response. I agree that the status state machine can get messy with pod failure statuses. An alternative would be to use the Conditions field to reflect errors in the underlying cluster. For example, ReplicaSet and Deployment use a Condition to inform the user when pods fail to scale up due to a resource quota error. They also emit events that are easy to see with kubectl describe.
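As a rough illustration of the Conditions approach, the sketch below builds a condition shaped like the ReplicaFailure condition that ReplicaSet reports (reason FailedCreate) and merges it into a conditions list. Whether RayCluster or RayJob would adopt these exact type and reason names is an assumption, and the helper names `replica_failure_condition` and `set_condition` are hypothetical.

```python
from datetime import datetime, timezone


def replica_failure_condition(message: str) -> dict:
    """Build a condition modeled on ReplicaSet's ReplicaFailure condition.

    Assumption: the type and reason are borrowed from ReplicaSet for
    illustration; KubeRay may choose different names.
    """
    return {
        "type": "ReplicaFailure",
        "status": "True",
        "reason": "FailedCreate",
        "message": message,
        # RFC 3339 timestamp in UTC, the format Kubernetes uses for condition times.
        "lastTransitionTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def set_condition(conditions: list, new: dict) -> list:
    """Return a new list where `new` replaces any condition of the same type."""
    kept = [c for c in conditions if c.get("type") != new["type"]]
    return kept + [new]


quota_message = (
    'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden: '
    "exceeded quota: low-resource-quota"
)
conditions = set_condition([], replica_failure_condition(quota_message))
# Setting the same condition type again replaces the entry instead of duplicating it.
conditions = set_condition(conditions, replica_failure_condition(quota_message))
print(len(conditions), conditions[0]["type"], conditions[0]["reason"])
```

Keeping one entry per condition type leaves the top-level state untouched, which is the property that avoids complicating the state machine.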

Our goal is to surface the underlying error to users so they know whether a job is pending or stuck due to resource quota errors. If there's no plan to surface these conditions, we'll query the associated RayCluster for this information and show it to the user. Thanks again for taking a look!

kevin85421 commented 2 weeks ago

I have already worked on a document. I will let you know when it is ready for review.

kevin85421 commented 1 week ago

cc @MortalHappiness