ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
1.2k stars, 389 forks

[Bug] RayJob does not shut down the submitter pod properly #2359

Open Moonquakes opened 1 month ago

Moonquakes commented 1 month ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

In some cases with KubeRay v1.0.0, especially when a RayJob requests a lot of resources and runs for a long time (more than half an hour), the job completes but the log output does not: the normal success message is never printed, although the end of the job's output can be seen in the dashboard. When this happens, the RayJob stays stuck and the submitter pod is not cleaned up.

The status information returned by KubeRay is shown in the figure below. [image: img_v3_02ef_97fbe77b-d958-4ebb-929c-31daf282b13g]

After I upgraded to v1.1.1, not only was the submitter pod not cleaned up, but the head node was not cleaned up either. The jobDeploymentStatus field showed Running, and nothing else changed.
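
For anyone trying to confirm the same symptom, here is a minimal sketch of a check on the RayJob status. It assumes the status is available as a dict (for example, the `.status` object from `kubectl get rayjob <name> -o json`) and that it carries the `jobStatus` and `jobDeploymentStatus` fields. The function name `looks_stuck` and the set of terminal values are illustrative assumptions, not part of KubeRay. If the operator never records the terminal job status at all, this check returns False, and the dashboard has to be compared by hand.

```python
from typing import Any, Mapping

# Assumption: the terminal values Ray reports for a submitted job.
TERMINAL_JOB_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def looks_stuck(status: Mapping[str, Any]) -> bool:
    """Return True when the Ray job is finished but the RayJob still reports Running.

    `status` is assumed to be the `.status` object of a RayJob custom resource.
    """
    job_status = str(status.get("jobStatus", "")).upper()
    deployment_status = str(status.get("jobDeploymentStatus", ""))
    return job_status in TERMINAL_JOB_STATUSES and deployment_status == "Running"


if __name__ == "__main__":
    # Hypothetical status resembling the symptom described above.
    example = {"jobStatus": "SUCCEEDED", "jobDeploymentStatus": "Running"}
    print(looks_stuck(example))
```

A True result that persists long after the job ended would match the behavior described here. A False result only means the recorded status does not show it.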

Reproduction script

This is easy to reproduce with a RayJob that occupies a lot of resources and runs for a long time.

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 1 month ago

RayJob has improved a lot in KubeRay v1.1.0, so I'm not surprised that there are some stability issues in v1.0.0. However, I am surprised that KubeRay v1.1.1 also has the issue. Would you mind (1) checking the KubeRay v1.1.1 operator logs to see whether there are any entries related to this logic, and (2) providing a simple RayJob YAML so that I can check whether your config is correct?
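
As a rough aid for step (1), the following sketch narrows saved operator logs down to the lines that mention one RayJob. It assumes the logs have already been written to a text file or read into a list of strings. The function name `lines_mentioning` and the RayJob name `my-rayjob` are hypothetical, and the sample lines are placeholders, not real KubeRay output.

```python
from typing import Iterable, List


def lines_mentioning(log_lines: Iterable[str], rayjob_name: str) -> List[str]:
    """Return the log lines that mention the given RayJob name (case-insensitive)."""
    needle = rayjob_name.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


if __name__ == "__main__":
    # Placeholder lines, only to show the filtering behavior.
    sample = [
        "placeholder line about my-rayjob\n",
        "placeholder line about another resource\n",
        "second placeholder line about MY-RAYJOB\n",
    ]
    for line in lines_mentioning(sample, "my-rayjob"):
        print(line)
```

Passing an open file object as `log_lines` works as well, since the function only iterates over lines.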