Open Moonquakes opened 1 month ago
RayJob has improved a lot in KubeRay v1.1.0, so I’m not surprised that there are some stability issues in v1.0.0. However, I am surprised that KubeRay v1.1.1 also has the issue. Would you mind (1) checking the KubeRay v1.1.1 logs to see if there are any logs related to this logic, and (2) providing a simple RayJob YAML so that I can check whether you use the correct config or not?
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
With KubeRay v1.0.0, in some cases, especially when a RayJob requests a lot of resources and runs for a long time (more than half an hour), the job itself completes but its log output does not: the normal success message is never printed, although the job's final output is visible in the dashboard. The RayJob then stays stuck and the submitter pod is not cleaned up.
The status information returned by KubeRay is shown in the figure below.
After I upgraded to v1.1.1, the problem got worse: in addition to the submitter pod, the head node was not cleaned up either. The jobDeploymentStatus field stayed at Running, and nothing else in the status changed.
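To make the symptom easier to check, here is a minimal sketch of the condition described above: the dashboard reports the job as finished while `jobDeploymentStatus` has not moved on. The function name `looks_stuck` and both sets of status strings are assumptions for illustration, not values taken from this report; adjust them to what your KubeRay version actually emits.

```python
# Hypothetical helper: flags the "job finished but RayJob never completes" symptom.
# The status strings below are assumptions, not copied from the KubeRay source.
FINISHED_JOB_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}
FINISHED_DEPLOYMENT_STATES = {"Complete", "Failed"}


def looks_stuck(dashboard_job_state: str, job_deployment_status: str) -> bool:
    """Return True when the dashboard shows a finished job but the RayJob's
    jobDeploymentStatus is still in a non-finished state such as Running."""
    job_finished = dashboard_job_state in FINISHED_JOB_STATES
    deployment_finished = job_deployment_status in FINISHED_DEPLOYMENT_STATES
    return job_finished and not deployment_finished


if __name__ == "__main__":
    # The situation in this issue: the job has ended, the deployment status is still Running.
    print(looks_stuck("SUCCEEDED", "Running"))
    # A healthy finished RayJob for comparison.
    print(looks_stuck("SUCCEEDED", "Complete"))
```

The two arguments could be read from the Ray dashboard and from the RayJob's status (for example via `kubectl get rayjob <name> -o yaml`), which keeps the check itself independent of any client library.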
Reproduction script
It is easy to reproduce: submit a RayJob that requests a lot of resources and runs for a long time (more than half an hour).
Anything else
No response
Are you willing to submit a PR?