Closed ByronHsu closed 6 months ago
You can refer to activeDeadlineSeconds.
Closing this issue. Feel free to reopen it if my comment doesn't resolve your question.
I understand that activeDeadlineSeconds only limits the time until the RayJob's JobDeploymentStatus reaches completion or termination.
When resources are scarce, the RayJob JobDeploymentStatus can still be 'Running'.
Is there another option to set a time limit, or to raise an error, when a worker pod lacks resources? Worker message: 'Error: No available node types can fulfill resource request {'CPU': 6.0}. Add suitable node types to this cluster to resolve this issue.'
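For illustration only, a client-side pre-flight check could reject a request that no node type can ever satisfy before the job is submitted. This is a minimal sketch, not a KubeRay or Ray feature: the helper names `can_fulfill` and `check_or_fail` and the node-type list are hypothetical, and in a real cluster the capacities would have to come from somewhere such as the autoscaler configuration.

```python
from typing import Dict, List


def can_fulfill(request: Dict[str, float], node_types: List[Dict[str, float]]) -> bool:
    """Return True if at least one node type covers every resource in the request.

    Hypothetical helper: `node_types` is a hand-written list of per-node
    capacities, not something read from Ray or KubeRay.
    """
    return any(
        all(node.get(name, 0.0) >= amount for name, amount in request.items())
        for node in node_types
    )


def check_or_fail(request: Dict[str, float], node_types: List[Dict[str, float]]) -> None:
    """Raise immediately instead of waiting on a request that can never be placed."""
    if not can_fulfill(request, node_types):
        raise RuntimeError(f"No node type can fulfill resource request {request}")


if __name__ == "__main__":
    # Assumed example capacities: every worker node type offers 4 CPUs.
    worker_types = [{"CPU": 4.0}]
    check_or_fail({"CPU": 2.0}, worker_types)  # fits on one node, passes silently
    try:
        check_or_fail({"CPU": 6.0}, worker_types)
    except RuntimeError as err:
        print(err)
```

This only catches requests that are infeasible on any single node type; it does not help when the cluster is merely busy.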
@cmg7111 how about using a gang scheduler e.g. Kueue?
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
Under X < Y, I run a Ray task requesting X CPUs on a Ray cluster with Y CPUs. The Ray task keeps emitting warnings, but the Ray job still has not failed after 2 hours. I am not sure whether there is a default timeout for this.
The expected behavior is that the job should time out and fail with a clear error message after an adjustable period.
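To make the expected behavior concrete, here is a minimal sketch of an adjustable deadline applied on the client side. It is not how Ray or KubeRay behaves today. The function name `wait_or_fail` and the `is_ready` callback are hypothetical; the callback stands in for whatever check tells the caller that the task has been scheduled.

```python
import time
from typing import Callable


def wait_or_fail(is_ready: Callable[[], bool], timeout_s: float, poll_s: float = 1.0) -> None:
    """Poll `is_ready` until it returns True, or raise TimeoutError after `timeout_s`.

    Hypothetical helper: `is_ready` is supplied by the caller and is assumed
    to be cheap and side-effect free.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"Resources were not available within {timeout_s} seconds; failing the job."
            )
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))


if __name__ == "__main__":
    try:
        # A check that never succeeds, standing in for an unschedulable task.
        wait_or_fail(lambda: False, timeout_s=0.2, poll_s=0.05)
    except TimeoutError as err:
        print(err)
```

For a single task, passing `timeout` to `ray.get` gives a similar client-side limit, since it raises `GetTimeoutError` when the result is not ready in time.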
Reproduction script
Anything else
No response
Are you willing to submit a PR?