Open bibibox opened 2 weeks ago
What is the problem you're trying to solve
Currently, when a Volcano job hits an error and you want to trigger rescheduling, the only way to retry is through the RestartJob action. Because it restarts the whole job, its impact is sometimes larger than necessary.
@bibibox As far as I know, besides RestartJob, a RestartTask action is already supported in Volcano. Could you clarify your requirement?
Although RestartTask is listed among Volcano's actions, it is not actually implemented yet, so it is currently unavailable.
Describe the solution you'd like
It would be desirable to introduce new capabilities, such as individually restarting a specific Pod/Task to minimize the impact.
Furthermore, it could allow graded escalation. For example, first attempt RestartPod/RestartTask; if the pod does not recover within a certain period, then try RestartJob.
Additional context
After implementing the above capability, we can use it as follows:
When Pod A exits with an error, it triggers a PodFailed event, which in turn triggers a RestartTask action, so the Volcano controller attempts to recreate Pod A. The recreation triggers a PodEvicted event, which in turn schedules a RestartJob action to run after 10 minutes.
If Pod A returns to the running state within that window, the delayed RestartJob action is canceled and everything returns to normal. If Pod A still has not reached the running state after 10 minutes, a broader rebuild, RestartJob in this scenario, is attempted.