volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.25k stars 971 forks

Support More Actions For Volcano Job Failure Scenario #3812

Open bibibox opened 2 weeks ago

bibibox commented 2 weeks ago

What is the problem you're trying to solve

Currently, when a Volcano job runs into an error, the only way to trigger a retry and rescheduling is the RestartJob action. Its scope of impact is sometimes too large.

Describe the solution you'd like

It would be desirable to introduce finer-grained actions, such as restarting an individual Pod or Task, to minimize the impact.

Furthermore, recovery could be graded. For example, first attempt RestartPod or RestartTask. If the job has not recovered within a certain period, fall back to RestartJob.

Additional context

After implementing the above capability, we can use it as follows:

# other config
spec:
  policies:
  - event: PodFailed
    action: RestartTask
  - event: PodEvicted
    action: RestartJob
    timeout: 10m
# other config

When Pod A exits with an error, it triggers a PodFailed event, which triggers the RestartTask action, so the Volcano controller attempts to recreate Pod A. The recreation in turn triggers a PodEvicted event, which schedules a RestartJob action to run after 10 minutes.

If Pod A returns to the Running state within that window, the delayed RestartJob action is canceled and everything returns to normal. If Pod A still has not reached the Running state after 10 minutes, a broader recovery action is attempted, which in this case is RestartJob.

william-wang commented 2 weeks ago

@bibibox As far as I know, besides RestartJob, the RestartTask action is already supported in Volcano. Could you clarify your requirement?

bibibox commented 2 weeks ago

Although there is a RestartTask action among Volcano's actions, it is not actually implemented yet and is currently unavailable.