volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

task maxRetry vs job maxRetry for spot instances #2837

Open yeenow123 opened 1 year ago

yeenow123 commented 1 year ago

What happened:

We are running a Volcano job with one task on AWS Spot Instances with the following retry settings:

1 task per job

job.maxRetry = 1
task.maxRetry = 3
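
For illustration, a manifest with these settings might look roughly like the sketch below. It is not the manifest from the original report: the job name, task name, image, and command are placeholders, and it assumes "1 task per job" means a single task with one replica.

```yaml
# Hypothetical manifest reproducing the retry settings described above.
# Name, image, and command are placeholders, not from the original report.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spot-retry-example     # placeholder
spec:
  schedulerName: volcano
  minAvailable: 1
  maxRetry: 1                  # job-level retry limit (job.maxRetry)
  tasks:
    - name: main               # placeholder
      replicas: 1              # assumption: one task, one pod
      maxRetry: 3              # task-level retry limit (task.maxRetry)
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox   # placeholder
              command: ["sh", "-c", "sleep 3600"]
```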

The pod gets scheduled onto a node. The node, being a spot instance, goes away while the pod is running. The pod then appears to be retried 3 more times on the same node, which no longer exists. I'd expect it to be retried on another node. The volcano-controllers log shows the following line repeated in rapid succession:

Finished Job [REDACTED] killing, current version 4

What you expected to happen:

The pod is rescheduled onto another node and runs there.

Is the pod expected to be rescheduled on a different node after the first failure? Should I be using job maxRetry?

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

stale[bot] commented 1 year ago

Hello 👋 Looks like there was no activity on this issue for the last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there is no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).