Open yeenow123 opened 1 year ago
What happened:
We are running a Volcano job with one task on AWS Spot Instances, with the following retry setting:
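Since the exact setting is not reproduced here, the sketch below shows the general shape of the kind of Volcano Job retry configuration meant, assuming a job-level maxRetry plus RestartJob policies; the job name, task name, image, and command are hypothetical placeholders:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spot-example          # hypothetical name
spec:
  schedulerName: volcano
  minAvailable: 1
  maxRetry: 3                 # job-level retry budget
  policies:
    - event: PodEvicted       # e.g. spot node reclaimed
      action: RestartJob
    - event: PodFailed
      action: RestartJob
  tasks:
    - name: main              # single task, as described in the issue
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox                      # placeholder workload
              command: ["sh", "-c", "sleep 3600"]
```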
The pod gets scheduled onto a node. The node (being a spot instance) goes away during the execution of the pod. It appears the pod is then retried 3 more times on the same node, which no longer exists; I'd expect it to be retried on another node. The volcano-controllers log shows, in rapid succession:

Finished Job [REDACTED] killing, current version 4
What you expected to happen:
The pod would be rescheduled onto another node to execute.
Is the pod expected to be rescheduled on a different node after the first failure? Should I be using job maxRetry?
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.24.1
- Kernel (e.g. uname -a):