Closed shinytang6 closed 1 year ago
My workaround: if the podgroup is unschedulable and its current state is Running, we convert it to Unknown instead of letting it become Pending and then enter the enqueue action again.
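A minimal sketch of that workaround rule, for illustration only. The Phase type, the constants, and the workaroundPhase function are hypothetical stand-ins and not Volcano's actual types; the fallback argument stands in for whatever phase the existing logic would otherwise compute.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the PodGroup phase type.
type Phase string

const (
	Pending Phase = "Pending"
	Inqueue Phase = "Inqueue"
	Running Phase = "Running"
	Unknown Phase = "Unknown"
)

// workaroundPhase applies only the rule from the comment above:
// an unschedulable PodGroup that is currently Running becomes Unknown.
// In every other case it returns fallback unchanged.
func workaroundPhase(current Phase, unschedulable bool, fallback Phase) Phase {
	if unschedulable && current == Running {
		return Unknown
	}
	return fallback
}

func main() {
	// Running and unschedulable: marked Unknown, so it never reaches Pending.
	fmt.Println(workaroundPhase(Running, true, Pending))
	// Not Running: the rule does not apply and the fallback is used.
	fmt.Println(workaroundPhase(Inqueue, true, Pending))
	// Schedulable: the rule does not apply either.
	fmt.Println(workaroundPhase(Running, false, Running))
}
```

Because the group never becomes Pending in the first case, it would not be picked up by the enqueue action again, which is the point of the workaround.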
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
I ran into a similar issue when adopting volcano in our project. I think there is a bug in the state change logic of the jobStatus function:
When there are not enough resources, the pg should fall back from Inqueue to Pending, according to the state changes described in the delay-pod-creation design doc.
The else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue condition seems to be reversed now. If it is changed to follow the design doc, the Running state will not be changed to Pending in your case.
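To make the difference concrete, here is a small sketch comparing the quoted condition with the reversed one, as I read the proposal. It models only the "not enough resources" branch; the Phase type and the function names currentBranch and proposedBranch are hypothetical and not taken from the Volcano source.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the PodGroup phase type.
type Phase string

const (
	Pending Phase = "Pending"
	Inqueue Phase = "Inqueue"
	Running Phase = "Running"
)

// currentBranch mirrors the quoted condition: every phase except
// Inqueue is set to Pending when resources are insufficient.
func currentBranch(phase Phase) Phase {
	if phase != Inqueue {
		return Pending
	}
	return phase
}

// proposedBranch reverses the condition: only Inqueue falls back
// to Pending; other phases are left as they are.
func proposedBranch(phase Phase) Phase {
	if phase == Inqueue {
		return Pending
	}
	return phase
}

func main() {
	for _, p := range []Phase{Pending, Inqueue, Running} {
		fmt.Printf("%-8s current=%-8s proposed=%s\n", p, currentBranch(p), proposedBranch(p))
	}
}
```

Under the quoted condition a Running group is pushed back to Pending and an Inqueue group stays Inqueue; under the reversed condition a Running group keeps its phase and an Inqueue group falls back to Pending.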
PodGroupCompleted, which was introduced in PR #2667, also helps in this case if the pods complete successfully. But I think the pod error case is missed here.
I can make a PR if it makes sense. @shinytang6
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: job yaml:
In this case (cleanPodPolicy=OnCompletion), when the pods complete, they are deleted by the paddlejob controller. The PodGroup status transition is then Inqueue => Running => Pending (all the pods deleted) => Inqueue (passes the enqueue action again), after which it remains Inqueue indefinitely. This results in unnecessary resource occupation.
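The sequence above can be traced with a toy state machine. This is only a model of the transitions as described in this issue, not scheduler code; the Event names and the next and trace functions are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Phase and Event are hypothetical stand-ins used only for this trace.
type Phase string
type Event string

const (
	Pending Phase = "Pending"
	Inqueue Phase = "Inqueue"
	Running Phase = "Running"

	PodsStarted   Event = "pods started"
	PodsDeleted   Event = "all pods deleted by the job controller"
	EnqueuePassed Event = "enqueue action passed"
)

// next encodes only the transitions listed in the description.
// Any other combination leaves the phase unchanged.
func next(p Phase, e Event) Phase {
	switch {
	case p == Inqueue && e == PodsStarted:
		return Running
	case p == Running && e == PodsDeleted:
		return Pending
	case p == Pending && e == EnqueuePassed:
		return Inqueue
	}
	return p
}

// trace returns the starting phase followed by the phase after each event.
func trace(start Phase, events []Event) []Phase {
	phases := []Phase{start}
	for _, e := range events {
		phases = append(phases, next(phases[len(phases)-1], e))
	}
	return phases
}

func main() {
	phases := trace(Inqueue, []Event{PodsStarted, PodsDeleted, EnqueuePassed})
	parts := make([]string, len(phases))
	for i, p := range phases {
		parts[i] = string(p)
	}
	fmt.Println(strings.Join(parts, " => "))
}
```

After the last step no new pods are created, so in this model nothing moves the group out of Inqueue again, which matches the stuck state reported here.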
related logic:
Environment: