volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Podgroup state changed from running to inqueue after pod deleted #2208

Closed: shinytang6 closed this issue 1 year ago

shinytang6 commented 2 years ago

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: job yaml:

apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
  name: wide-ande-deep2
spec:
  cleanPodPolicy: OnCompletion
  withGloo: 1
  worker:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
  ps:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1

In this case (cleanPodPolicy=OnCompletion), when the pods complete they are deleted by the PaddleJob controller. The PodGroup status then transitions Inqueue => Running => Pending (all the pods deleted) => Inqueue (it passes the enqueue action again) and stays Inqueue from then on, which results in unnecessary resource occupation.

related logic:

func jobStatus(ssn *Session, jobInfo *api.JobInfo) scheduling.PodGroupStatus {
    status := jobInfo.PodGroup.Status

    unschedulable := false
    for _, c := range status.Conditions {
        if c.Type == scheduling.PodGroupUnschedulableType &&
            c.Status == v1.ConditionTrue &&
            c.TransitionID == string(ssn.UID) {
            unschedulable = true
            break
        }
    }

    // If running tasks && unschedulable, unknown phase
    if len(jobInfo.TaskStatusIndex[api.Running]) != 0 && unschedulable {
        status.Phase = scheduling.PodGroupUnknown
    } else {
        allocated := 0
        for status, tasks := range jobInfo.TaskStatusIndex {
            if api.AllocatedStatus(status) || status == api.Succeeded {
                allocated += len(tasks)
            }
        }

        // If there're enough allocated resource, it's running
        if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
            status.Phase = scheduling.PodGroupRunning
        } else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue {
            // here PodGroup status converts from Running to Pending
            status.Phase = scheduling.PodGroupPending
        }
    }

    status.Running = int32(len(jobInfo.TaskStatusIndex[api.Running]))
    status.Failed = int32(len(jobInfo.TaskStatusIndex[api.Failed]))
    status.Succeeded = int32(len(jobInfo.TaskStatusIndex[api.Succeeded]))

    return status
}
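
To spell out how this happens: once the PaddleJob controller deletes the completed pods, they drop out of jobInfo.TaskStatusIndex, so (if I read the code right) the next session roughly evaluates:

// No task is in api.Running anymore, so the unschedulable/Unknown branch is
// skipped; allocated is 0 (the succeeded pods are gone), which is below
// MinMember, and the current phase is Running, i.e. not Inqueue, so:
status.Phase = scheduling.PodGroupPending
// The next enqueue action then moves the PodGroup from Pending back to
// Inqueue, which is the unnecessary resource occupation mentioned above.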

Environment:

shinytang6 commented 2 years ago

My workaround: if the PodGroup is unschedulable and its current state is Running, convert it to Unknown instead of letting it become Pending and pass the enqueue action again.
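
A minimal sketch of that workaround against the jobStatus snippet above (not a merged fix): broaden the Unknown branch so that an unschedulable PodGroup whose current phase is Running also goes to Unknown instead of falling through to the Pending branch.

// Workaround sketch: a PodGroup that was Running and is now unschedulable is
// reported as Unknown, even if all of its pods are already gone, so it does
// not fall back to Pending and pass the enqueue action again.
if unschedulable && (len(jobInfo.TaskStatusIndex[api.Running]) != 0 ||
    jobInfo.PodGroup.Status.Phase == scheduling.PodGroupRunning) {
    status.Phase = scheduling.PodGroupUnknown
} else {
    // ... keep the allocated/MinMember handling from jobStatus above ...
}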

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 1 year ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

bood commented 4 weeks ago

I ran into a similar issue when adopting Volcano in our project. I think there is a bug in the state-change logic of the jobStatus function:

  1. When there are not enough resources, the PodGroup should fall back from Inqueue to Pending, according to the state changes in the delay-pod-creation design doc. The else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue condition seems to be inverted now (see the sketch after this list). If it is changed to follow the design doc, the Running state will not be changed to Pending in your case.

  2. PodGroupCompleted, introduced in PR #2667, also helps in this case if the pods complete successfully. But I think the pod-error case is missed there.
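
For point 1, here is a sketch of what I mean (following the state changes in the delay-pod-creation design doc; not a confirmed fix): the fallback to Pending would only apply to an Inqueue PodGroup that no longer has enough allocated tasks, so a Running PodGroup would keep its phase instead of being reset.

// Sketch of point 1: fall back to Pending only from Inqueue, rather than
// from every phase other than Inqueue.
if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
    status.Phase = scheduling.PodGroupRunning
} else if jobInfo.PodGroup.Status.Phase == scheduling.PodGroupInqueue {
    status.Phase = scheduling.PodGroupPending
}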

I can make a PR if it makes sense. @shinytang6