project-codeflare / multi-cluster-app-dispatcher

Holistic job manager on Kubernetes
Apache License 2.0

preemption behavior at scale #347

Open asm582 opened 1 year ago

asm582 commented 1 year ago

When `queuejobstate` and `State` do not agree, any AW that has completed pods is preempted. Finding the completed pods tied to an AW is flaky at scale when thousands of AWs are being preempted. Here is an example:

Status:
  Conditions:
    Last Transition Micro Time:  2023-05-04T12:45:41.619338Z
    Last Update Micro Time:      2023-05-04T12:45:41.619337Z
    Status:                      True
    Type:                        Init
    Last Transition Micro Time:  2023-05-04T12:49:48.450993Z
    Last Update Micro Time:      2023-05-04T12:49:48.450993Z
    Reason:                      AwaitingHeadOfLine
    Status:                      True
    Type:                        Queueing
    Last Transition Micro Time:  2023-05-04T12:49:57.520949Z
    Last Update Micro Time:      2023-05-04T12:49:57.520949Z
    Reason:                      FrontOfQueue.
    Status:                      True
    Type:                        HeadOfLine
    Last Transition Micro Time:  2023-05-04T13:20:36.238921Z
    Last Update Micro Time:      2023-05-04T13:20:36.238921Z
    Reason:                      AppWrapperRunnable
    Status:                      True
    Type:                        Dispatched
    Last Transition Micro Time:  2023-05-04T14:31:41.339363Z
    Last Update Micro Time:      2023-05-04T14:31:41.339363Z
    Reason:                      MinPodsNotRunning
    Status:                      True
    Type:                        PreemptCandidate
    Last Transition Micro Time:  2023-05-04T13:23:19.200744Z
    Last Update Micro Time:      2023-05-04T13:23:19.200744Z
    Message:                     Insufficient number of Running and Completed pods, minimum=1, running=0, completed=0.
    Reason:                      PreemptionTriggered
    Status:                      True
    Type:                        Backoff
    Last Transition Micro Time:  2023-05-04T13:27:19.850690Z
    Last Update Micro Time:      2023-05-04T13:27:19.850690Z
    Reason:                      PreemptionTriggered
    Status:                      True
    Type:                        Backoff
  Controllerfirsttimestamp:      2023-05-04T12:33:37.708915Z
  Filterignore:                  true
  Queuejobstate:                 Backoff
  Sender:                        before PreemptQueueJobs - CanRun: false
  State:                         Running
  Systempriority:                9
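The trigger described above boils down to a state comparison; here is a minimal sketch of it (field names follow the status dump above, but the logic is illustrative, not the actual MCAD code): when the `Queuejobstate` recorded in the status disagrees with the top-level `State`, the AppWrapper becomes a preemption candidate even though its pods may have already completed.

```go
package main

import "fmt"

// awStatus mirrors the two fields from the status dump that disagree.
// This is an illustrative sketch, not the actual MCAD types.
type awStatus struct {
	QueueJobState string // e.g. "Backoff"
	State         string // e.g. "Running"
}

// statesDisagree reports the mismatch that makes an AppWrapper a
// preemption candidate.
func statesDisagree(s awStatus) bool {
	return s.QueueJobState != s.State
}

func main() {
	// The example status above: Queuejobstate=Backoff, State=Running.
	s := awStatus{QueueJobState: "Backoff", State: "Running"}
	fmt.Println(statesDisagree(s)) // true: flagged for preemption
}
```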
metalcycling commented 1 year ago

Are you sure the mismatch is the problem? The fact that the line `Insufficient number of Running and Completed pods, minimum=1, running=0, completed=0.` shows `completed=0` is why the completed AppWrapper is being requeued. This was the issue we had yesterday. So are you seeing that when the statuses are not the same, the AppWrapper loses the count of pods running or completed?

asm582 commented 1 year ago

I think the jobs here are short (`sleep 10`). They complete very quickly, and at scale the status updates are not propagated accurately, causing the preemption thread to take down AppWrappers that have finished execution.
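The race being described can be sketched as follows (hypothetical names, not the actual MCAD code): the preemption thread applies the `minimum=1, running=0, completed=0` check against pod counts that are updated asynchronously, so a short job can finish before the AppWrapper status reflects it.

```go
package main

import "fmt"

// cachedStatus holds the pod tallies the preemption thread reads; at scale
// these may lag behind reality. Illustrative sketch only.
type cachedStatus struct {
	Running, Completed int
}

// belowMinAvailable is the check from the Backoff condition message above,
// applied to a possibly stale snapshot of the counts.
func belowMinAvailable(s cachedStatus, minMember int) bool {
	return s.Running+s.Completed < minMember
}

func main() {
	minMember := 1

	// t0: the pod has already completed, but the status update has not
	// propagated, so the snapshot still shows running=0, completed=0.
	stale := cachedStatus{}
	fmt.Println(belowMinAvailable(stale, minMember)) // true: AW wrongly preempted

	// t1: after the status sync the counts are accurate, but the
	// preemption decision was already made against the stale snapshot.
	synced := cachedStatus{Completed: 1}
	fmt.Println(belowMinAvailable(synced, minMember)) // false
}
```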