kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Overadmission after deleting resource from borrowing CQ #2678

Closed: gabesaba closed this issue 2 months ago

gabesaba commented 3 months ago

Consider the following scenario. We have two CQs in the same Cohort:

CQ1:
  1 CPU, 1 Memory
CQ2:
  2 CPU, 1 Memory

First, create WL1 in CQ1, which uses (2 CPU, 2 Memory), borrowing (1 CPU, 1 Memory) from the Cohort. Next, create WL2 in CQ2, which uses (1 CPU, 1 Memory). WL2 is initially suspended, as there is no available Memory.

Update the CQ definitions so that CQ1 no longer provides Memory:

CQ1:
  1 CPU
CQ2:
  2 CPU, 1 Memory

WL2 is now admitted while WL1 is still running. In total we have admitted (3 CPU, 3 Memory), while the Cohort has a total of only (3 CPU, 1 Memory).
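The arithmetic of the scenario can be checked with a small tally. This is only an illustrative sketch: the dictionaries and the `sum_resources` helper are made up for this comment and are not Kueue types or APIs.

```python
from collections import Counter


def sum_resources(items):
    """Add up per-resource quantities across a dict of name -> {resource: qty}."""
    total = Counter()
    for quantities in items.values():
        total.update(quantities)  # Counter.update adds counts per key
    return total


# Quotas after the update: CQ1 no longer provides Memory.
cqs_after = {"CQ1": {"cpu": 1}, "CQ2": {"cpu": 2, "memory": 1}}
# Requests of the two workloads that end up admitted.
workloads = {"WL1": {"cpu": 2, "memory": 2}, "WL2": {"cpu": 1, "memory": 1}}

capacity = sum_resources(cqs_after)   # cohort capacity
admitted = sum_resources(workloads)   # total admitted usage

# Resources whose admitted usage exceeds the cohort capacity.
overcommit = {r: admitted[r] - capacity[r] for r in admitted if admitted[r] > capacity[r]}
print(overcommit)  # {'memory': 2}
```

CPU is exactly at capacity (3 of 3), while Memory is over by 2.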

We filter out usage of FlavorResources that no longer exist here, so WL1's Memory usage is no longer counted once CQ1 stops defining Memory.
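To illustrate why dropping that usage leads to overadmission, here is a toy model of the admission check. It is not Kueue's actual code: `cohort_usage`, `fits`, and the `drop_undefined` switch are hypothetical names, and flavors are ignored for simplicity.

```python
def cohort_usage(admitted, quotas, drop_undefined):
    """Sum usage of admitted workloads across the cohort.

    admitted: list of (cq_name, requests) pairs.
    drop_undefined=True mimics the filtering described above: usage of a
    resource that the owning CQ no longer defines is discarded.
    """
    usage = {}
    for cq, requests in admitted:
        for res, qty in requests.items():
            if drop_undefined and res not in quotas[cq]:
                continue
            usage[res] = usage.get(res, 0) + qty
    return usage


def fits(requests, quotas, usage):
    """Return True if requests fit in the cohort capacity on top of usage."""
    capacity = {}
    for quota in quotas.values():
        for res, qty in quota.items():
            capacity[res] = capacity.get(res, 0) + qty
    return all(
        usage.get(res, 0) + qty <= capacity.get(res, 0)
        for res, qty in requests.items()
    )


quotas = {"CQ1": {"cpu": 1}, "CQ2": {"cpu": 2, "memory": 1}}  # after the update
running = [("CQ1", {"cpu": 2, "memory": 2})]                   # WL1
wl2 = {"cpu": 1, "memory": 1}

with_filter = fits(wl2, quotas, cohort_usage(running, quotas, drop_undefined=True))
without_filter = fits(wl2, quotas, cohort_usage(running, quotas, drop_undefined=False))
print(with_filter, without_filter)  # True False
```

With the filter, WL1's 2 Memory disappears from the cohort usage and WL2 appears to fit. When all usage is counted, WL2 stays pending.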

mbobrovskyi commented 3 months ago

/assign

mimowo commented 3 months ago

Just to clarify, the issue is about making sure no new workloads get admitted in this scenario (i.e., WL2 should not be admitted).

It is ok to let WL1 continue running. Eviction of over-committed workloads is out of scope for this issue. Over-commitment could happen even when the capacity of a single CQ is reduced. We will handle / prioritize this independently.
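As a minimal illustration of the single-CQ case, with made-up numbers and a hypothetical `overcommitted` helper (not a Kueue API): a workload already using 2 CPU becomes over-committed as soon as the CQ's quota is reduced to 1 CPU, without any cohort or borrowing involved.

```python
def overcommitted(quota, usage):
    """Return, per resource, how far usage exceeds quota (only where it does)."""
    return {
        res: used - quota.get(res, 0)
        for res, used in usage.items()
        if used > quota.get(res, 0)
    }


usage = {"cpu": 2}  # one running workload using 2 CPU

before = overcommitted({"cpu": 2}, usage)  # quota still covers the usage
after = overcommitted({"cpu": 1}, usage)   # quota reduced below the usage
print(before, after)  # {} {'cpu': 1}
```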

mbobrovskyi commented 3 months ago

Is this a valid use case? Should we support it? Or maybe we should prohibit users from doing that in the webhooks?

@alculquicondor WDYT?

alculquicondor commented 3 months ago

No, reducing or removing resources from a CQ is a valid use case.