agngrant opened this issue 1 month ago
Additional note: pausing the queue and marking the workloads that appear to be blocking it as inactive temporarily unblocks the queue, but reinserting them leads to the same issue.
Since you are using node labels in your Resource Flavor, this is likely a duplicate of https://github.com/kubernetes-sigs/kueue/issues/2391
Could you try using the image gcr.io/k8s-staging-kueue/kueue:v20240619-v0.7.0-8-gaa682c90 to see if the issue persists?
Sorry we missed your issue earlier. Feel free to ping one of the OWNERS if you don't receive a response within a week.
What happened: Submitting after a conversation in the Slack channel.
A K8s cluster has multiple ClusterQueues in the same cohort, each using the BestEffortFIFO queueing strategy.
One queue (queue-i) has nominal quota and can also borrow from another queue (queue-m). After some time, submissions to queue-i stopped being processed: a group of pending workloads appears to block subsequent workloads from being considered.
Jobs A, B and C seem to be examined continuously by the reconciler, while jobs E, F and G, which could fit within the available queue resources, are not scheduled.
Jobs E, F and G either receive no status or are not moved forward to use the available resources.
What you expected to happen:
Jobs A, B and C are waiting on 4 GPUs; jobs E and F require only CPU, and job G requires 1 GPU, which is available. I expected E, F and G to be admitted and run before A, B and C. This seems to be what the BestEffortFIFO description indicates.
This appears to work on other queues on the same cluster - though those queues are less populated.
How to reproduce it (as minimally and precisely as possible):
Five replicas of the queue controller.
With different variants of the following (with values for each ranging up to 1000 in different tests):
ResourceFlavors: 11, all defined with YAML such as:
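The original YAML was not attached; as a sketch, a node-label ResourceFlavor of the kind described might look like this (the label key and value are assumptions, not from the report):

```yaml
# Hypothetical ResourceFlavor keyed on a GPU node label.
# The label key/value are illustrative assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    gpu.example.com/type: a100
```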
Main cluster queue - has no user queue and can only be borrowed from.
Standard queues - can borrow from the main cluster queue up to a borrowing limit, but have no nominal quota.
Owned queue - can borrow from the main cluster queue, but also has nominal quota on specific resource flavors.
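The queue shapes above could be sketched roughly as follows (names, cohort, and quota values are assumptions for illustration; a standard queue would look like the owned queue but with nominalQuota: 0):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: main-cluster-queue   # borrowed from only; no LocalQueue points here
spec:
  cohort: shared-cohort      # assumed cohort name
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["cpu", "nvidia.com/gpu"]
    flavors:
    - name: gpu-a100
      resources:
      - name: cpu
        nominalQuota: 1000
      - name: nvidia.com/gpu
        nominalQuota: 64
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: owned-queue          # has its own nominal quota and may also borrow
spec:
  cohort: shared-cohort
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["cpu", "nvidia.com/gpu"]
    flavors:
    - name: gpu-a100
      resources:
      - name: cpu
        nominalQuota: 100
        borrowingLimit: 400
      - name: nvidia.com/gpu
        nominalQuota: 8
        borrowingLimit: 16
```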
Submit N jobs to the owned queue and M jobs to the other queues. A large amount of resource is still available in the main queue, but the owned queue has hit its limits on several GPU types.
Submit a job with no GPUs - it should run quickly, but it waits in the queue for days until GPUs are released and the queue moves forward.
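The GPU-free submission in the last step might look like the following (the LocalQueue name is an assumption):

```yaml
# Hypothetical CPU-only Job pointed at the owned queue's LocalQueue.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: cpu-only-
  labels:
    kueue.x-k8s.io/queue-name: owned-local-queue  # assumed LocalQueue name
spec:
  suspend: true          # Kueue admits the Job by unsuspending it
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sleep", "60"]
        resources:
          requests:
            cpu: "1"     # no GPU requested
```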
Anything else we need to know?:
According to the logs, the kueue-controller-manager appears to be looping over the same set of jobs in the queue.
This is an example of the log (job names and identifiers altered):
Environment:
- Kubernetes version (use `kubectl version`): v1.24.10 (RKE2)
- Kueue version (use `git describe --tags --dirty --always`): v0.6.2
- OS (e.g: `cat /etc/os-release`): Ubuntu 20.04
- Kernel (e.g. `uname -a`):