FWCoder opened 1 month ago
/assign
@FWCoder Can you share more on your ClusterQueue configuration and the node setup?
Let me explain why it's important. The total amount of resources in the ClusterQueue, on the basis of which the admission decision is made, may be sufficient. However, the node configuration may be such that the master "fits" onto one node while no equivalent node is available for the worker. In other words, if there are 10 CPUs available in the cluster, the master needs 5 and the worker needs 5, but the nodes provide 5, 3, and 2 CPUs, the workload will still be admitted even though the worker cannot be scheduled. Thus we need both the ClusterQueue configuration and the node setup to check whether that is what happened. Then we could look deeper, but so far I couldn't reproduce the issue.
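To make the distinction concrete, here is a minimal sketch of a ClusterQueue for the 10-CPU example above (the queue and flavor names are illustrative, not taken from this report). Kueue admits against `nominalQuota` only; per-node bin-packing is left to the kube-scheduler after admission:

```yaml
# Hypothetical ClusterQueue for the 10-CPU example.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue            # illustrative name
spec:
  namespaceSelector: {}          # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu"]
      flavors:
        - name: default-flavor   # assumes a ResourceFlavor with this name exists
          resources:
            - name: cpu
              nominalQuota: 10   # total quota; says nothing about node sizes (5/3/2)
```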
/triage needs-information
What happened: When I submitted a PyTorchJob that requires 8 GPUs on the Master and 8 GPUs on the Worker, it was admitted even though there are only 8 GPUs available in the ClusterQueue. Both Master and Worker pods were created, but only the Master pod can move to the `Init` and `Running` states. The Worker pod is stuck in `Pending` until the Master pod moves to the `Completed` state. At that point, the Worker pod gets stuck in the `Init` state, since it is waiting for the Master pod to come up (deadlock scenario). This happens with `waitForPodsReady` enabled.
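For reference, `waitForPodsReady` is turned on through the Kueue manager's configuration, roughly as below (the timeout shown is an illustrative value, not necessarily the one from my setup):

```yaml
# Excerpt of the Kueue Configuration with waitForPodsReady enabled.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 5m  # how long an admitted workload may wait for all its pods to become ready
```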
What you expected to happen: The Kueue controller manager should evaluate the total resources requested across both the Master and the Worker (8 + 8 = 16 GPUs) and block the job from being admitted until there is enough quota in the ClusterQueue.
How to reproduce it (as minimally and precisely as possible):
Job Details:
Create Job:
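The original manifest is not reproduced here; the following is a minimal sketch of a PyTorchJob with the shape described above: one Master replica and one Worker replica, each requesting 8 GPUs. The job name, queue name, and container image are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-gpu-deadlock               # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # placeholder LocalQueue name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: training-image:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8          # Master requests 8 GPUs
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: training-image:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8          # Worker requests 8 more GPUs
```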
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.28
- Kueue version (use `git describe --tags --dirty --always`): 0.6.1