kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Support Gang Scheduling for PytorchJob on Kueue #2796

Open FWCoder opened 1 month ago

FWCoder commented 1 month ago

What happened: When I submitted a PyTorchJob that requires 8 GPUs on the Master and 8 GPUs on the Worker, it was admitted even though there are only 8 GPUs available in the ClusterQueue. Both the Master and Worker pods were created, but only the Master pod could move to the Init and Running states. The Worker pod was stuck in Pending until the Master pod moved to the Completed state. At that point, the Worker pod got stuck in the Init state, since it was waiting for the Master pod to come up. (Deadlock scenario)

This happens with "waitForPodsReady" enabled.
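
For reference, waitForPodsReady is turned on in the Kueue manager Configuration, roughly like this (only the relevant fragment; the values shown are illustrative, not the exact settings from my cluster):

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true          # the setting referred to above
  timeout: 5m           # illustrative value
  blockAdmission: true  # illustrative value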

What you expected to happen: The Kueue controller manager should evaluate the total requested resources across both the Master and Worker replicas and block the job from being admitted until there are enough resources in the ClusterQueue.
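
My understanding is that Kueue turns the PyTorchJob into a single Workload with one podSet per replica type, so admission should be decided against the sum of both podSets (16 GPUs here, against a quota of 8). A rough sketch of what I would expect that Workload to look like (field values abbreviated, names illustrative):

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pytorchjob-hello-world-kueue   # illustrative name
spec:
  queueName: <LOCAL_QUEUE_NAME>
  podSets:
    - name: master        # podSet names shown here are illustrative
      count: 1
      template:
        spec:
          containers:
            - name: pytorch
              resources:
                requests:
                  nvidia.com/gpu: "8"   # Master asks for 8 GPUs
    - name: worker
      count: 1
      template:
        spec:
          containers:
            - name: pytorch
              resources:
                requests:
                  nvidia.com/gpu: "8"   # Worker asks for another 8 GPUs

With 16 GPUs requested in total and only 8 in the ClusterQueue, I would expect the Workload to remain unadmitted.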

How to reproduce it (as minimally and precisely as possible):

Job Details:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: <LOCAL_QUEUE_NAME>
  name: hello-world-kueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "60"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "10"
image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
  runPolicy:
    ttlSecondsAfterFinished: 604800

Create Job:

kubectl create -f hello-world-kueue.yaml
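
The LocalQueue and ClusterQueue are not shown above; the setup looks roughly like this (queue and flavor names are placeholders, and the CPU/memory quotas are illustrative; the relevant part is the nvidia.com/gpu nominal quota of 8):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: <CLUSTER_QUEUE_NAME>
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: <RESOURCE_FLAVOR_NAME>
          resources:
            - name: "cpu"
              nominalQuota: "172"     # illustrative
            - name: "memory"
              nominalQuota: 2074Gi    # illustrative
            - name: "nvidia.com/gpu"
              nominalQuota: "8"       # only 8 GPUs available in the ClusterQueue
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: <LOCAL_QUEUE_NAME>
spec:
  clusterQueue: <CLUSTER_QUEUE_NAME>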

Anything else we need to know?:

Environment:

mszadkow commented 1 week ago

/assign

mszadkow commented 1 week ago

@FWCoder Can you share more on:

mszadkow commented 1 week ago

Let me explain why this is important. The total amount of resources available in the ClusterQueue, on the basis of which the admission decision is made, may match the request. However, it is possible that the node configuration lets the master "fit" onto one node while the worker has no equivalent node at its disposal. In other words, if there are 10 CPUs available in the cluster, the master needs 5 and the worker needs 5, but the nodes provide 5, 3, and 2 CPUs respectively, the workload will still be admitted. That is why we need both the ClusterQueue configuration and the node setup to confirm whether this is what happened. Then we could look deeper, but so far I couldn't reproduce the issue.
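
To make the 5 / 3 / 2 example concrete, the node shapes I have in mind would look like this (purely hypothetical objects, trimmed to the allocatable CPU):

apiVersion: v1
kind: Node
metadata:
  name: node-a        # hypothetical
status:
  allocatable:
    cpu: "5"          # the 5-CPU master pod fits here
---
apiVersion: v1
kind: Node
metadata:
  name: node-b        # hypothetical
status:
  allocatable:
    cpu: "3"          # too small for the 5-CPU worker pod
---
apiVersion: v1
kind: Node
metadata:
  name: node-c        # hypothetical
status:
  allocatable:
    cpu: "2"          # too small for the 5-CPU worker pod

Quota-wise the ClusterQueue sees 10 CPUs, so the workload is admitted, but kube-scheduler can never place the worker pod.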

alculquicondor commented 1 week ago

/triage needs-information