volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Does Volcano support scheduling of resources such as "amd.com/gpu" or "nvidia.com/gpu"? #3170

Closed dojoeisuke closed 11 months ago

dojoeisuke commented 1 year ago

What would you like to be added:

Volcano should support the "amd.com/gpu" and "nvidia.com/gpu" resources.

Why is this needed:

I want to use Volcano to schedule the launch of a Kubeflow Notebook with a specified GPU. It seems that Kubeflow Notebooks currently do not support the "volcano.sh/vgpu-number" resource.

https://www.kubeflow.org/docs/components/notebooks/quickstart-guide/

lowang-bh commented 1 year ago

You need to install the corresponding device plugins for "amd.com/gpu" and "nvidia.com/gpu", and then request those resources in your pod YAMLs.
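
For reference, a minimal sketch of such a pod (assuming the NVIDIA device plugin or GPU Operator is already installed and Volcano is used as the scheduler; the name and image are only placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example            # placeholder name
spec:
  schedulerName: volcano       # let Volcano schedule this pod
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.2-base-ubuntu20.04
      command: ["sleep", "100000"]
      resources:
        requests:
          nvidia.com/gpu: 1    # extended resource advertised by the device plugin
        limits:
          nvidia.com/gpu: 1    # requests and limits must match for extended resources
```

The same resource request works inside a Volcano Job's task template, as in the manifests later in this thread.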

william-wang commented 1 year ago

@dojoeisuke How is it going? Did you try lowang-bh's suggestion?

dojoeisuke commented 1 year ago

> @dojoeisuke How is it going? Did you try lowang-bh's suggestion?

I attempted to integrate Volcano with the GPU Operator.

As a result, I found that Volcano is capable of scheduling pods that request "nvidia.com/gpu".

Env:

GPU node:

root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w002 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050772Ki",
  "nvidia.com/gpu": "2",
  "pods": "110"
}
sample-vcjob.yaml

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job1
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: job
      template:
        spec:
          containers:
            - image: nvidia/cuda:12.2.2-base-ubuntu20.04
              name: gpu
              command: ["sleep"]
              args: ["100000"]
              resources:
                requests:
                  nvidia.com/gpu: 1
                  cpu: "100m"
                limits:
                  nvidia.com/gpu: 1
                  cpu: "100m"
          restartPolicy: Never
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job2
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: job
      template:
        spec:
          containers:
            - image: nvidia/cuda:12.2.2-base-ubuntu20.04
              name: gpu
              command: ["sleep"]
              args: ["100000"]
              resources:
                requests:
                  nvidia.com/gpu: 1
                  cpu: "100m"
                limits:
                  nvidia.com/gpu: 1
                  cpu: "100m"
          restartPolicy: Never
```

Launch sample vcjobs

root@k8s-tryvolcano-m001:~# k get vcjob
NAME   STATUS    MINAVAILABLE   RUNNINGS   AGE
job1   Running   1              1          2m51s
job2   Running   1              1          2m51s
root@k8s-tryvolcano-m001:~# k get po
NAME         READY   STATUS    RESTARTS   AGE
job1-job-0   1/1     Running   0          3m3s
job2-job-0   1/1     Running   0          3m3s
root@k8s-tryvolcano-m001:~# k exec job1-job-0 -- nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-304cf5cc-8ae1-3a75-a2e3-49e2180c23b1)
root@k8s-tryvolcano-m001:~# k exec job2-job-0 -- nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-6d4e62a3-44ea-c4ce-3d99-2e4754bc9c5b)
dojoeisuke commented 1 year ago

A training job using MNIST fails in the middle of training. Is this a problem with Volcano, or with the Python code?

vcjob manifest for training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train1
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: train1
      template:
        spec:
          containers:
            - image: dojoeisuke/mnist-train:0.1
              name: train1
              resources:
                requests:
                  nvidia.com/gpu: 1
                  cpu: "100m"
                limits:
                  nvidia.com/gpu: 1
                  cpu: "100m"
          restartPolicy: Never
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train2
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: train2
      template:
        spec:
          containers:
            - image: dojoeisuke/mnist-train:0.1
              name: train2
              resources:
                requests:
                  nvidia.com/gpu: 1
                  cpu: "100m"
                limits:
                  nvidia.com/gpu: 1
                  cpu: "100m"
          restartPolicy: Never
```
python mnist

```python
import tensorflow as tf

gpu_devices = tf.config.experimental.list_physical_devices('GPU')
if gpu_devices:
    print(f"Number of available GPU devices: {len(gpu_devices)}")
    for device in gpu_devices:
        print(f"Device name: {device.name}")
else:
    print("There are no available GPU devices.")

mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')])

model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=20, validation_split=0.2)
```

Launch vcjobs and results

root@k8s-tryvolcano-m001:~# k get vcjob
NAME     STATUS      MINAVAILABLE   RUNNINGS   AGE
train1   Failed      1                         78m
train2   Completed   1                         78m
root@k8s-tryvolcano-m001:~# k get po
NAME              READY   STATUS      RESTARTS   AGE
train1-train1-0   0/1     Error       0          78m
train2-train2-0   0/1     Completed   0          78m

train1.log train2.log

lowang-bh commented 1 year ago

The pod was successfully scheduled to a node and the GPU was allocated to it.

dojoeisuke commented 1 year ago

> The pod was successfully scheduled to a node and the GPU was allocated to it.

Does this mean that the cause lies elsewhere since Volcano is functioning correctly?

lowang-bh commented 1 year ago

> The pod was successfully scheduled to a node and the GPU was allocated to it.
>
> Does this mean that the cause lies elsewhere since Volcano is functioning correctly?

Yes, I think so. The code that prints the GPU devices works well. You can check your image.

Reference: https://github.com/google-research/multinerf/issues/47 and https://github.com/tensorflow/tensorflow/issues/62075

dojoeisuke commented 11 months ago

"OOM-killer" occurred on worker node. So the cause of this issue was just out of memory.

root@k8s-tryvolcano-w002:~# grep -rin "3060ac04f74449db2b0d13d728b6c9a87765fc304" /var/log/syslog
8688:Nov 22 05:38:48 k8s-tryvolcano-w002 systemd[1]: Started crio-conmon-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope.
8691:Nov 22 05:38:48 k8s-tryvolcano-w002 systemd[1]: Started libcontainer container 3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.
8694:Nov 22 05:38:51 k8s-tryvolcano-w002 crio[603]: time="2023-11-22 05:38:51.320973704Z" level=info msg="Created container 3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7: default/pod1-s96zd/pod1" id=dea13ccd-8549-42b4-978b-a0f970f9aef8 name=/runtime.v1.RuntimeService/CreateContainer
8695:Nov 22 05:38:51 k8s-tryvolcano-w002 crio[603]: time="2023-11-22 05:38:51.322220476Z" level=info msg="Starting container: 3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7" id=1671e7f2-c682-4681-be2e-580a463e8d1c name=/runtime.v1.RuntimeService/StartContainer
8698:Nov 22 05:38:51 k8s-tryvolcano-w002 crio[603]: time="2023-11-22 05:38:51.349652429Z" level=info msg="Started container" PID=57547 containerID=3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7 description=default/pod1-s96zd/pod1 id=1671e7f2-c682-4681-be2e-580a463e8d1c name=/runtime.v1.RuntimeService/StartContainer sandboxID=78239442e9449157ea5acef8bed6bf746845d19ac1896c3b44f8539784d18780
8886:Nov 22 05:39:07 k8s-tryvolcano-w002 kernel: [ 4563.431387] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=crio-08fda0c65d1fc792cd94b0875aeea3a49d5230d7f48ea28238085e5e4b90fd6c.scope,mems_allowed=0,global_oom,task_memcg=/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf5fde39d_7e42_4003_9faf_b0ee64bf5b6d.slice/crio-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope,task=python,pid=57547,uid=0
8890:Nov 22 05:39:10 k8s-tryvolcano-w002 systemd[1]: crio-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope: Succeeded.
8896:Nov 22 05:39:12 k8s-tryvolcano-w002 systemd[1]: crio-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope: Consumed 4.267s CPU time.
8911:Nov 22 05:39:12 k8s-tryvolcano-w002 systemd[1]: crio-conmon-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope: Succeeded.
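
One common way to make this kind of failure bounded and visible is to add memory requests and limits to the training container, so the scheduler accounts for the memory and the container is killed within its own cgroup and reported as OOMKilled, rather than being reaped by the node-level OOM killer. A hedged sketch based on the train1 manifest above (the 4Gi value is just a placeholder, not something tested in this thread):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train1
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: train1
      template:
        spec:
          containers:
            - image: dojoeisuke/mnist-train:0.1
              name: train1
              resources:
                requests:
                  nvidia.com/gpu: 1
                  cpu: "100m"
                  memory: "4Gi"   # placeholder; size to the actual training footprint
                limits:
                  nvidia.com/gpu: 1
                  cpu: "100m"
                  memory: "4Gi"   # equal request/limit keeps memory usage bounded
          restartPolicy: Never
```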
dojoeisuke commented 11 months ago

@lowang-bh thank you for your help. I will close this issue.