Closed: dojoeisuke closed this issue 11 months ago.
You need to install the corresponding device plugins for "amd.com/gpu" or "nvidia.com/gpu", and then request those resources in your pod YAMLs.
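For example, here is a minimal sketch of a pod spec that requests one NVIDIA GPU once the device plugin is installed (the pod name, image tag, and use of Volcano as the scheduler are illustrative assumptions, not posted configuration):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                # illustrative name
spec:
  schedulerName: volcano        # have Volcano schedule this pod
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA-capable image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1       # extended resource exposed by the device plugin

Volcano treats "nvidia.com/gpu" like any other scalar resource, so nothing Volcano-specific is needed in the request itself; the device plugin is the only prerequisite.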
@dojoeisuke How are things going? Did you try lowang-bh's suggestion?
I attempted to integrate Volcano with the GPU Operator.
As a result, I found that Volcano is capable of scheduling requests for "nvidia.com/gpu".
Env:
GPU node:
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w002 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050772Ki",
  "nvidia.com/gpu": "2",
  "pods": "110"
}
Launch sample vcjobs; a sketch of a comparable manifest follows the output below.
root@k8s-tryvolcano-m001:~# k get vcjob
NAME   STATUS    MINAVAILABLE   RUNNINGS   AGE
job1   Running   1              1          2m51s
job2   Running   1              1          2m51s
root@k8s-tryvolcano-m001:~# k get po
NAME         READY   STATUS    RESTARTS   AGE
job1-job-0   1/1     Running   0          3m3s
job2-job-0   1/1     Running   0          3m3s
root@k8s-tryvolcano-m001:~# k exec job1-job-0 -- nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-304cf5cc-8ae1-3a75-a2e3-49e2180c23b1)
root@k8s-tryvolcano-m001:~# k exec job2-job-0 -- nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-6d4e62a3-44ea-c4ce-3d99-2e4754bc9c5b)
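The vcjob manifests themselves were not posted, but a minimal sketch of a Volcano Job that would produce output like the above (one replica requesting one GPU; the name, image, and command are illustrative) is:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job1
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
  - replicas: 1
    name: job                   # pods are named <job>-<task>-<index>, e.g. job1-job-0
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU per replica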
However, a training job using MNIST fails in the middle of training. Is this a problem with Volcano, or a problem with the Python code?
Launched the vcjobs, with this result:
root@k8s-tryvolcano-m001:~# k get vcjob
NAME     STATUS      MINAVAILABLE   RUNNINGS   AGE
train1   Failed      1                         78m
train2   Completed   1                         78m
root@k8s-tryvolcano-m001:~# k get po
NAME              READY   STATUS      RESTARTS   AGE
train1-train1-0   0/1     Error       0          78m
train2-train2-0   0/1     Completed   0          78m
The pod was successfully scheduled to a node, and the GPU was allocated to it.
Does this mean that the cause lies elsewhere since Volcano is functioning correctly?
Yes, I think so. The code that prints the GPU devices works well, so you can check your image.
Reference: https://github.com/google-research/multinerf/issues/47 and https://github.com/tensorflow/tensorflow/issues/62075
"OOM-killer" occurred on worker node. So the cause of this issue was just out of memory.
root@k8s-tryvolcano-w002:~# grep -rin "3060ac04f74449db2b0d13d728b6c9a87765fc304" /var/log/syslog
8688:Nov 22 05:38:48 k8s-tryvolcano-w002 systemd[1]: Started crio-conmon-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope.
8691:Nov 22 05:38:48 k8s-tryvolcano-w002 systemd[1]: Started libcontainer container 3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.
8694:Nov 22 05:38:51 k8s-tryvolcano-w002 crio[603]: time="2023-11-22 05:38:51.320973704Z" level=info msg="Created container 3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7: default/pod1-s96zd/pod1" id=dea13ccd-8549-42b4-978b-a0f970f9aef8 name=/runtime.v1.RuntimeService/CreateContainer
8695:Nov 22 05:38:51 k8s-tryvolcano-w002 crio[603]: time="2023-11-22 05:38:51.322220476Z" level=info msg="Starting container: 3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7" id=1671e7f2-c682-4681-be2e-580a463e8d1c name=/runtime.v1.RuntimeService/StartContainer
8698:Nov 22 05:38:51 k8s-tryvolcano-w002 crio[603]: time="2023-11-22 05:38:51.349652429Z" level=info msg="Started container" PID=57547 containerID=3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7 description=default/pod1-s96zd/pod1 id=1671e7f2-c682-4681-be2e-580a463e8d1c name=/runtime.v1.RuntimeService/StartContainer sandboxID=78239442e9449157ea5acef8bed6bf746845d19ac1896c3b44f8539784d18780
8886:Nov 22 05:39:07 k8s-tryvolcano-w002 kernel: [ 4563.431387] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=crio-08fda0c65d1fc792cd94b0875aeea3a49d5230d7f48ea28238085e5e4b90fd6c.scope,mems_allowed=0,global_oom,task_memcg=/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf5fde39d_7e42_4003_9faf_b0ee64bf5b6d.slice/crio-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope,task=python,pid=57547,uid=0
8890:Nov 22 05:39:10 k8s-tryvolcano-w002 systemd[1]: crio-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope: Succeeded.
8896:Nov 22 05:39:12 k8s-tryvolcano-w002 systemd[1]: crio-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope: Consumed 4.267s CPU time.
8911:Nov 22 05:39:12 k8s-tryvolcano-w002 systemd[1]: crio-conmon-3060ac04f74449db2b0d13d728b6c9a87765fc3045a8cc86960e0b9f479c86f7.scope: Succeeded.
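The oom-kill line above also shows the pod ran in the kubepods-besteffort slice, i.e. with no memory request or limit set. A hedged sketch of a container resources stanza (values are illustrative and would need to be sized to the training workload) that lifts the pod out of BestEffort QoS, placed under the task's container spec:

    resources:
      requests:
        memory: 4Gi           # illustrative; reserves memory at scheduling time
        cpu: "1"
      limits:
        memory: 6Gi           # overruns are then killed per-container and reported as OOMKilled
        nvidia.com/gpu: 1

With requests set, the kubelet also evicts such pods later than BestEffort pods under node memory pressure.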
@lowang-bh thank you for your help. I will close this issue.
What would you like to be added:
Volcano supports the "amd.com/gpu" and "nvidia.com/gpu" resources.
Why is this needed:
I want to schedule the launch of a Kubeflow Notebook with a specified GPU using Volcano. It seems that Kubeflow Notebooks currently do not support the "volcano.sh/vgpu-number" resource.
https://www.kubeflow.org/docs/components/notebooks/quickstart-guide/
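For illustration, a hedged sketch of a Notebook CR that instead requests a whole GPU through the plain device-plugin resource, which the discussion above shows Volcano can already schedule; the namespace, image, and schedulerName wiring are assumptions, not tested configuration:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: gpu-notebook            # illustrative name
  namespace: kubeflow-user-example-com   # assumed user namespace
spec:
  template:
    spec:
      schedulerName: volcano    # assumption: route the notebook pod to Volcano
      containers:
      - name: gpu-notebook      # conventionally matches the Notebook name
        image: kubeflownotebookswg/jupyter-pytorch-cuda-full:latest   # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 1   # plain device-plugin resource, schedulable by Volcano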