dojoeisuke opened this issue 1 year ago
Can you successfully launch a vGPU task?
No. The status of the vcjob is pending.
Thanks for your reply, can you provide the following information?
- GPU node annotations (using kubectl describe node <node-name>)
Below are the GPU node's annotations:
root@k8s-tryvolcano-m001:~# k describe node k8s-tryvolcano-w004
Name: k8s-tryvolcano-w004
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=k8s-tryvolcano-w004
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.100.168/24
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.200.126
volumes.kubernetes.io/controller-managed-attach-detach: true
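For reference, a quick way to check whether the node advertises any Volcano vGPU resources is to filter the describe output; on a correctly configured node, volcano.sh/vgpu-number should appear under Capacity and Allocatable (an illustrative command, not from the original reply):
# Look for Volcano vGPU resources on the GPU node
kubectl describe node k8s-tryvolcano-w004 | grep -i volcano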
- can you launch the following example? https://github.com/volcano-sh/devices/blob/master/examples/vgpu-case02.yml
Below is the description of the podgroup when the example is launched.
root@k8s-tryvolcano-m001:~# k apply -f https://raw.githubusercontent.com/volcano-sh/devices/master/examples/vgpu-case02.yml
pod/pod1 created
root@k8s-tryvolcano-m001:~# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 3m24s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME STATUS MINMEMBER RUNNINGS AGE
podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7 Pending 1 3m34s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7
...
Spec:
Min Member: 1
Min Resources:
count/pods: 1
Pods: 1
requests.volcano.sh/vgpu-memory: 1024
requests.volcano.sh/vgpu-number: 1
volcano.sh/vgpu-memory: 1024
volcano.sh/vgpu-number: 1
Queue: default
Status:
Conditions:
Last Transition Time: 2023-10-26T09:20:01Z
Message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Reason: NotEnoughResources
Status: True
Transition ID: 24434f25-8ee7-4a06-a929-aa01c49b80a0
Type: Unschedulable
Phase: Pending
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unschedulable 3m48s (x12 over 3m59s) volcano 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Normal Unschedulable 3m47s (x13 over 3m59s) volcano resource in cluster is overused
"resource in cluster is overused" message means job is reject by enqueue action.
"resource in cluster is overused" message means job is reject by enqueue action.
Upon checking the volcano-scheduler log, it seems that the cause is the absence of "volcano.sh/vgpu-number" in the "realCapability".
I1027 02:07:15.850507 1 proportion.go:230] The attributes of queue <default> in proportion: deserved <cpu 0.00, memory 0.00>, realCapability <cpu 10000.00, memory 24333668352.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 0.00, memory 0.00, volcano.sh/vgpu-number 1000.00>, elastic <cpu 0.00, memory 0.00>, share <0.00>
I1027 02:07:15.850531 1 proportion.go:242] Remaining resource is <cpu 10000.00, memory 24333668352.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00>
I1027 02:07:15.850555 1 proportion.go:244] Exiting when remaining is empty or no queue has more resource request: <cpu 10000.00, memory 24333668352.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
Note: since the earlier logs were no longer available, pod1 was relaunched.
root@k8s-tryvolcano-m001:~# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 7m1s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME STATUS MINMEMBER RUNNINGS AGE
podgroup-8fe3417c-53b2-4933-bf99-fd4c4298675f Pending 1 7m4s
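One way to confirm this from the node side is to check whether the vGPU resource is advertised in the node's allocatable resources (the same check that is used later in this thread):
# The GPU node should list volcano.sh/vgpu-number once the device plugin is working
kubectl get node k8s-tryvolcano-w004 -o json | jq .status.allocatable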
Yes, your node's describe output shows no Volcano GPU information!
Now the volcano-device-plugin pod on the GPU node outputs "could not load NVML library".
root@k8s-tryvolcano-m001:~# k -n kube-system logs volcano-device-plugin-jtfxz
I1027 05:40:47.592928 1 main.go:77] Loading NVML
I1027 05:40:47.593106 1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1027 05:40:47.593135 1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1027 05:40:47.593146 1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1027 05:40:47.593169 1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1027 05:40:47.593180 1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1027 05:40:47.593211 1 main.go:44] failed to initialize NVML: could not load NVML library
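As a quick sanity check (an illustrative step, not from the original report), you can verify on the GPU node host that the NVML shared library is visible to the dynamic linker:
# NVML is provided by libnvidia-ml; it should show up in the linker cache on the host
ldconfig -p | grep libnvidia-ml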
How to reproduce it (as minimally and precisely as possible):
Prerequisites:
- Kubernetes cluster v1.24.3 is running
- Installed Volcano
Reproduce:
- Install the NVIDIA driver on the new GPU worker node.
- Install nvidia-docker2 on the new GPU worker node.
- Install Kubernetes on the new GPU worker node.
- Join the new GPU worker node to the Kubernetes cluster.
- Install volcano-vgpu-plugin.
Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.
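After these steps, a quick verification (illustrative, not part of the original steps) is to check that the device-plugin pod is running and to inspect its logs; in this thread the plugin pod lives in kube-system:
# Find the plugin pod on the GPU node and read its logs
kubectl -n kube-system get pods -o wide | grep volcano-device-plugin
kubectl -n kube-system logs <volcano-device-plugin-pod-name>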
Unfortunately, the above reproduction steps were not accurate: the installation of the GPU Operator was omitted. In other words, the fact that the NVML library was successfully loaded in the first log (below) might be due to the influence of the GPU Operator.
"volocano-device-plugin" pod log
I1018 08:42:42.247448 1 main.go:77] Loading NVML
I1018 08:42:42.317422 1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465 1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759 1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770 1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783 1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498 1 register.go:89] Reporting devices in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880
I1018 08:43:12.468827 1 register.go:89] Reporting devices in 2023-10-18 08:43:12.468819399 +0000 UTC m=+30.224838968
I1018 08:43:42.485190 1 register.go:89] Reporting devices in 2023-10-18 08:43:42.485182962 +0000 UTC m=+60.241202532
I1018 08:44:12.505930 1 register.go:89] Reporting devices in 2023-10-18 08:44:12.505920612 +0000 UTC m=+90.261940182
I1018 08:44:42.523805 1 register.go:89] Reporting devices in 2023-10-18 08:44:42.523797163 +0000 UTC m=+120.279816722
I1018 08:45:12.542654 1 register.go:89] Reporting devices in 2023-10-18 08:45:12.542646375 +0000 UTC m=+150.298665943
I1018 08:45:42.564609 1 register.go:89] Reporting devices in 2023-10-18 08:45:42.564600701 +0000 UTC m=+180.320620270
I1018 08:46:12.584788 1 register.go:89] Reporting devices in 2023-10-18 08:46:12.584777812 +0000 UTC m=+210.340797381
I1018 08:46:42.653138 1 register.go:89] Reporting devices in 2023-10-18 08:46:42.653129051 +0000 UTC m=+240.409148620
I1018 08:47:12.674599 1 register.go:89] Reporting devices in 2023-10-18 08:47:12.674591614 +0000 UTC m=+270.430611183
I1018 08:47:42.690977 1 register.go:89] Reporting devices in 2023-10-18 08:47:42.69097107 +0000 UTC m=+300.446990640
I1018 08:48:12.707222 1 register.go:89] Reporting devices in 2023-10-18 08:48:12.707213231 +0000 UTC m=+330.463232800
I1018 08:48:42.781451 1 register.go:89] Reporting devices in 2023-10-18 08:48:42.781437965 +0000 UTC m=+360.537457544
I1018 08:49:12.816300 1 register.go:89] Reporting devices in 2023-10-18 08:49:12.816292362 +0000 UTC m=+390.572311921
I1018 08:49:42.834850 1 register.go:89] Reporting devices in 2023-10-18 08:49:42.834844163 +0000 UTC m=+420.590863732
I1018 08:50:12.855810 1 register.go:89] Reporting devices in 2023-10-18 08:50:12.855797817 +0000 UTC m=+450.611817406
I1018 08:50:42.875763 1 register.go:89] Reporting devices in 2023-10-18 08:50:42.875755678 +0000 UTC m=+480.631775247
I1018 08:51:12.892908 1 register.go:89] Reporting devices in 2023-10-18 08:51:12.89289625 +0000 UTC m=+510.648915829
I1018 08:51:42.913563 1 register.go:89] Reporting devices in 2023-10-18 08:51:42.913556355 +0000 UTC m=+540.669575924
I1018 08:52:12.938239 1 register.go:89] Reporting devices in 2023-10-18 08:52:12.93823072 +0000 UTC m=+570.694250290
I1018 08:52:42.968125 1 register.go:89] Reporting devices in 2023-10-18 08:52:42.968118172 +0000 UTC m=+600.724137731
I1018 08:53:12.988476 1 register.go:89] Reporting devices in 2023-10-18 08:53:12.988467434 +0000 UTC m=+630.744487003
...
@dojoeisuke can I see /etc/docker/daemon.json on that GPU node?
root@k8s-tryvolcano-w004:~# cat /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
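(For completeness: after editing /etc/docker/daemon.json, Docker must be restarted for the default runtime to take effect, for example:)
# Reload Docker so the nvidia default runtime is picked up
sudo systemctl restart docker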
Can this issue be reproduced without installing the GPU Operator?
I tried it. The volcano-device-plugin pod on the GPU node produced the following error output.
I1030 05:12:02.805254 1 main.go:77] Loading NVML
I1030 05:12:02.805419 1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428 1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431 1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467 1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473 1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498 1 main.go:44] failed to initialize NVML: could not load NVML library
Also, the example manifest was not scheduled onto the GPU node.
root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 9m8s
Try the following command on the GPU node: docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash. Then, inside the container, run "nvidia-smi" and see whether it works.
There was an inadequacy in preparing the GPU node: in Kubernetes 1.24, it was necessary to install cri-dockerd and specify cri-dockerd as the CRI socket for the kubelet.
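A sketch of what that looks like when (re)joining the node with kubeadm, assuming cri-dockerd's default socket path; values in angle brackets are placeholders:
# Point kubeadm (and hence the kubelet) at cri-dockerd instead of the removed dockershim
kubeadm join <control-plane-endpoint>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --cri-socket unix:///var/run/cri-dockerd.sock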
As a result, "volcano.sh/vgpu-number" is included in "allocatable" as expected.
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
"cpu": "2",
"ephemeral-storage": "93492209510",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "8050772Ki",
"pods": "110",
"volcano.sh/vgpu-number": "10"
}
Next, I tried to launch the example manifest, but it failed due to a lack of resources.
Note: the following fields were changed:
- image: nvidia/cuda:10.1-base-ubuntu18.04 -> nvidia/cuda:12.1.0-base-ubuntu18.04
- volcano.sh/vgpu-number: 1 -> 2
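For reference, the modified manifest was roughly equivalent to the following sketch, reconstructed from the podgroup's Min Resources below and the Volcano vGPU user guide (the actual vgpu-case02.yml and its command may differ):
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  schedulerName: volcano           # let the Volcano scheduler place this pod
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.1.0-base-ubuntu18.04
    command: ["sleep", "infinity"]  # assumed placeholder command
    resources:
      limits:
        volcano.sh/vgpu-number: 2   # changed from 1
        volcano.sh/vgpu-memory: 1024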
root@k8s-tryvolcano-m001:~# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 80s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME STATUS MINMEMBER RUNNINGS AGE
podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5 Inqueue 1 47s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5
(snip)
Spec:
Min Member: 1
Min Resources:
count/pods: 1
Pods: 1
requests.volcano.sh/vgpu-memory: 1024
requests.volcano.sh/vgpu-number: 2
volcano.sh/vgpu-memory: 1024
volcano.sh/vgpu-number: 2
Queue: default
Status:
Conditions:
Last Transition Time: 2023-10-30T07:59:03Z
Message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Reason: NotEnoughResources
Status: True
Transition ID: 84edb100-71c5-44d7-8c55-c5dabd7ae74f
Type: Unschedulable
Phase: Inqueue
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unschedulable 1s (x13 over 13s) volcano 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Does this mean there is still an inadequacy in preparing the GPU node?
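One way to dig further, as was done earlier in this thread, is to grep the scheduler logs for the vGPU resource (assuming the default install in the volcano-system namespace):
# Check how the scheduler accounts for volcano.sh/vgpu-number and volcano.sh/vgpu-memory
kubectl -n volcano-system logs deployment/volcano-scheduler | grep -i vgpu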
Running the docker command suggested above on the GPU node was successful; nvidia-smi worked inside the container:
root@k8s-tryvolcano-w004:~# docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash
Unable to find image 'ubuntu:18.04' locally
18.04: Pulling from library/ubuntu
7c457f213c76: Pull complete
Digest: sha256:152dc042452c496007f07ca9127571cb9c29697f42acbfad72324b2bb2e43c98
Status: Downloaded newer image for ubuntu:18.04
root@3b1a7f3abe05:/# nvidia-smi
Mon Oct 30 08:23:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:00:05.0 Off | 0 |
| N/A 35C P0 44W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@3b1a7f3abe05:/# exit
exit
root@k8s-tryvolcano-w004:~#
@archlitchi
About https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784644826, since "volcano.sh/vgpu-number" has become part of the allocatable resources, would it be better to close this issue? Also, should I submit a new issue about https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784664057?
Is your kubernetes cluster set up to use docker or containerd as its underlying container runtime? If it’s containerd, you need to follow the instructions under the containerd tab here to set it up: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2
The above URL seems to redirect to https://docs.nvidia.com/datacenter/cloud-native/index.html. Is the following URL correct? https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html
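For what it's worth, on containerd-based nodes recent versions of the NVIDIA Container Toolkit can generate the runtime configuration those docs describe; a rough sketch (not taken from the linked page):
# Configure containerd to use the nvidia runtime, then restart it
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd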
Is your problem fixed, @dojoeisuke? And is it caused by Docker being removed in Kubernetes v1.24? @archlitchi
@Monokaix
The problem has not been resolved, but I personally find it difficult to continue the investigation, so I will temporarily close this issue. Thank you for your support. @archlitchi
What happened:
I followed the user guide to set up vGPU, but "volcano.sh/vgpu-number" is not included in the allocatable resources.
user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md
What you expected to happen:
"volcano.sh/vgpu-number: XX" is included by executing the following command.
How to reproduce it (as minimally and precisely as possible):
Prerequisites:
- Kubernetes cluster v1.24.3 is running
- Installed Volcano
Reproduce:
- Install the NVIDIA driver on the new GPU worker node.
- Install nvidia-docker2 on the new GPU worker node.
- Install Kubernetes on the new GPU worker node.
- Join the new GPU worker node to the Kubernetes cluster.
- Install volcano-vgpu-plugin.
Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.
Anything else we need to know?:
Environment:
- Volcano version: v1.8.0
- Kubernetes version (kubectl version):
- Cloud provider: OpenStack
- Kernel (uname -a):
- Install tools: kubeadm
- Nvidia driver:
- nvidia-docker2:
- GPU:
- "volcano-device-plugin" pod log:
- volcano-scheduler-configmap: