volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

"volocano.sh/vgpu-number" is not included in the allocatable resources. #3160

Open dojoeisuke opened 1 year ago

dojoeisuke commented 1 year ago

What happened:

I followed the user guide to set up vgpu, but "volcano.sh/vgpu-number" is not included in the allocatable resources.

user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md

What you expected to happen:

"volcano.sh/vgpu-number: XX" is included by executing the following command.

root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050764Ki",
  "pods": "110"
}

How to reproduce it (as minimally and precisely as possible):

Prerequisites:

  • Kubernetes cluster v1.24.3 is running
  • Volcano is installed

Reproduce:

  1. Install nvidia drivers in new GPU worker node.
  2. Install nvidia-docker2 in new GPU worker node.
  3. Install kubernetes in new GPU worker node.
  4. Join new GPU worker node to kubernetes cluster.
  5. Install volcano-vgpu-plugin (a sketch of this step follows below).

Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.
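For step 5, a sketch of the install command as described in the referenced user guide (the manifest name is assumed from the volcano-sh/devices repository):

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/devices/master/volcano-vgpu-device-plugin.yml
# then confirm the plugin pod is running on the GPU node:
kubectl get pods -n kube-system | grep volcano-device-plugin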

Anything else we need to know?:

Environment:

Volcano version: v1.8.0

root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -owide
NAME                  STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-tryvolcano-w004   Ready    <none>   18h   v1.24.3   192.168.100.168   <none>        Ubuntu 20.04.6 LTS   5.4.0-72-generic   containerd://1.7.2

Cloud provider: OpenStack

root@k8s-tryvolcano-w004:~# cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
root@k8s-tryvolcano-w004:~# uname -a
Linux k8s-tryvolcano-w004 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

kubeadm

Nvidia driver

root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-driver
ii  nvidia-driver-535-server-open         535.104.12-0ubuntu0.20.04.1       amd64        NVIDIA driver (open kernel) metapackage

nvidia-docker2

root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-docker
ii  nvidia-docker2                        2.13.0-1                          all          nvidia-docker CLI wrapper

GPU

root@k8s-tryvolcano-w004:~# nvidia-smi 
Thu Oct 19 02:24:55 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:00:05.0 Off |                    0 |
| N/A   43C    P0              63W / 300W |      4MiB / 81920MiB |     20%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

"volocano-device-plugin" pod log

I1018 08:42:42.247448       1 main.go:77] Loading NVML
I1018 08:42:42.317422       1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465       1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759       1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770       1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783       1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498       1 register.go:89] Reporting devices  in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880
I1018 08:43:12.468827       1 register.go:89] Reporting devices  in 2023-10-18 08:43:12.468819399 +0000 UTC m=+30.224838968
I1018 08:43:42.485190       1 register.go:89] Reporting devices  in 2023-10-18 08:43:42.485182962 +0000 UTC m=+60.241202532
I1018 08:44:12.505930       1 register.go:89] Reporting devices  in 2023-10-18 08:44:12.505920612 +0000 UTC m=+90.261940182
I1018 08:44:42.523805       1 register.go:89] Reporting devices  in 2023-10-18 08:44:42.523797163 +0000 UTC m=+120.279816722
I1018 08:45:12.542654       1 register.go:89] Reporting devices  in 2023-10-18 08:45:12.542646375 +0000 UTC m=+150.298665943
I1018 08:45:42.564609       1 register.go:89] Reporting devices  in 2023-10-18 08:45:42.564600701 +0000 UTC m=+180.320620270
I1018 08:46:12.584788       1 register.go:89] Reporting devices  in 2023-10-18 08:46:12.584777812 +0000 UTC m=+210.340797381
I1018 08:46:42.653138       1 register.go:89] Reporting devices  in 2023-10-18 08:46:42.653129051 +0000 UTC m=+240.409148620
I1018 08:47:12.674599       1 register.go:89] Reporting devices  in 2023-10-18 08:47:12.674591614 +0000 UTC m=+270.430611183
I1018 08:47:42.690977       1 register.go:89] Reporting devices  in 2023-10-18 08:47:42.69097107 +0000 UTC m=+300.446990640
I1018 08:48:12.707222       1 register.go:89] Reporting devices  in 2023-10-18 08:48:12.707213231 +0000 UTC m=+330.463232800
I1018 08:48:42.781451       1 register.go:89] Reporting devices  in 2023-10-18 08:48:42.781437965 +0000 UTC m=+360.537457544
I1018 08:49:12.816300       1 register.go:89] Reporting devices  in 2023-10-18 08:49:12.816292362 +0000 UTC m=+390.572311921
I1018 08:49:42.834850       1 register.go:89] Reporting devices  in 2023-10-18 08:49:42.834844163 +0000 UTC m=+420.590863732
I1018 08:50:12.855810       1 register.go:89] Reporting devices  in 2023-10-18 08:50:12.855797817 +0000 UTC m=+450.611817406
I1018 08:50:42.875763       1 register.go:89] Reporting devices  in 2023-10-18 08:50:42.875755678 +0000 UTC m=+480.631775247
I1018 08:51:12.892908       1 register.go:89] Reporting devices  in 2023-10-18 08:51:12.89289625 +0000 UTC m=+510.648915829
I1018 08:51:42.913563       1 register.go:89] Reporting devices  in 2023-10-18 08:51:42.913556355 +0000 UTC m=+540.669575924
I1018 08:52:12.938239       1 register.go:89] Reporting devices  in 2023-10-18 08:52:12.93823072 +0000 UTC m=+570.694250290
I1018 08:52:42.968125       1 register.go:89] Reporting devices  in 2023-10-18 08:52:42.968118172 +0000 UTC m=+600.724137731
I1018 08:53:12.988476       1 register.go:89] Reporting devices  in 2023-10-18 08:53:12.988467434 +0000 UTC m=+630.744487003

... 

volcano-scheduler-configmap

root@k8s-tryvolcano-m001:~# kubectl get cm -n volcano-system volcano-scheduler-configmap -oyaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: true
        enableReclaimable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.VGPUEnable: true
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n    enablePreemptable: false\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n    enablePreemptable: false\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2023-09-21T04:44:44Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "4267609"
  uid: 086455c9-7a0e-42b0-a938-4e56a6371207
archlitchi commented 1 year ago

Can you successfully launch a vgpu task?

dojoeisuke commented 1 year ago

Can you successfully launch a vgpu task?

No. The status of the vcjob is Pending.

archlitchi commented 1 year ago

Can you successfully launch a vgpu task?

No. The status of the vcjob is Pending.

Thanks for your reply. Can you provide the following information?

  1. GPU node annotations (using kubectl describe node)
  2. Can you launch the following example? https://github.com/volcano-sh/devices/blob/master/examples/vgpu-case02.yml
dojoeisuke commented 1 year ago

Can you successfully launch a vgpu task?

No. The status of the vcjob is Pending.

Thanks for your reply. Can you provide the following information?

  1. GPU node annotations (using kubectl describe node)

Below are the GPU node's annotations:

root@k8s-tryvolcano-m001:~# k describe node k8s-tryvolcano-w004 
Name:               k8s-tryvolcano-w004
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=k8s-tryvolcano-w004
                    kubernetes.io/os=linux
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.100.168/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.200.126
                    volumes.kubernetes.io/controller-managed-attach-detach: true
  2. Can you launch the following example? https://github.com/volcano-sh/devices/blob/master/examples/vgpu-case02.yml

Below is the description of the podgroup when the example is launched.

root@k8s-tryvolcano-m001:~# k apply -f https://raw.githubusercontent.com/volcano-sh/devices/master/examples/vgpu-case02.yml
pod/pod1 created
root@k8s-tryvolcano-m001:~# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          3m24s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE
podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7   Pending   1                      3m34s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7 

... 

Spec:
  Min Member:  1
  Min Resources:
    count/pods:                       1
    Pods:                             1
    requests.volcano.sh/vgpu-memory:  1024
    requests.volcano.sh/vgpu-number:  1
    volcano.sh/vgpu-memory:           1024
    volcano.sh/vgpu-number:           1
  Queue:                              default
Status:
  Conditions:
    Last Transition Time:  2023-10-26T09:20:01Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         24434f25-8ee7-4a06-a929-aa01c49b80a0
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  Unschedulable  3m48s (x12 over 3m59s)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
  Normal   Unschedulable  3m47s (x13 over 3m59s)  volcano  resource in cluster is overused
lowang-bh commented 1 year ago

"resource in cluster is overused" message means job is reject by enqueue action.

dojoeisuke commented 1 year ago

"resource in cluster is overused" message means job is reject by enqueue action.

Upon checking the volcano-scheduler log, it seems that the cause is the absence of "volcano.sh/vgpu-number" in the "realCapability".

I1027 02:07:15.850507       1 proportion.go:230] The attributes of queue <default> in proportion: deserved <cpu 0.00, memory 0.00>, realCapability <cpu 10000.00, memory 24333668352.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 0.00, memory 0.00, volcano.sh/vgpu-number 1000.00>, elastic <cpu 0.00, memory 0.00>, share <0.00>
I1027 02:07:15.850531       1 proportion.go:242] Remaining resource is  <cpu 10000.00, memory 24333668352.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00>
I1027 02:07:15.850555       1 proportion.go:244] Exiting when remaining is empty or no queue has more resource request:  <cpu 10000.00, memory 24333668352.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
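For what it's worth, a quick way to compare what the scheduler computes against what the node itself reports (a sketch using the node name from this thread):

kubectl get node k8s-tryvolcano-w004 -ojson | jq '{capacity: .status.capacity, allocatable: .status.allocatable}'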

volcano-scheduler.log

Note:

Since the earlier logs were no longer available, pod1 was relaunched.

root@k8s-tryvolcano-m001:~# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          7m1s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE
podgroup-8fe3417c-53b2-4933-bf99-fd4c4298675f   Pending   1                      7m4s
lowang-bh commented 1 year ago

Upon checking the volcano-scheduler log, it seems that the cause is the absence of "volcano.sh/vgpu-number" in the "realCapability".

Yes, your node description shows no Volcano GPU information!

dojoeisuke commented 1 year ago

Now the volcano-device-plugin pod on the GPU node outputs "could not load NVML library".

root@k8s-tryvolcano-m001:~# k -n kube-system logs volcano-device-plugin-jtfxz 
I1027 05:40:47.592928       1 main.go:77] Loading NVML
I1027 05:40:47.593106       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1027 05:40:47.593135       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1027 05:40:47.593146       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1027 05:40:47.593169       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1027 05:40:47.593180       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1027 05:40:47.593211       1 main.go:44] failed to initialize NVML: could not load NVML library
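Two quick host-side checks for this error (a sketch; these assume the NVIDIA driver and container toolkit are installed on the node):

# verify the NVML library is visible to the dynamic linker on the host
ldconfig -p | grep -i libnvidia-ml
# verify the nvidia container runtime can talk to the driver
nvidia-container-cli info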

How to reproduce it (as minimally and precisely as possible):

Prerequisites:

  • Kubernetes cluster v1.24.3 is running
  • Volcano is installed

Reproduce:

  1. Install nvidia drivers in new GPU worker node.
  2. Install nvidia-docker2 in new GPU worker node.
  3. Install kubernetes in new GPU worker node.
  4. Join new GPU worker node to kubernetes cluster.
  5. Install volcano-vgpu-plugin.

Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.

Unfortunately, the above reproduction steps were not accurate. The following step was omitted: the NVIDIA GPU Operator had also been installed on the cluster.

In other words, the NVML library being loaded successfully in the first log (below) might have been due to the GPU Operator.
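A quick way to check whether the GPU Operator is still running on the cluster (a sketch; the namespace can differ per installation):

kubectl get pods -A | grep -i gpu-operator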

"volocano-device-plugin" pod log

I1018 08:42:42.247448       1 main.go:77] Loading NVML
I1018 08:42:42.317422       1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465       1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759       1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770       1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783       1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498       1 register.go:89] Reporting devices  in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880

... 
archlitchi commented 1 year ago

@dojoeisuke Can I see /etc/docker/daemon.json on that GPU node?

dojoeisuke commented 1 year ago

@dojoeisuke Can I see /etc/docker/daemon.json on that GPU node?

root@k8s-tryvolcano-w004:~# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
archlitchi commented 1 year ago

Can this issue be reproduced without installing the GPU Operator?

dojoeisuke commented 1 year ago

Can this issue be reproduced without installing the GPU Operator?

I tried it.

The volcano-device-plugin pod on the GPU node produced the following error output.

I1030 05:12:02.805254       1 main.go:77] Loading NVML
I1030 05:12:02.805419       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498       1 main.go:44] failed to initialize NVML: could not load NVML library

Also, the example manifest was not scheduled to the GPU node.

root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          9m8s
archlitchi commented 1 year ago

Can this issue be reproduced without installing the GPU Operator?

I tried it.

The volcano-device-plugin pod on the GPU node produced the following error output.

I1030 05:12:02.805254       1 main.go:77] Loading NVML
I1030 05:12:02.805419       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498       1 main.go:44] failed to initialize NVML: could not load NVML library

Also, the example manifest was not scheduled to the GPU node.

root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          9m8s

Try the following command on the GPU node: docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash. Then, inside the container, run "nvidia-smi" and see if it works.

dojoeisuke commented 1 year ago

There was a gap in how the GPU node was prepared. Since Kubernetes 1.24 removed dockershim, it was necessary to install cri-dockerd and to specify the cri-dockerd socket as the CRI socket for the kubelet.
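For reference, a sketch of the join command with the cri-dockerd socket (endpoint, token, and hash are placeholders):

kubeadm join <control-plane-endpoint>:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --cri-socket unix:///var/run/cri-dockerd.sock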

As a result, "volcano.sh/vgpu-number" is included in "allocatable" as expected.

root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050772Ki",
  "pods": "110",
  "volcano.sh/vgpu-number": "10"
}
dojoeisuke commented 1 year ago

Next, I tried to launch the example manifest again.

Note: the following field was changed: volcano.sh/vgpu-number was raised from 1 to 2 (see the podgroup spec below).

It failed due to a lack of resources.
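For reference, a sketch of what the modified manifest presumably looked like. The resource values are taken from the podgroup spec below; the image and command are illustrative assumptions, not the exact contents of vgpu-case02.yml:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  schedulerName: volcano
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu20.04  # assumed image
    command: ["sleep", "infinity"]
    resources:
      limits:
        volcano.sh/vgpu-number: "2"     # from the podgroup spec below
        volcano.sh/vgpu-memory: "1024"  # from the podgroup spec below
EOF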

root@k8s-tryvolcano-m001:~# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          80s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE
podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5   Inqueue   1                      47s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5 

(snip)

Spec:
  Min Member:  1
  Min Resources:
    count/pods:                       1
    Pods:                             1
    requests.volcano.sh/vgpu-memory:  1024
    requests.volcano.sh/vgpu-number:  2
    volcano.sh/vgpu-memory:           1024
    volcano.sh/vgpu-number:           2
  Queue:                              default
Status:
  Conditions:
    Last Transition Time:  2023-10-30T07:59:03Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         84edb100-71c5-44d7-8c55-c5dabd7ae74f
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Warning  Unschedulable  1s (x13 over 13s)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable

Does this mean there is still an inadequacy in preparing the GPU node?

dojoeisuke commented 1 year ago

Can this issue be reproduced without installing the GPU Operator?

I tried it. The volcano-device-plugin pod on the GPU node produced the following error output.

I1030 05:12:02.805254       1 main.go:77] Loading NVML
I1030 05:12:02.805419       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498       1 main.go:44] failed to initialize NVML: could not load NVML library

Also, the example manifest was not scheduled to the GPU node.

root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          9m8s

Try the following command on the GPU node: docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash. Then, inside the container, run "nvidia-smi" and see if it works.

It was successful.

root@k8s-tryvolcano-w004:~# docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash
Unable to find image 'ubuntu:18.04' locally
18.04: Pulling from library/ubuntu
7c457f213c76: Pull complete 
Digest: sha256:152dc042452c496007f07ca9127571cb9c29697f42acbfad72324b2bb2e43c98
Status: Downloaded newer image for ubuntu:18.04
root@3b1a7f3abe05:/# nvidia-smi
Mon Oct 30 08:23:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:00:05.0 Off |                    0 |
| N/A   35C    P0              44W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@3b1a7f3abe05:/# exit
exit
root@k8s-tryvolcano-w004:~# 
dojoeisuke commented 1 year ago

@archlitchi

About https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784644826, since "volcano.sh/vgpu-number" has become part of the allocatable resources, would it be better to close this issue? Also, should I submit a new issue about https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784664057 ?

archlitchi commented 1 year ago

Is your Kubernetes cluster set up to use Docker or containerd as its underlying container runtime? If it's containerd, you need to follow the instructions under the containerd tab here to set it up: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2
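For a containerd-based node, a sketch of the runtime configuration, assuming the NVIDIA Container Toolkit's nvidia-ctk CLI is installed:

# configure containerd to use the nvidia runtime and make it the default
nvidia-ctk runtime configure --runtime=containerd --set-as-default
systemctl restart containerd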

dojoeisuke commented 1 year ago

Is your Kubernetes cluster set up to use Docker or containerd as its underlying container runtime? If it's containerd, you need to follow the instructions under the containerd tab here to set it up: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2

The above URL seems to redirect to https://docs.nvidia.com/datacenter/cloud-native/index.html. Is the following URL correct? https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html

Monokaix commented 9 months ago

Is your problem fixed, @dojoeisuke? And is it caused by Docker support being removed in Kubernetes v1.24? @archlitchi

dojoeisuke commented 9 months ago

@Monokaix

The problem has not been resolved, but I personally find it difficult to continue the investigation, so I will temporarily close this issue. Thank you for your support. @archlitchi