tkestack / gpu-manager

0/1 nodes are available: 1 Insufficient tencent.com/vcuda-core, 1 Insufficient tencent.com/vcuda-memory. #87

Open · timozerrer opened 3 years ago

timozerrer commented 3 years ago

Hello,

The pod is stuck in Pending and fails to be scheduled. I deployed gpu-manager and the gpu-admission controller. Do I need any other NVIDIA deployments, or only the CUDA / graphics driver?

Docker: 20.10.6, Kubelet: 1.21.0

Deployment.yaml:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mnist-test
  labels:
    app: mnist-test
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: mnist-test
  template: # define the pods' specifications
    metadata:
      labels:
        app: mnist-test
    spec:
      containers:
      - name: mnist-test
        image: localhost:5000/usecase/mnist:latest
        resources:
          requests:
            tencent.com/vcuda-core: 100
            tencent.com/vcuda-memory: 100
          limits:
            tencent.com/vcuda-core: 100
            tencent.com/vcuda-memory: 100
```
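For context on the units: per the gpu-manager README, `tencent.com/vcuda-core` is counted in hundredths of a physical card (100 = one exclusive GPU) and `tencent.com/vcuda-memory` in 256MiB chunks, so the spec above asks for a whole card plus 100 × 256MiB ≈ 25GiB of device memory. A fractional request would look roughly like the sketch below (the numbers are illustrative, not from this issue):

```yaml
# Illustrative only: half a card and 16 x 256MiB = 4GiB of device memory.
# Kubernetes requires requests and limits to be equal for extended resources.
resources:
  requests:
    tencent.com/vcuda-core: 50
    tencent.com/vcuda-memory: 16
  limits:
    tencent.com/vcuda-core: 50
    tencent.com/vcuda-memory: 16
```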

gpu-manager logs:

```
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-compiler.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1.7.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-eglcore.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glcore.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-tls.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glsi.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/bin/nvidia-cuda-mps-control to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-cuda-mps-server to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-debugdump to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-persistenced to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-smi to /usr/local/nvidia/bin/
rebuild ldcache
launch gpu manager
E0508 13:51:05.283586 9794 server.go:132] Unable to set Type=notify in systemd service file?
```

Node description:

```
$ kubectl describe no tke-ubuntu-pc
Name:               tke-ubuntu-pc
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=tke-ubuntu-pc
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    nvidia-device-enable=enable
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.28.11.59/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.4.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sat, 08 May 2021 12:35:29 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  tke-ubuntu-pc
  AcquireTime:     <unset>
  RenewTime:       Sat, 08 May 2021 14:15:13 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Sat, 08 May 2021 13:50:41 +0000   Sat, 08 May 2021 13:50:41 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Sat, 08 May 2021 14:15:13 +0000   Sat, 08 May 2021 12:35:27 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 08 May 2021 14:15:13 +0000   Sat, 08 May 2021 12:35:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sat, 08 May 2021 14:15:13 +0000   Sat, 08 May 2021 12:35:27 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sat, 08 May 2021 14:15:13 +0000   Sat, 08 May 2021 12:37:28 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.28.11.59
  Hostname:    tke-ubuntu-pc
Capacity:
  cpu:                       16
  ephemeral-storage:         118882128Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    32851496Ki
  pods:                      110
  tencent.com/vcuda-core:    0
  tencent.com/vcuda-memory:  0
Allocatable:
  cpu:                       16
  ephemeral-storage:         109561768984
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    32749096Ki
  pods:                      110
  tencent.com/vcuda-core:    0
  tencent.com/vcuda-memory:  0
System Info:
  Machine ID:                 b829848e929e405b849ec3a862ad7542
  System UUID:                eaaa97d6-94eb-b002-1c34-244bfe00f638
  Boot ID:                    51559bd2-92c8-4536-8fb3-2d4bdbbb10f1
  Kernel Version:             5.8.0-50-generic
  OS Image:                   Ubuntu 20.10
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.6
  Kubelet Version:            v1.21.0
  Kube-Proxy Version:         v1.21.0
PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24
Non-terminated Pods:          (12 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-kube-controllers-b656ddcfc-vbrzw       0 (0%)        0 (0%)      0 (0%)           0 (0%)         98m
  kube-system                 calico-node-jwp2q                             250m (1%)     0 (0%)      0 (0%)           0 (0%)         98m
  kube-system                 coredns-558bd4d5db-fmrs4                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     99m
  kube-system                 coredns-558bd4d5db-ksvzf                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     99m
  kube-system                 etcd-tke-ubuntu-pc                            100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         99m
  kube-system                 gpu-manager-daemonset-fgb67                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         48m
  kube-system                 kube-apiserver-tke-ubuntu-pc                  250m (1%)     0 (0%)      0 (0%)           0 (0%)         99m
  kube-system                 kube-controller-manager-tke-ubuntu-pc         200m (1%)     0 (0%)      0 (0%)           0 (0%)         99m
  kube-system                 kube-proxy-4tgqq                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         99m
  kube-system                 kube-scheduler-tke-ubuntu-pc                  100m (0%)     0 (0%)      0 (0%)           0 (0%)         29m
  kubernetes-dashboard        dashboard-metrics-scraper-5594697f48-8cspl    0 (0%)        0 (0%)      0 (0%)           0 (0%)         87m
  kubernetes-dashboard        kubernetes-dashboard-57c9bfc8c8-xqjl8         0 (0%)        0 (0%)      0 (0%)           0 (0%)         87m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests    Limits
  --------                  --------    ------
  cpu                       1100m (6%)  0 (0%)
  memory                    240Mi (0%)  340Mi (1%)
  ephemeral-storage         100Mi (0%)  0 (0%)
  hugepages-1Gi             0 (0%)      0 (0%)
  hugepages-2Mi             0 (0%)      0 (0%)
  tencent.com/vcuda-core    0           0
  tencent.com/vcuda-memory  0           0
Events:
  Type    Reason                   Age                From        Message
  ----    ------                   ----               ----        -------
  Normal  Starting                 26m                kubelet     Starting kubelet.
  Normal  NodeHasSufficientPID     25m (x7 over 26m)  kubelet     Node tke-ubuntu-pc status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  25m                kubelet     Updated Node Allocatable limit across pods
  Normal  NodeHasSufficientMemory  25m (x8 over 26m)  kubelet     Node tke-ubuntu-pc status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    25m (x8 over 26m)  kubelet     Node tke-ubuntu-pc status is now: NodeHasNoDiskPressure
  Normal  Starting                 24m                kube-proxy  Starting kube-proxy.
```
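Note that both Capacity and Allocatable report `tencent.com/vcuda-core: 0` and `tencent.com/vcuda-memory: 0`, which by itself explains the `Insufficient` message: the resource names are registered on the node, but zero devices are being advertised to the kubelet. A quick check sketch (the log command uses the pod name from the listing above; adjust if yours differs):

```sh
# Does the node advertise any vcuda capacity yet?
kubectl describe node tke-ubuntu-pc | grep vcuda

# Device-plugin sockets registered with the kubelet; a gpu-manager
# socket should appear here next to kubelet.sock once registration works.
ls /var/lib/kubelet/device-plugins/

# What gpu-manager itself reports while starting up.
kubectl -n kube-system logs gpu-manager-daemonset-fgb67 --tail=50
```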
timozerrer commented 3 years ago

After upgrading gpu-manager to tkestack/gpu-manager:v1.1.4, pods are allocated but crash with:

```
Error: device plugin PreStartContainer rpc failed with err: rpc error: code = Unknown desc = PreStartContainer check failed, failed to read from checkpoint file due to json: cannot unmarshal object into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type []string
```
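The checkpoint this error refers to is the kubelet device-manager state file at `/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint`. A hedged recovery sketch, assuming you can tolerate the kubelet rebuilding its device-plugin state on this node:

```sh
# Inspect the checkpoint that fails to unmarshal.
python3 -m json.tool /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

# Move the checkpoint aside and let the kubelet regenerate it.
# Device allocations recorded on this node are rebuilt, so running
# GPU pods may need to be recreated afterwards.
systemctl stop kubelet
mv /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint /tmp/
systemctl start kubelet
```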

swartz-k commented 3 years ago

It looks like gpu-manager's scheduling side failed. In my situation, a restart helped.
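A minimal restart sketch along those lines, using the names visible in the node description above (the DaemonSet name is inferred from the pod name and may differ in your deployment):

```sh
# Delete the gpu-manager pod; the DaemonSet recreates it.
kubectl -n kube-system delete pod gpu-manager-daemonset-fgb67

# Or restart the whole DaemonSet.
kubectl -n kube-system rollout restart daemonset/gpu-manager-daemonset
```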

ruankee commented 3 years ago

> After upgrading gpu-manager to tkestack/gpu-manager:v1.1.4, pods are allocated but crash with: `Error: device plugin PreStartContainer rpc failed with err: rpc error: code = Unknown desc = PreStartContainer check failed, failed to read from checkpoint file due to json: cannot unmarshal object into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type []string`

Same issue. Have you solved it?