tkestack / gpu-manager


Failed to limit GPU memory and GPU utilization #78

Closed cailun01 closed 3 years ago

cailun01 commented 3 years ago

Hello! I ran a ResNet-50 model with TensorFlow and found that GPU memory and GPU utilization were not limited: both were almost fully consumed. The output of nvidia-smi is as follows:

Tue Apr 13 16:57:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.36   Driver Version: 440.36       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 22%   47C    P2   246W / 250W |  10869MiB / 11176MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25326      C   python3                                    10859MiB |
+-----------------------------------------------------------------------------+

I have a few questions.

Question 1

My cluster has several nodes; the master node has no GPU, and only node8 has one. I therefore pinned the Pod to node8 with the nodeName field in the YAML file. Is this the right way to do it?

# pod.yaml
nodeName: node8

Question 2

Looking at node8 with kubectl describe nodes node8, I noticed that neither tencent.com/vcuda-core nor tencent.com/vcuda-memory appears among its resources. Is this why GPU Manager is not taking effect? (A quick grep check is sketched after the output below.)

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                17250m (71%)   18212m (75%)
  memory             31334Mi (48%)  35146Mi (54%)
  ephemeral-storage  0 (0%)         0 (0%)
  nvidia.com/gpu     1              1
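
For reference, this is a quick way to check whether gpu-manager has registered its extended resources on the node. I am assuming here that a working gpu-manager advertises tencent.com/vcuda-core and tencent.com/vcuda-memory under Capacity/Allocatable; the values below are illustrative, not taken from my cluster:

kubectl describe node node8 | grep -i vcuda
# expected on a node where gpu-manager works, roughly:
#   tencent.com/vcuda-core:    100
#   tencent.com/vcuda-memory:  42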

Question 3

In the issue "Can't limit GPU utilization" I saw the statement "nvidia-docker as container runtime will ruin the limitation function" (@mYmNeo), i.e. using nvidia-docker as the container runtime breaks the limiting feature, but the README says nothing about this. Do I need to change the runtime?
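
A quick way to see which runtime Docker currently uses by default (nothing gpu-manager-specific, just plain docker info):

docker info | grep -i runtime
# "Default Runtime: runc"   -> native runc
# "Default Runtime: nvidia" -> nvidia-container-runtime is the default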

Question 4

Do I need to install vcuda-controller separately in order to limit the utilization of a single GPU?

My configuration file:

# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: vcuda
  annotations:
    tencent.com/vcuda-core-limit: "50"
spec:
  restartPolicy: Never
  containers:
  - image: tensorflow/tensorflow:1.13.1
    name: nvidia
    command:
    - /usr/bin/nvidia-smi
    - pmon
    - -d
    - "10"
    resources:
      requests:
        tencent.com/vcuda-core: "50"
        tencent.com/vcuda-memory: "30"
      limits:
        tencent.com/vcuda-core: "50"
        tencent.com/vcuda-memory: "30"
  nodeName: node8
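
The Pod is created in the usual way, nothing special:

kubectl apply -f pod.yaml
kubectl get pod vcuda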

Output of kubectl describe nodes node8:

Name:               node8
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=node8
                    nvidia-device-enable=enable
                    nvidia.com/type=1080Ti
Annotations:        csi.volume.kubernetes.io/nodeid: {"cephfs.csi.ceph.com":"node8","rbd.csi.ceph.com":"node8"}
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 04 Dec 2019 09:57:26 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 13 Apr 2021 16:55:48 +0800   Mon, 12 Apr 2021 16:57:26 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 13 Apr 2021 16:55:48 +0800   Mon, 12 Apr 2021 16:57:26 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 13 Apr 2021 16:55:48 +0800   Mon, 12 Apr 2021 16:57:26 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 13 Apr 2021 16:55:48 +0800   Mon, 12 Apr 2021 16:57:26 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.18.0.22
  Hostname:    node8
Capacity:
 cpu:                24
 ephemeral-storage:  204700Mi
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65879720Ki
 nvidia.com/gpu:     1
 pods:               110
Allocatable:
 cpu:                24
 ephemeral-storage:  193179156161
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65777320Ki
 nvidia.com/gpu:     1
 pods:               110
System Info:
 Machine ID:                 ef427f5b7f054701b7ac7bc12e5e49ec
 System UUID:                23bbab55-338b-11e7-9c43-bc0000de0000
 Boot ID:                    6414d1d2-891d-4fa6-b064-c24c0ada6b56
 Kernel Version:             4.19.46
 OS Image:                   CentOS Linux 7 (Core)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.2
 Kubelet Version:            v1.13.5
 Kube-Proxy Version:         v1.13.5
PodCIDR:                     172.16.6.0/24
Non-terminated Pods:         (41 in total)
  Namespace                  Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                       ------------  ----------  ---------------  -------------  ---
  default                    csi-cephfs-ceph-csi-cephfs-nodeplugin-dzf9t                0 (0%)        0 (0%)      0 (0%)           0 (0%)         78d
  default                    csi-rbd-ceph-csi-rbd-nodeplugin-gwm9d                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         78d
  default                    nginx-ingress-controller-default-thbv9                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d1h
  default                    qce-postgres-stolon-keeper-0                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         125d
  default                    qce-postgres-stolon-proxy-789d97b757-r8nk5                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         125d
  default                    qce-postgres-stolon-sentinel-594ddcfcbf-p7qrz              0 (0%)        0 (0%)      0 (0%)           0 (0%)         125d
  default                    spark-worker-1                                             2 (8%)        2 (8%)      4Gi (6%)         4Gi (6%)       146d
  default                    vcuda                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         53m
  h986                       h986wiscan4-656f47bc4d-t2tnw                               1 (4%)        1 (4%)      4Gi (6%)         4Gi (6%)       50d
  kube-system                alert-apiserver-7b5c86d7ff-kd56t                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         97d
  kube-system                calico-node-6zwzf                                          250m (1%)     0 (0%)      0 (0%)           0 (0%)         40d
  kube-system                calicotlb-compute-agent-n9n7g                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         155d
  kube-system                coredns-547994f89-l46v9                                    100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     83d
  kube-system                kube-proxy-xs4cb                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         496d
  kube-system                logkit-9clcj                                               100m (0%)     512m (2%)   128Mi (0%)       2Gi (3%)       496d
  kube-system                nvidia-device-plugin-daemonset-d6dqv                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         496d
  kube-system                prometheus-operator-prometheus-node-exporter-7v9g9         100m (0%)     1 (4%)      256Mi (0%)       2Gi (3%)       83d
  kube-system                tiller-deploy-555696dfc8-k8rm2                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         81d
  kube-system                volume-exporter-k94t6                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         83d
  mysql8re                   deploy-mysqlrenuc-lz03-yoc664wh-86765dfdb4-2pl2x           200m (0%)     200m (0%)   800Mi (1%)       800Mi (1%)     8d
  mysql8re                   mysqlnuc-cluster-portal-77f48bd666-l5r9m                   100m (0%)     100m (0%)   200Mi (0%)       200Mi (0%)     12d
  op                         mysqlnuc-operator-667ddbc585-rsqm7                         300m (1%)     300m (1%)   500Mi (0%)       500Mi (0%)     34d
  qce                        qce-postgres-stolon-keeper-1                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         50d
  qce                        qce-postgres-stolon-proxy-65884b6cd4-nqsbm                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         50d
  qce                        qce-postgres-stolon-sentinel-5b76db7bb-nvz2w               0 (0%)        0 (0%)      0 (0%)           0 (0%)         50d
  qiniu-mongors              deploy-mgors-mongo-6b96df7c49-kg74r                        200m (0%)     200m (0%)   800Mi (1%)       800Mi (1%)     15d
  qiniu-mongors              deploy-mgorsp-ke-mongors-cluster-7cf474cb8-8hlc7           100m (0%)     100m (0%)   200Mi (0%)       200Mi (0%)     81d
  qiniu-mongors              mongors-operator-65df599b-kt5xb                            300m (1%)     300m (1%)   500Mi (0%)       500Mi (0%)     120d
  qiniu-mysql                statefulset-mysqlf-0bkopgdta860g0gm3a21-2                  1300m (5%)    1300m (5%)  2124Mi (3%)      2124Mi (3%)    8d
  qiniu-mysql                statefulset-mysqlf-ke-mysql-cluster-1                      500m (2%)     500m (2%)   1600Mi (2%)      1600Mi (2%)    7d4h
  qiniu-mysql                statefulset-mysqlf-vqo1vfeta860g0of3hq1-2                  1300m (5%)    1300m (5%)  2124Mi (3%)      2124Mi (3%)    22h
  qiniu-rabbitmq             statefulset-rmqi-ke-rabbitmq-cluster-0                     500m (2%)     500m (2%)   1300Mi (2%)      1300Mi (2%)    82d
  qiniu-redis                deploy-redisportal-0bhkp2eta860g0gm3ae0-d8898fc86-v85mk    100m (0%)     100m (0%)   200Mi (0%)       200Mi (0%)     36d
  qiniu-redis                redisdata-operator-cdd96dd96-wnrqt                         300m (1%)     300m (1%)   500Mi (0%)       500Mi (0%)     20h
  qiniu-redis                statefulset-redis-ke-redis-cluster-stl-1                   500m (2%)     500m (2%)   1600Mi (2%)      1600Mi (2%)    40h
  test                       emqx-test-1                                                1 (4%)        1 (4%)      1Gi (1%)         1Gi (1%)       49d
  test                       fortune                                                    2 (8%)        2 (8%)      2Gi (3%)         2Gi (3%)       41d
  test                       mysql-7888cff686-kjrp8                                     1 (4%)        1 (4%)      1Gi (1%)         1Gi (1%)       50d
  test                       web-statefulset-0                                          1 (4%)        1 (4%)      1Gi (1%)         1Gi (1%)       35d
  xcan                       mysql2-6bbdf7c49c-sh5w8                                    2 (8%)        2 (8%)      4Gi (6%)         4Gi (6%)       81d
  xcan                       xcanopnapi-5dfd86c74b-gxsnh                                1 (4%)        1 (4%)      1Gi (1%)         1Gi (1%)       50d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                17250m (71%)   18212m (75%)
  memory             31334Mi (48%)  35146Mi (54%)
  ephemeral-storage  0 (0%)         0 (0%)
  nvidia.com/gpu     1              1
Events:              <none>

Environment:

kubernetes: v1.13.5
docker: 17.03.2-ce
tensorflow: 1.13.1
GPU: GTX 1080Ti
CUDA: 10.0
mYmNeo commented 3 years ago

Question 1: That is fine.

Question 2: If the vcuda resources do not appear, it means gpu-manager is not working properly.

Question 3: From the README:

To compare with the combination solution of nvidia-docker and nvidia-k8s-plugin, GPU manager will use native runc without modification but nvidia solution does. Besides we also support metrics report without deploying new components.

gpu-manager relies on the native runc, not the runc wrapper shipped with nvidia-container-runtime; switching the runtime will break the limiting feature.

Question 4: vcuda-controller is built into gpu-manager.
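
For illustration only: this is roughly the kind of configuration that nvidia-docker adds to /etc/docker/daemon.json. If entries like these are present, the default runtime is no longer the native runc (the exact content depends on how nvidia-docker was installed):

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}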

cailun01 commented 3 years ago

@mYmNeo Thanks for the reply! I switched to the native runc by modifying /etc/docker/daemon.json, which now looks like this:

{
  "log-level": "debug",
  "live-restore": true,
  "icc": false,
  "storage-driver": "overlay",
  "insecure-registries": ["qce-reg.nucpoc.com"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "512m",
    "max-file": "3"
  }
}

Then I restarted Docker and the kubelet: systemctl daemon-reload && systemctl restart docker kubelet

After recreating the Pod, its status is Error:

[root@master yamls]# kubectl get pods | grep vcuda
vcuda                                           0/1     Error     0          99s
[root@master yamls]# kubectl describe pod/vcuda
Name:               vcuda
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               node8/172.18.0.22
Start Time:         Mon, 19 Apr 2021 15:17:02 +0800
Labels:             <none>
Annotations:        tencent.com/vcuda-core-limit: 50
Status:             Failed
IP:                 172.16.225.106
Containers:
  nvidia:
    Container ID:  docker://6034ea380768d57347c4b1f32405d57cb6e7109edfb02c7b9280ef97e109650f
    Image:         tensorflow/1.13.1:new
    Image ID:      docker://sha256:f96f1993a92ce7bacda23b6c52e46d9912ce2ecea49a57e054befc106b422f48
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/nvidia-smi
      pmon
      -d
      10
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 19 Apr 2021 15:17:43 +0800
      Finished:     Mon, 19 Apr 2021 15:17:43 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      tencent.com/vcuda-core:    50
      tencent.com/vcuda-memory:  30
    Requests:
      tencent.com/vcuda-core:    50
      tencent.com/vcuda-memory:  30
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fxmvd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-fxmvd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fxmvd
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason   Age   From            Message
  ----    ------   ----  ----            -------
  Normal  Pulled   7s    kubelet, node8  Container image "tensorflow/1.13.1:new" already present on machine
  Normal  Created  7s    kubelet, node8  Created container
  Normal  Started  6s    kubelet, node8  Started container

Is there a way to find out what caused the Error? Thanks!
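
So far I have only looked at the Pod events above; I assume the next things to check are the container's own logs and the gpu-manager logs, e.g.:

kubectl logs vcuda
# or, directly on node8, inspect the exited container:
docker logs 6034ea380768
# and find the gpu-manager pod (assuming it is deployed in kube-system):
kubectl -n kube-system get pods | grep gpu-manager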

cailun01 commented 3 years ago

One more question: do I need to install gpu-admission separately?