cailun01 closed this issue 3 years ago
Question 1: that's fine.
Question 2: if there is no "vcuda" string there, gpu-manager is not working properly.
Question 3: from the README:
"To compare with the combination solution of nvidia-docker and nvidia-k8s-plugin, GPU manager will use native runc without modification but nvidia solution does. Besides we also support metrics report without deploying new components."
gpu-manager uses the native runc, not the runc that ships with nvidia-container-runtime; changing the runtime will break it.
Question 4: vcuda-controller is built into gpu-manager.
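As a quick sanity check (assuming the Docker CLI is available on the GPU node), you can confirm which runtime dockerd actually uses:

# gpu-manager expects the native runc; if this prints "nvidia",
# nvidia-container-runtime has taken over as the default runtime
docker info --format '{{.DefaultRuntime}}'

# list every registered runtime for completeness
docker info --format '{{json .Runtimes}}'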
@mYmNeo Thanks for the reply!
I switched to the native runc by modifying /etc/docker/daemon.json as follows:
{
"log-level": "debug",
"live-restore": true,
"icc": false,
"storage-driver": "overlay",
"insecure-registries": ["qce-reg.nucpoc.com"],
"log-driver": "json-file",
"log-opts": {
"max-size": "512m",
"max-file": "3"
}
}
Then I restarted docker and kubelet: systemctl daemon-reload && systemctl restart docker kubelet.
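A sanity check worth doing at this point (standard docker/kubectl commands; the gpu-manager pod name and namespace depend on how it was deployed):

# dockerd should now report the native runc as its default
docker info --format '{{.DefaultRuntime}}'

# gpu-manager must also come back healthy after the runtime switch
kubectl -n kube-system get pods | grep gpu-manager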
After recreating the Pod, its status is Error:
[root@master yamls]# kubectl get pods | grep vcuda
vcuda 0/1 Error 0 99s
[root@master yamls]# kubectl describe pod/vcuda
Name: vcuda
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: node8/172.18.0.22
Start Time: Mon, 19 Apr 2021 15:17:02 +0800
Labels: <none>
Annotations: tencent.com/vcuda-core-limit: 50
Status: Failed
IP: 172.16.225.106
Containers:
nvidia:
Container ID: docker://6034ea380768d57347c4b1f32405d57cb6e7109edfb02c7b9280ef97e109650f
Image: tensorflow/1.13.1:new
Image ID: docker://sha256:f96f1993a92ce7bacda23b6c52e46d9912ce2ecea49a57e054befc106b422f48
Port: <none>
Host Port: <none>
Command:
/usr/bin/nvidia-smi
pmon
-d
10
State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 19 Apr 2021 15:17:43 +0800
Finished: Mon, 19 Apr 2021 15:17:43 +0800
Ready: False
Restart Count: 0
Limits:
tencent.com/vcuda-core: 50
tencent.com/vcuda-memory: 30
Requests:
tencent.com/vcuda-core: 50
tencent.com/vcuda-memory: 30
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-fxmvd (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-fxmvd:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-fxmvd
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 7s kubelet, node8 Container image "tensorflow/1.13.1:new" already present on machine
Normal Created 7s kubelet, node8 Created container
Normal Started 6s kubelet, node8 Started container
Is there any way to track down the cause of this Error? Thanks!
One more question: does gpu-admission need to be installed separately?
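For the Error itself, the usual first steps would be the following (plain kubectl commands; the gpu-manager pod name is a placeholder to look up with the grep above):

# stdout/stderr of the failed container (exit code 1)
kubectl logs vcuda

# full event history, in case describe truncated it
kubectl get events --field-selector involvedObject.name=vcuda

# gpu-manager's own log often shows why vcuda injection failed
kubectl -n kube-system logs <gpu-manager-pod>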
Hello! I ran a ResNet-50 model with TensorFlow and found that GPU memory and GPU utilization were not limited at all; both were almost fully consumed. The output of nvidia-smi is as follows. A few questions:
Question 1:
My cluster has multiple nodes; the master node has no GPU and only node8 does. I therefore pinned the Pod to node8 via the nodeName field in the YAML file. Is that the right approach?
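For reference, a minimal manifest along those lines, reconstructed from the kubectl describe output earlier in this thread (nodeName is one valid way to pin the pod; a nodeSelector on a GPU label would also work):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vcuda
  annotations:
    tencent.com/vcuda-core-limit: "50"
spec:
  nodeName: node8                      # pin the pod to the only GPU node
  containers:
  - name: nvidia
    image: tensorflow/1.13.1:new
    command: ["/usr/bin/nvidia-smi", "pmon", "-d", "10"]
    resources:
      requests:
        tencent.com/vcuda-core: "50"   # 50% of one GPU's compute
        tencent.com/vcuda-memory: "30" # memory, in 256MiB units
      limits:
        tencent.com/vcuda-core: "50"
        tencent.com/vcuda-memory: "30"
EOF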
Question 2:
I inspected node8 with kubectl describe nodes node8 and found that it has neither a tencent.com/vcuda-core nor a tencent.com/vcuda-memory field. Could this be why GPU Manager is not taking effect?
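A direct way to check whether gpu-manager's device plugin registered those extended resources with the kubelet (a suggestion, standard kubectl):

# both resources should appear under Capacity/Allocatable once
# the device plugin has registered
kubectl get node node8 -o jsonpath='{.status.allocatable}'
kubectl describe node node8 | grep -i vcuda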
Question 3:
In the issue "Can't limit GPU utilization" I saw the statement "nvidia-docker as container runtime will ruin the limitation function" (@mYmNeo), i.e. using nvidia-docker as the container runtime breaks the limiting feature, but the README says nothing about this. Do I need to change the runtime?
Question 4:
Does limiting the utilization of a single GPU require installing vcuda-controller separately?
My configuration file:
Output of kubectl describe nodes node8:
Runtime environment: