I0525 09:17:52.971037 30026 main.go:83] Server starting on 127.0.0.1:3456
I0525 09:17:52.971191 30026 reflector.go:175] Starting reflector *v1.Pod (30s) from pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125
I0525 09:17:52.971196 30026 reflector.go:175] Starting reflector *v1.Node (30s) from pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125
I0525 09:17:52.971288 30026 reflector.go:211] Listing and watching *v1.Node from pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125
I0525 09:17:52.971264 30026 reflector.go:211] Listing and watching *v1.Pod from pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125
I0525 09:25:54.025718 30026 reflector.go:496] pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125: Watch close - *v1.Node total 480 items received
I0525 09:27:35.164305 30026 reflector.go:496] pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125: Watch close - *v1.Pod total 1207 items received
I0525 09:34:13.027364 30026 reflector.go:496] pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125: Watch close - *v1.Node total 496 items received
I0525 09:34:46.165966 30026 reflector.go:496] pkg/mod/k8s.io/client-go@v0.18.12/tools/cache/reflector.go:125: Watch close - *v1.Pod total 889 items received
# kubectl logs pod/vcuda
container_linux.go:247: starting container process caused "exec: \"/usr/local/nvidia/bin/nvidia-smi\": stat /usr/local/nvidia/bin/nvidia-smi: no such file or directory"
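Hello, I have recently been trying to install gpu-manager, without success. The pod I create cannot find nvidia-smi, and the GPU node does not have the tencent.com/vcuda-core and tencent.com/vcuda-memory fields. My test environment: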
My master node has no GPU; node8 has a GPU. As the README requires, I made sure the docker runtime on node8 is native runc rather than nvidia-container-runtime. The daemon.json on node8 is as follows:
I installed gpu-admission on the master node. scheduler-policy-config.json is:
kube-scheduler.yaml is:
gpu-admission installed successfully, and I ran it on the master node:
Log:
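Installing gpu-manager.
Because my docker version is fairly old (17.03) and does not support multi-stage builds, I slightly modified the Dockerfile provided by gpu-manager so that the image is built in a single stage:
The build produced the tkestack/gpu-manager image.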
Next I prepared to create the pod. The yaml file is as follows:
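However, the vcuda pod that was created cannot find nvidia-smi. Changing /usr/local/nvidia/bin/nvidia-smi to nvidia-smi or to /usr/bin/nvidia-smi does not help; it is still not found. In addition, when I check the status of node8, it has neither the tencent.com/vcuda-core nor the tencent.com/vcuda-memory field. Which of my steps is wrong? Thank you!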
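
For anyone reproducing this, the missing-resource symptom can be checked mechanically. The following is a minimal sketch, assuming that gpu-manager advertises tencent.com/vcuda-core and tencent.com/vcuda-memory as extended resources in the node's status.allocatable map (the usual place for device-plugin resources). The helper names missing_vcuda_resources and fetch_node are hypothetical and not part of gpu-manager.

```python
import json
import subprocess

# Resource names taken from the question above.
VCUDA_RESOURCES = ("tencent.com/vcuda-core", "tencent.com/vcuda-memory")


def missing_vcuda_resources(node: dict) -> list:
    """Return the vcuda resource names absent from the node's allocatable map."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return [name for name in VCUDA_RESOURCES if name not in allocatable]


def fetch_node(name: str) -> dict:
    """Fetch a node object as JSON via kubectl (needs access to the cluster)."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling missing_vcuda_resources(fetch_node("node8")) from a machine with kubectl access would return an empty list once both resources are registered. A non-empty result suggests that the gpu-manager daemon on that node has not registered its device plugin, which would be the first thing to look for in its logs.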