Open timozerrer opened 3 years ago
After upgrading the scheduler to tkestack/gpu-manager:v1.1.4, pods are allocated but crash with:
Error: device plugin PreStartContainer rpc failed with err: rpc error: code = Unknown desc = PreStartContainer check failed, failed to read from checkpoint file due to json: cannot unmarshal object into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type []string
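The error above says the JSON decoder found an object where the Go field `DeviceIDs` expects `[]string`. One plausible cause, which is an assumption and not confirmed in this thread, is that the checkpoint file was written with `DeviceIDs` as an object keyed by NUMA node rather than as a flat array. The following Go sketch reproduces that class of error and shows a tolerant decoder; `decodeDeviceIDs` is a hypothetical helper for illustration, not part of gpu-manager or the kubelet.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// decodeDeviceIDs is a hypothetical helper that accepts two layouts for a
// DeviceIDs value:
//   - a flat JSON array of device ID strings, or
//   - a JSON object mapping a NUMA node ID to an array of device IDs
//     (assumed layout, not confirmed by the thread).
// It returns a flat slice of IDs in both cases.
func decodeDeviceIDs(raw []byte) ([]string, error) {
	var flat []string
	if err := json.Unmarshal(raw, &flat); err == nil {
		return flat, nil
	}

	var perNUMA map[int64][]string
	if err := json.Unmarshal(raw, &perNUMA); err != nil {
		return nil, fmt.Errorf("DeviceIDs is neither an array nor an object: %w", err)
	}

	var ids []string
	for _, nodeIDs := range perNUMA {
		ids = append(ids, nodeIDs...)
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Strings(ids)
	return ids, nil
}

func main() {
	objectLayout := []byte(`{"0":["GPU-1"],"1":["GPU-0"]}`)

	// Decoding the object layout straight into []string fails with a
	// "cannot unmarshal object" error, similar to the one reported above.
	var strict []string
	fmt.Println("strict decode error:", json.Unmarshal(objectLayout, &strict))

	// The tolerant decoder handles both layouts.
	ids, err := decodeDeviceIDs(objectLayout)
	fmt.Println(ids, err)
}
```

If this is indeed the cause, the fix belongs in the component that reads the checkpoint file; the sketch only illustrates why a reader built against the array layout rejects the object layout.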
It looks like the gpu-manager scheduler failed. In my situation, restarting it helped.
Same issue. Have you solved it?
Hello,
The pod is pending and fails to be scheduled. I deployed gpu-manager and the gpu-admission controller. Do I need any NVIDIA deployments, or only the CUDA / graphics driver?
Docker: 20.10.6, Kubelet: 1.21.0
Deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mnist-test
  labels:
    app: mnist-test
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: mnist-test
  template: # define the pods specifications
    metadata:
      labels:
        app: mnist-test
```
gpu-manager logs:
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-compiler.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1.7.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-eglcore.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glcore.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-tls.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glsi.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/bin/nvidia-cuda-mps-control to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-cuda-mps-server to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-debugdump to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-persistenced to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-smi to /usr/local/nvidia/bin/
rebuild ldcache
launch gpu manager
E0508 13:51:05.283586 9794 server.go:132] Unable to set Type=notify in systemd service file?
Node description: