The kubelet on the host (dell-63) intermittently fails to communicate with the device plugin that serves the volcano.sh/vgpu-number resource.
Kubelet Logs
Mar 19 14:31:09 dell-63 kubelet[1067]: E0319 14:31:09.987111 1067 endpoint.go:107] "listAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resourceName="volcano.sh/vgpu-number"
Mar 19 14:31:09 dell-63 kubelet[1067]: W0319 14:31:09.987121 1067 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/nvidia-gpu.sock /var/lib/kubelet/device-plugins/nvidia-gpu.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/nvidia-gpu.sock: connect: connection refused". Reconnecting...
Pod Events
Events:
Type     Reason                    Age   From               Message
----     ------                    ----  ----               -------
Normal   Scheduled                 12s   default-scheduler  Successfully assigned crater-jobs/gpu-pod11111 to dell-63
Warning  UnexpectedAdmissionError  12s   kubelet            Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected
Plugin Logs
I0319 14:51:12.329385 1 main.go:77] Loading NVML
I0319 14:51:12.356024 1 main.go:91] Starting FS watcher.
I0319 14:51:12.356095 1 main.go:98] Starting OS watcher.
I0319 14:51:12.369951 1 main.go:116] Retreiving plugins.
I0319 14:51:12.370235 1 register.go:101] into WatchAndRegister
2024/03/19 14:51:12 Starting GRPC server for 'volcano.sh/vgpu-number'
2024/03/19 14:51:12 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/03/19 14:51:12 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
I0319 14:51:12.397816 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:12.397800845 +0000 UTC m=+0.077900234
I0319 14:51:42.431339 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:42.43132017 +0000 UTC m=+30.111419558
I0319 14:52:12.577834 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:52:12.577817165 +0000 UTC m=+60.257916570
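The plugin logs report that the gRPC server started on /var/lib/kubelet/device-plugins/nvidia-gpu.sock and registered with the kubelet, yet the kubelet gets "connection refused" when dialing that socket. Is the socket failing to start without logging any error?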
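One way to narrow this down is to check the socket directly from the node. The script below is a minimal sketch and is not part of the plugin or the kubelet: `probe_unix_socket` is a hypothetical helper, and the path is the one from the logs above. It tells apart a missing socket file from a file that exists but has no listener behind it, which would be one explanation for the "connection refused" in the kubelet log.

```python
import os
import socket
import stat


def probe_unix_socket(path: str, timeout: float = 2.0) -> str:
    """Classify the state of a unix socket path.

    Returns "missing", "not-a-socket", "refused", "accepting",
    or "error: <detail>" for anything else.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        # e.g. permission denied on the device-plugins directory
        return f"error: {exc}"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except ConnectionRefusedError:
            # The file exists but no process is listening on it.
            return "refused"
        except OSError as exc:
            return f"error: {exc}"
    return "accepting"


if __name__ == "__main__":
    # Path taken from the kubelet and plugin logs in this report.
    print(probe_unix_socket("/var/lib/kubelet/device-plugins/nvidia-gpu.sock"))
```

Running it on dell-63 (probably as root, since the directory is owned by the kubelet) while the error is occurring would show whether the plugin is still listening at that moment.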