volcano-sh / devices

Device plugins for Volcano, e.g. GPU
Apache License 2.0

Failure of Device Plugin Communication with Kubernetes Kubelet #59

Closed: MondayCha closed this issue 5 months ago

MondayCha commented 7 months ago

Description:

The kubelet on the host intermittently fails to communicate with the device plugin that serves the volcano.sh/vgpu-number resource. The node still advertises the resource as allocatable, but pods requesting it are rejected at admission.

Description for Node

Allocatable:
  volcano.sh/gpu-memory:   0
  volcano.sh/gpu-number:   2
  volcano.sh/vgpu-number:  8

Kubelet Logs

Mar 19 14:31:09 dell-63 kubelet[1067]: E0319 14:31:09.987111    1067 endpoint.go:107] "listAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resourceName="volcano.sh/vgpu-number"
Mar 19 14:31:09 dell-63 kubelet[1067]: W0319 14:31:09.987121    1067 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/nvidia-gpu.sock /var/lib/kubelet/device-plugins/nvidia-gpu.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/nvidia-gpu.sock: connect: connection refused". Reconnecting...

Pod Events

Events:
  Type     Reason                    Age   From               Message
  ----     ------                    ----  ----               -------
  Normal   Scheduled                 12s   default-scheduler  Successfully assigned crater-jobs/gpu-pod11111 to dell-63
  Warning  UnexpectedAdmissionError  12s   kubelet            Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected

Plugin Logs

 I0319 14:51:12.329385       1 main.go:77] Loading NVML
 I0319 14:51:12.356024       1 main.go:91] Starting FS watcher.
 I0319 14:51:12.356095       1 main.go:98] Starting OS watcher.
 I0319 14:51:12.369951       1 main.go:116] Retreiving plugins.
 I0319 14:51:12.370235       1 register.go:101] into WatchAndRegister
 2024/03/19 14:51:12 Starting GRPC server for 'volcano.sh/vgpu-number'
 2024/03/19 14:51:12 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
 2024/03/19 14:51:12 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
 I0319 14:51:12.397816       1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:12.397800845 +0000 UTC m=+0.077900234
 I0319 14:51:42.431339       1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:42.43132017 +0000 UTC m=+30.111419558
 I0319 14:52:12.577834       1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:52:12.577817165 +0000 UTC m=+60.257916570

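The plugin reports that it is serving on the socket and has registered with the kubelet, yet the kubelet gets "connection refused" on that same socket. Is the socket failing to start without the plugin logging any error?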